Python Pattern Matching

If you want to use regular expressions (sometimes called regex), you use the re module in Python. Pattern matching is very important when handling user input.

Here is a simple example of pattern matching for getting a first name from a user. The name must be 2-10 letters long. The first letter can be uppercase or lowercase, and the remaining 1-9 characters must be lowercase. It can’t contain digits or punctuation.

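import re

first_name = input("Please enter your first name: ")

# One letter of either case, followed by 1-9 lowercase letters
name_pattern = re.compile(r"^[A-Za-z]{1}[a-z]{1,9}$")

# match() returns a Match object on success and None on failure
name_matches = name_pattern.match(first_name)

if name_matches:
    print("Your name is acceptable!")
else:
    print("Invalid name.")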

In the example above, ^ means the beginning of the string, [A-Za-z] means any single uppercase or lowercase letter, and {1} means exactly one repetition of the preceding []. (One repetition is the default, so {1} is redundant but harmless.) So at the beginning of the string, there needs to be either an uppercase or lowercase letter. Then, [a-z] means any lowercase letter, and {1,9} means 1-9 repetitions of the preceding bracketed set. In this case, that means 1-9 lowercase letters. Finally, $ means the end of the string. So in total, the pattern we’re searching for is an uppercase or lowercase letter followed by 1-9 lowercase letters. If the input is anything else, the name_pattern.match() method returns None, which the if statement treats as false.

Pattern matching is important when getting user input, because users might not enter what they’re supposed to. This could be due to a harmless accident, or even malicious intent from a hacker who is trying to misuse the code you write that takes user input. You can never trust user input, so pattern matching with regular expressions is crucial.
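To see how the pattern behaves on specific inputs, here is a small sketch that wraps the same regular expression in a function. The function name is_valid_first_name and the sample names are made up for illustration; only the pattern itself comes from the example above.

```python
import re

# Same pattern as in the example: one letter of either case,
# then 1-9 lowercase letters, anchored at both ends.
NAME_PATTERN = re.compile(r"^[A-Za-z]{1}[a-z]{1,9}$")


def is_valid_first_name(name: str) -> bool:
    """Hypothetical helper: True if the name fits the pattern."""
    # match() returns a Match object or None, so compare against None
    return NAME_PATTERN.match(name) is not None


# Sample inputs chosen for illustration
for candidate in ["Alice", "bob", "A", "ALICE", "Alice1", "O'Brian", "Bartholomew"]:
    print(candidate, is_valid_first_name(candidate))
```

Only "Alice" and "bob" pass here. "A" is too short, "ALICE" has uppercase letters after the first character, "Alice1" contains a digit, "O'Brian" contains punctuation, and "Bartholomew" has 11 letters.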

A common mistake is to make your pattern overly permissive. It will pass tests that check whether valid inputs are accepted, but it will also accept things that should be rejected. This weakness is known as an overly permissive regular expression, and you can find information about it on security-centric sites such as the OWASP wiki.
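As a rough illustration of an overly permissive pattern, the sketch below compares three checks on the same letter rules. The function names and the sample inputs are hypothetical. The first check forgets the anchors and uses search(), so any input that merely contains a valid-looking name is accepted. The second uses ^ and $ with match(), which still accepts one trailing newline because $ also matches just before a newline at the end of the string. The third uses fullmatch(), which requires the whole string to fit.

```python
import re

LETTERS = r"[A-Za-z][a-z]{1,9}"

UNANCHORED = re.compile(LETTERS)            # no ^ or $
ANCHORED = re.compile("^" + LETTERS + "$")  # anchored with ^ and $
WHOLE = re.compile(LETTERS)                 # used with fullmatch()


def loose_check(name: str) -> bool:
    # Overly permissive: finds the pattern anywhere inside the input
    return UNANCHORED.search(name) is not None


def anchored_check(name: str) -> bool:
    # Better, but $ still matches just before a single trailing newline
    return ANCHORED.match(name) is not None


def strict_check(name: str) -> bool:
    # Strictest: the entire string must fit the pattern
    return WHOLE.fullmatch(name) is not None


# Sample inputs chosen for illustration
for sample in ["Alice", "Alice\n", "Robert'); DROP TABLE users;--"]:
    print(repr(sample), loose_check(sample), anchored_check(sample), strict_check(sample))
```

All three checks accept "Alice", but only strict_check rejects both of the other inputs. A test suite that only feeds in valid names would never notice the difference.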

You might think “well nobody’s going to hack my app, so who cares?” but that’s the wrong way to think about it. It’s true that the vast majority of the users of your site or app will be nice, but a few bad people ruin it for everyone. All it takes is one bad person to hack your site and perform code injection that can lead to things like a data breach or a reverse shell where they can execute arbitrary commands, possibly using your server to launch attacks against others.

It doesn’t matter if 99.9999% of your software’s users are nice. Your code needs to be prepared to account for the 0.0001% who will try to hack you. And many attacks are automated with software rather than being carried out manually.

Another concept related to user input is escaping input. If you’ve ever seen a URL with something like %20 in it, that’s escaping input. That particular kind of escaping is called percent encoding, and %20 means space. The reason this is done is so that user input is treated as data rather than code, because programming languages use certain characters as syntax. Someone whose last name is O’Brian might notice that some sites won’t accept their name, because those sites don’t want anyone to attempt code injection. A string input might be delimited with ‘ at the start and end, so O’Brian could end the string after the O and possibly allow code injection afterwards, which is why many simpler input validation systems disallow the apostrophe entirely.

Also, keep in mind that input validation must be performed server-side rather than only client-side, as client-side software can be modified to get rid of the client-side validation, such as by disabling JavaScript in a browser or using a tool like Tamper Data (or even just the developer tools built into browsers).
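To make the O’Brian case concrete, here is a minimal sketch using only the standard library. The first part percent-encodes the name with urllib.parse.quote. The second part is a swapped-in technique that the text above does not cover: a parameterized query with sqlite3, which is a common way to keep the apostrophe as data without rewriting the string at all. The table name, column name, and in-memory database are assumptions made up for illustration.

```python
import sqlite3
from urllib.parse import quote, unquote

last_name = "O'Brian"

# Percent encoding: the apostrophe becomes %27 and a space becomes %20
encoded = quote(last_name)
decoded = unquote(encoded)
print(encoded)              # O%27Brian
print(quote("first name"))  # first%20name

# Parameterized query (assumed schema, in-memory database):
# the ? placeholder sends the value separately from the SQL text,
# so the apostrophe cannot terminate a string literal in the statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (last_name TEXT)")
conn.execute("INSERT INTO users (last_name) VALUES (?)", (last_name,))
stored = conn.execute("SELECT last_name FROM users").fetchone()[0]
conn.close()
print(stored)               # O'Brian
```

Either way, the name round-trips unchanged. The risky alternative is building the SQL statement by pasting the user’s text directly between quote characters.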

Some sites can accept things like an apostrophe in a last name like O’Brian, because they perform input escaping rather than just discarding anything with punctuation. Then you can have O’Brian as a name, but in the database the apostrophe might be represented with %27, which is the URL encoding for an apostrophe. This way, there is a distinction between the string data and code to be executed. However, even escaped input can sometimes be used for deserialization attacks, which happen when data is taken from a byte stream and turned into an object state.

Stored data can also cause trouble when it is displayed. For example, I once attended a workshop about SQL injection and I managed to do something above and beyond the scope of the workshop by putting JavaScript in the database that would run when a user viewed a certain page that fetched data from the database. This kind of attack is known as stored cross-site scripting (XSS). Something being safe in a database doesn’t necessarily mean it’ll be safe when it’s shown to the user.
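The last point, that data can be safe in storage but unsafe on display, can be sketched with html.escape from the standard library. The render_comment function and the payload string are hypothetical; a real site would more likely rely on a template engine that escapes output automatically.

```python
from html import escape


def render_comment(comment: str) -> str:
    """Hypothetical helper: wrap stored text in a paragraph tag.

    escape() turns <, >, &, and quote characters into HTML entities,
    so the browser shows the text instead of running it as markup.
    """
    return "<p>" + escape(comment) + "</p>"


# A value that could sit harmlessly in a database column
payload = "<script>alert('hi')</script>"

print(render_comment(payload))
print(render_comment("O'Brian"))
```

The script tag comes out as &lt;script&gt;, so it is displayed as text rather than executed, and O’Brian still renders with an apostrophe in the browser.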
