Regular Expressions

Regular expressions, a.k.a. regex, are the standard tool for pattern matching. Let’s say you want the user to enter an email address. You’d want to make sure the address starts with alphanumeric characters, maybe 1-10 characters long, followed by an @ symbol, followed by an alphanumeric domain name, maybe 1-10 characters long as well, followed by .com or .net or .org. You would use regex to verify that the input matches.

You can think of regex as a mini-language for describing text patterns, much as SQL is a language for querying databases and not much else. General-purpose languages like Java can be used for almost anything, but regex and SQL are each built for one specific job.

Here is an example of regex for email:

/^[a-zA-Z0-9]{1,10}\@[a-zA-Z0-9]{1,10}\.(com|net|org)$/

That looks very confusing, doesn’t it? But let’s break it down.

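Forward slashes often surround regular expressions. In languages such as JavaScript and Perl they mark where the pattern starts and ends; they are not part of the pattern itself.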

^ is the beginning anchor, meaning the match must start at the beginning of the string. Use it when the pattern has to sit at the very start. If you want to find a pattern at any position in the string, leave it out.
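
To see the anchor in action, here is a minimal sketch in Python, whose re module takes the pattern as a string instead of putting it between slashes. The pattern "cat" and the sample words are made up for illustration and are not part of the email example.

```python
import re

# "^cat" only matches when "cat" sits at the very start of the string.
anchored = re.compile(r"^cat")
# Without the anchor, "cat" may appear at any position.
unanchored = re.compile(r"cat")

starts_with_cat = anchored.search("catalog") is not None
anchored_in_middle = anchored.search("concatenate") is not None
unanchored_in_middle = unanchored.search("concatenate") is not None

print(starts_with_cat)       # True
print(anchored_in_middle)    # False: "cat" is there, but not at the start
print(unanchored_in_middle)  # True
```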

[a-zA-Z0-9] means match one character that is a lowercase letter, an uppercase letter, or a digit, and nothing else.

The {1,10} means the preceding [a-zA-Z0-9] must repeat 1 to 10 times, which gives an alphanumeric string 1-10 characters long.
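
The character class and its length limit can be tried on their own. This is a small Python sketch; the helper name is_short_alnum and the sample strings are assumptions made for the example.

```python
import re

# One to ten characters, each a lowercase letter, uppercase letter, or digit.
alnum_1_to_10 = re.compile(r"[a-zA-Z0-9]{1,10}")

def is_short_alnum(text: str) -> bool:
    # fullmatch requires the whole string to fit the pattern.
    return alnum_1_to_10.fullmatch(text) is not None

print(is_short_alnum("alan42"))       # True: 6 alphanumeric characters
print(is_short_alnum(""))             # False: fewer than 1 character
print(is_short_alnum("abcdefghijk"))  # False: 11 characters
print(is_short_alnum("alan_42"))      # False: "_" is not in the class
```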

\ is used for escaping characters. It means “treat the following character as text, not code.” It’s not necessary for @, but characters that have syntactic meaning, such as . and $, need a backslash in front of them when you want the literal character.

@ is the character that separates the user account from the domain in an email address, as in [email protected]. In this case, you want the literal character @.

The next part, [a-zA-Z0-9]{1,10}, is like before, but this time for the second-level domain rather than the user account portion of the email: 1-10 alphanumeric characters.

Then we have \., which is an escaped period, meaning it’s an actual period rather than a syntactic period (an unescaped . matches any character). This is for the period in something like gmail.com.
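
The difference between an escaped and an unescaped period is easy to check. This Python sketch compares the two; the string gmailXcom is a made-up test input.

```python
import re

# An unescaped "." matches any single character (other than a newline).
loose = re.compile(r"gmail.com")
# "\." matches only a literal period.
strict = re.compile(r"gmail\.com")

samples = ["gmail.com", "gmailXcom"]
loose_results = [loose.fullmatch(s) is not None for s in samples]
strict_results = [strict.fullmatch(s) is not None for s in samples]

print(loose_results)   # [True, True]: the "." also accepted the "X"
print(strict_results)  # [True, False]
```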

Next is the final part: the top-level domain. I kept it simple and listed only three: com, net, and org. It has to be one of those three. The () group the alternatives and separate them from the other parts, and the | means or. So it means com, net, or org.
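
Here is a rough Python sketch of why the parentheses matter. The helper top_level_domain and the sample strings are assumptions for illustration. Without the group, the | splits the whole pattern into three separate alternatives instead of just the three endings.

```python
import re
from typing import Optional

# Grouped: a literal period, then exactly one of com, net, or org, then the end.
grouped = re.compile(r"\.(com|net|org)$")

def top_level_domain(host: str) -> Optional[str]:
    match = grouped.search(host)
    # group(1) is whatever the parentheses captured.
    return match.group(1) if match else None

print(top_level_domain("example.org"))  # org
print(top_level_domain("example.edu"))  # None

# Ungrouped: this reads as "\.com" OR "net" OR "org$".
ungrouped = re.compile(r"\.com|net|org$")
ungrouped_hit = ungrouped.search("network") is not None
grouped_hit = grouped.search("network") is not None

print(ungrouped_hit)  # True: the bare "net" alternative matched
print(grouped_hit)    # False
```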

Lastly, $ is the end anchor, meaning nothing more can come after it. If you want to find something that ends in A, you could use something simple like this:

/A$/
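
In Python, the same check might look like the sketch below. The helper name and sample strings are made up. One Python-specific detail: $ also matches just before a newline at the very end of the string, and \Z is the stricter form.

```python
import re

ends_in_a = re.compile(r"A$")

def ends_in_capital_a(text: str) -> bool:
    return ends_in_a.search(text) is not None

print(ends_in_capital_a("ALPHA"))  # True: the last character is "A"
print(ends_in_capital_a("ABC"))    # False: "A" is present, but not at the end
print(ends_in_capital_a("alpha"))  # False: matching is case-sensitive by default

# Python's "$" tolerates one trailing newline; "\Z" does not.
print(ends_in_capital_a("A\n"))                  # True
print(re.search(r"A\Z", "A\n") is not None)      # False
```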

regex101.com is an excellent site for learning and testing regular expressions. Regex is hard at first, but over time, you’ll get the hang of it.

Here are some examples of things that would match the regex I listed (none of these are real email addresses):

[email protected]

[email protected]

[email protected]

Here are some that would not match:

alangmail.com

alan@gmailcom

@.com

somethingATsomething.com

[email protected]

[email protected]

[email protected]
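
To try the whole pattern yourself, here is a minimal Python sketch. Python takes the pattern as a raw string without the surrounding slashes. The function name is_valid_email and the address alan@gmail.com are assumptions made for this example; the rejected strings are the non-matching examples listed above.

```python
import re

# The email pattern from this section, minus the slash delimiters.
EMAIL = re.compile(r"^[a-zA-Z0-9]{1,10}\@[a-zA-Z0-9]{1,10}\.(com|net|org)$")

def is_valid_email(address: str) -> bool:
    return EMAIL.search(address) is not None

# Hypothetical sample address that fits the pattern.
print(is_valid_email("alan@gmail.com"))  # True

# Non-matching examples from the list above.
rejected = ["alangmail.com", "alan@gmailcom", "@.com", "somethingATsomething.com"]
results = {address: is_valid_email(address) for address in rejected}
print(results)  # every value is False
```

Keep in mind that this pattern is deliberately strict for teaching purposes. Real addresses can contain periods, hyphens, longer names, and other top-level domains, all of which it rejects.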

Congratulations on completing section 3!

You’ve mastered intermediate concepts which are vital for becoming a software developer!
