Python RegEx – Regular Expression Tutorial With Examples

By Sruthy

Updated March 7, 2024

This Python RegEx tutorial explains what a regular expression is in Python and how to use it, with the help of programming examples:

Searching for characters in a string is one of the most common tasks when working with strings in any programming language. In Python, we can use the equality operator (==) to check whether two strings are equal. We can also use the in operator to find out whether a string contains a specific substring.

Also, in order to get the position of the characters we are looking for, Python has the .find() and .index() built-in string methods that we can use.
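
As a quick illustration of these built-in tools, here is a minimal sketch. The sample sentence and the variable names are made up for this example and are not part of the original tutorial.

```python
# Plain string operations; no regex involved.
sentence = "my cat is at rest on the mat"

is_equal = (sentence == "my cat")   # equality compares the whole strings
has_cat = "cat" in sentence         # substring membership test
cat_pos = sentence.find("cat")      # index of the first occurrence
dog_pos = sentence.find("dog")      # find() returns -1 when nothing is found

try:
    sentence.index("dog")           # index() raises ValueError instead
    index_raised = False
except ValueError:
    index_raised = True

print(is_equal, has_cat, cat_pos, dog_pos, index_raised)
```

The main practical difference between the two methods is how they signal a missing substring: find() returns -1, while index() raises a ValueError.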

In most text editors, we can search for a word by pressing CTRL-F and typing the word we are looking for. However, all of these techniques can only find text that we already know exactly.

For example, say you want to match any phone number in a text. Using the techniques above will require you to have a list of all phone numbers you want to match. In practice, this is not robust. This is where Regular Expression comes in.

In this tutorial, we shall look at this game-changer called Regular Expression or Regex in short.

What is Python RegEx

Before we actually look at regular expressions, let’s see how we can find a phone number in a string without the use of a regular expression.

Say we want to find a US phone number that has the following patterns:

  • Optional country code, in this case the USA code: +1
  • Optional area code, in this case the Washington local code: 202
  • A hyphen followed by three digits.
  • Another hyphen followed by four digits.
  • A total of seven digits, excluding the country and area codes.

Using the patterns above, 202-555-0164 is a valid Washington local phone number and 202555,0164 is invalid. Let’s write a piece of Python code to find any Washington local phone number in a text.

Example 1: Validate Washington phone number without Regex.

def isWashingtonPhoneNumber(text):
    # check that length is 12
    if len(text) != 12:
        return False
    # check that the area code is 202
    if text[0:3] != '202':
        return False
    # check for hyphen
    if text[3] != '-':
        return False
    # check that the characters after first hyphen are numeric characters
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    # check for hyphen
    if text[7] != '-':
        return False
    # check that the characters after the hyphen are numeric characters
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True


if __name__ == '__main__':
    phone1 = '202-555-0164'
    phone2 = '202555,0164'

    print("{} is valid: {}".format(phone1, isWashingtonPhoneNumber(phone1)))
    print("{} is valid: {}".format(phone2, isWashingtonPhoneNumber(phone2)))

Output:

202-555-0164 is valid: True
202555,0164 is valid: False

The above code works and correctly validates both phone numbers. Now suppose we want to run this function on a text like:

"Hello, thank you for your help. You can reach my cell phone at 202-555-0164 and via email at petter.cool@gmail.com"

Then, we can do it like this.

Example 2: Find valid Washington phone numbers in a text.

Save the code from Example 1 above in a file named valid.py. Then open a new file in your editor in the same directory and paste in the following code.

from valid import isWashingtonPhoneNumber

def find_in_text(text):
    for i in range(len(text)):
        chunk = text[i:i+12]
        if isWashingtonPhoneNumber(chunk):
            print("Found valid phone number: {}".format(chunk))
    print("Finished!")


if __name__ == '__main__':
    text = "Hello, thank you for your help. You can reach my cell phone at 202-555-0164 and via email at petter.cool@gmail.com"
    find_in_text(text)

Output:

Found valid phone number: 202-555-0164
Finished!

The code above goes through the text and extracts each chunk of 12 characters, which is passed to our isWashingtonPhoneNumber function to determine whether it is a valid Washington number. The first chunk is “Hello, thank”, the second chunk is “ello, thank ”, and so on to the end of the text. Every chunk that is a valid phone number is printed.

The examples above work fine, but there are some limitations. They use a lot of code to validate a single pattern. What about other valid patterns like (202)555-0164 or +1-202-555-0164? To adapt our code to validate these extra patterns, we would have to write even more code. But there is a better and easier way.

A regular expression, also called a regex, is a special sequence of characters that defines a pattern for complex string-matching functionality. Before we get into the various components of regex, let’s solve the problem described above.

Example 3: Detect valid phone numbers with regex.

import re # import the re module

def find_with_regex(regex, text):
    matches = []
    # find all matching patterns
    for group in regex.findall(text):
        phoneNum = ''.join(group)
        matches.append(phoneNum)
        
    print("All Valid phone numbers are: ")
    print(matches)

if __name__ == '__main__':
    # define regex patterns
    regex = re.compile(r"(\+1-)?(\(?202\)?)(-)?(\d{3})(-)(\d{4})\b")
    text = "Thank you for your support. You can reach me at: 202-555-0164 or (202)555-0164 or +1-202-555-0164"

    find_with_regex(regex, text)

Output

All Valid phone numbers are:
['202-555-0164', '(202)555-0164', '+1-202-555-0164']

Python Regular Expression Patterns

Regex in Python is rich with patterns that can enable us to build complex string-matching functionality.

Before we dive deep, let’s analyze the search() method of a regex object, which is provided by the re module. It searches through the string to locate the first occurrence that matches the regular expression pattern. It returns None if no match is found, or a match object if a match is found.

The returned match object has a group() method that can be used to obtain the actual match text. We shall see more of the re module methods in the section below.
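
The behaviour described above can be sketched as follows. The pattern and the sample sentences are assumptions chosen for illustration only.

```python
import re

# Three digits, a hyphen, then four digits.
pattern = re.compile(r"\d{3}-\d{4}")

found = pattern.search("Call 555-0164 today")
missing = pattern.search("No number here")

if found is not None:
    print(found.group())   # the text that matched
    print(found.span())    # (start, end) indices of the match
print(missing)             # None, because nothing matched
```

Because search() may return None, it is safest to check the result before calling group() on it.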

Character Classes

Given below are the most common character classes in Python.

Table 1: Character Classes

Character Class    Description
.                  Also called the wildcard character. It matches any character except a newline
\w                 Matches a letter, numeric digit, or underscore character; [a-zA-Z0-9_]
\W                 Matches anything that is not a letter, numeric digit, or underscore character; [^a-zA-Z0-9_]
\d                 Matches a numeric digit; [0-9]
\D                 Matches any character that is not a numeric digit; [^0-9]
\s                 Matches a whitespace character; [ \t\n\r\f\v]
\S                 Matches a non-whitespace character; [^ \t\n\r\f\v]

The bracketed equivalents describe ASCII text. For str patterns, Python 3 also matches the corresponding Unicode characters unless the re.ASCII flag is used.

Let’s explain with examples what these character classes are.

#1) The Wildcard character

The Wildcard character in Python regex is represented by the dot(.) character. It matches any character except a newline.

Example 4: Match all except newline character(\n)

>>> atrex = re.compile(r'.at')
>>> atrex.findall('my cat is \nat rest on the mat')
['cat', 'mat']

#2) \w, \W

\w matches any letter, numeric digit, or underscore character. It is similar to matching everything in [a-zA-Z0-9_].

\W is the opposite of \w i.e. it matches any character that is not a numeric digit, a letter, or the underscore.

Example 5: Match the word and nonword character.

>>> re.search(r'\w', '(*.)$_#%') # match word character
<re.Match object; span=(5, 6), match='_'>
>>> re.search(r'\W', 'abe_#32') # match nonword character
<re.Match object; span=(4, 5), match='#'>

#3) \d, \D

\d matches any numeric digit from 0 to 9.

\D is the opposite. It matches any character that is not a numeric digit from 0 to 9.

Example 6: Match digit and non-digit characters.

>>> re.search(r'\d', 'abe_#3)') # match digits
<re.Match object; span=(5, 6), match='3'>
>>> re.search(r'\D', '1234d8') # match non-digits
<re.Match object; span=(4, 5), match='d'>

#4) \s, \S

\s matches any whitespace character, such as a space, tab, or newline.

\S matches any character that is not whitespace.

Example 7: Match whitespace and non-whitespace characters

>>> re.search(r'\s', 'hello\nworld') # match whitespace
<re.Match object; span=(5, 6), match='\n'>
>>> re.search(r'\S', ' \n \t 9 \n') # match nonwhitespace
<re.Match object; span=(5, 6), match='9'>

Grouping with Parentheses

It is possible to group patterns in regex so that specific conditions can be applied to those groups. Also, this makes it possible to use the group() match object method to retrieve specific groups of the matching texts.

For example, consider the regex to match our Washington phone number with area code; (\d\d\d)(-)(\d\d\d)(-)(\d\d\d\d). This will match 202-555-0164 and other similar patterns.

But now, with grouping, we can separate the area code, 202, from the rest of the number. The first set of parentheses represents group 1, the second set will be group 2, and so on.

Example 8: Grouping patterns in regex

import re

def grouping():
    regex = r'(\d\d\d)(-)(\d\d\d)(-)(\d\d\d\d)'
    text = 'My phone number is: 202-555-0164'
    regexResult = re.search(regex, text)

    print("Groups as single string: ", regexResult.group()) # return all groups as single string

    print("First group | Area code: ", regexResult.group(1)) # first group gets the area code

    print("Second group | First hyphen: ", regexResult.group(2)) # second group gets the first hyphen

    print("Groups as tuple: ", regexResult.groups()) # groups() returns all groups as tuple

if __name__ == '__main__':
    grouping()

Output

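Groups as single string:  202-555-0164
First group | Area code:  202
Second group | First hyphen:  -
Groups as tuple:  ('202', '-', '555', '-', '0164')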

Note that if we try to access a group that doesn’t exist, an IndexError exception will be raised. For example, the regex above has five groups, so trying to access group(6) will raise an exception.
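
A minimal sketch of this behaviour, reusing the pattern from Example 8 (which contains five groups). The variable names are chosen for illustration.

```python
import re

match = re.search(r"(\d\d\d)(-)(\d\d\d)(-)(\d\d\d\d)",
                  "My phone number is: 202-555-0164")

group_count = len(match.groups())   # number of groups in the pattern
last_group = match.group(5)         # the final four-digit part
print(group_count, last_group)

try:
    match.group(6)                  # there is no sixth group
    raised = False
except IndexError as err:
    raised = True
    print("IndexError:", err)
```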

Quantifier

A quantifier in this case determines how many times a regex pattern should match successfully.

Table 2: Quantifiers

Quantifier    Description
*             Matches zero or more of the preceding regex pattern
+             Matches one or more of the preceding regex pattern
?             Matches zero or one of the preceding regex pattern
{m,n}         Matches m to n instances of the preceding regex pattern
{m}           Matches exactly m instances of the preceding regex pattern
{m,}          Matches m or more instances of the preceding regex pattern
{,n}          Matches zero to n instances of the preceding regex pattern

Note that the braces must not contain spaces. Python treats a form such as {m, n} as literal text rather than as a quantifier.

Let’s look at examples of each of these quantifiers.

#1) Matching zero or more with Star(*)

Matches zero or more of the preceding regex pattern.

Example 9: Match zero or more ‘o’ in a string

>>> re.search('fao*h', 'fah')
<re.Match object; span=(0, 3), match='fah'>
>>> re.search('fao*h', 'faoh')
<re.Match object; span=(0, 4), match='faoh'>
>>> re.search('fao*h', 'faoooh')
<re.Match object; span=(0, 6), match='faoooh'>

In the example above, we match three texts: fah, faoh, and faoooh. The first has no ‘o’, the second has one ‘o’, and the last has three. Each one matches because the pattern accepts zero or more occurrences of the character ‘o‘.

#2) Matching one or more with Plus(+)

Matches one or more of the preceding regex pattern. The pattern preceding the plus must appear at least once.

Example 10: Match one or more ‘o’ in a string

>>> print(re.search('fao+h', 'fah'))
None
>>> re.search('fao+h', 'faoh')
<re.Match object; span=(0, 4), match='faoh'>
>>> re.search('fao+h', 'faoooh')
<re.Match object; span=(0, 6), match='faoooh'>

The text ‘fah’ in the first line of code doesn’t match because at least one ‘o’ is required by the plus sign.

#3) Optional Matching with the Question Mark(?)

Matching zero or one of the preceding regex patterns. This is used when we have a pattern that we want to match only optionally.

It differs from the star (*) in that the preceding pattern may be absent or appear once, but not more than once.

Example 11: Match zero or a single ‘o’ character

>>> re.search('fao?h', 'faoh')
<re.Match object; span=(0, 4), match='faoh'>
>>> re.search('fao?h', 'fah')
<re.Match object; span=(0, 3), match='fah'>
>>> print(re.search('fao?h', 'faooh'))
None

The last line of code above doesn’t return a match because there are two ‘o’ characters between ‘fa’ and ‘h’, while the question mark (?) allows at most one.

#4) Matching Specific Repetitions with Curly Brackets

If we have a pattern that we want to repeat a specific number of times, then we can use the curly bracket to specify a number or range of numbers in which the pattern should be repeated. This can be represented in many ways.

a) {m}

The regex (yah){m} will match exactly m instances of the (yah) group. For example, (yah){3} will match yahyahyah.

Example 12: Match an exact number of instances of a group

>>> re.search('(yah){3}','yahyahyahyahyahyah')
<re.Match object; span=(0, 9), match='yahyahyah'>
>>> num = re.search(r'(\d{3}) (\d{9})','Some random numbers: 237 678910225')
>>> num.groups()
('237', '678910225')

b) {m, n}

The regex (yah){m,n} will match m to n instances of the (yah) group. That is, (yah){2,4} will match yahyah, yahyahyah, and yahyahyahyah. Note that quantifiers are greedy by default, so when more than one repetition count is possible, the regex matches the longest string it can.

Example 13: Match 2 to 4 instances of the group of characters (yah)

>>> re.search('(yah){2,4}','yahyahyahyahyahyah')
<re.Match object; span=(0, 12), match='yahyahyahyah'>
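
The greedy behaviour can be sketched as follows. The non-greedy form, written by adding a ? after the quantifier, is not covered in this tutorial and is included here only for contrast. The variable names are illustrative.

```python
import re

text = "yahyahyahyahyahyah"   # six repetitions of "yah"

# Greedy: takes as many repetitions as the range allows (4).
greedy = re.search(r"(yah){2,4}", text)

# Non-greedy: a trailing ? takes as few repetitions as the range allows (2).
lazy = re.search(r"(yah){2,4}?", text)

# Only three repetitions are available here, so three are matched.
short = re.search(r"(yah){2,4}", "yahyahyah")

print(greedy.group(), lazy.group(), short.group())
```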

c) {m,}

The regex (yah){m,} will match m or more instances of the (yah) group.

Example 14: Match 3 or more instances of the group of characters (yah)

>>> re.search('(yah){3,}','yahyahyahyahyahyah')
<re.Match object; span=(0, 18), match='yahyahyahyahyahyah'>

d) {,n}

The regex (yah){,n} will match zero to n instances of the (yah) group.

Example 15: Match 0 to 5 instances of the group of characters (yah)

>>> re.search('(yah){,5}','yahyah')
<re.Match object; span=(0, 6), match='yahyah'>

Anchors

Anchors are used for detecting a particular location where a match should occur rather than matching particular characters.

#1) The Caret and Dollar Sign Characters

The caret (^) symbol or \A is used at the start of a regex pattern to indicate that the string must start with that pattern for a match to succeed. Likewise, the dollar sign ($) or \Z is used at the end of a regex pattern to indicate that the string must end with that pattern for a match to succeed.

Example 16: Mark the start and end of a regex pattern

>>> re.search('^Good', 'Good morning')
<re.Match object; span=(0, 4), match='Good'>
>>> re.search('^Good', 'Hello, Good evening') is None
True
>>> re.search('evening$', 'Good evening')
<re.Match object; span=(5, 12), match='evening'>
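
The \A and \Z anchors behave like ^ and $ in the cases above. One place where they differ is multi-line text combined with the re.MULTILINE flag, which this tutorial does not otherwise cover. The following is a minimal sketch with a made-up two-line string.

```python
import re

text = "Good morning\nGood evening"

# By default, ^ matches only at the very start of the string.
default_caret = re.findall(r"^Good", text)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_caret = re.findall(r"^Good", text, re.MULTILINE)

# \A matches only at the start of the string, even with re.MULTILINE.
start_only = re.findall(r"\AGood", text, re.MULTILINE)

# \Z matches only at the very end of the string.
end_match = re.search(r"evening\Z", text)

print(default_caret, multiline_caret, start_only, end_match.group())
```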

#2) Boundary Matching

\b matches at the position before or after a sequence of alphanumeric or underscore characters, that is, characters matched by \w or [a-zA-Z0-9_]. It allows us to match “word boundaries”. For example, consider the regex \bis\b applied to the text “hello this is the final one”. Based on the word boundaries, only the second ‘is‘ will match.

This is because there is a word boundary immediately before the ‘i‘ of the standalone ‘is’: the character before it is a space, which is neither alphanumeric nor an underscore. The same holds after its ‘s‘.

But for the first ‘is‘, which is part of ‘this’, the character before ‘i‘ is ‘h‘, which is alphanumeric, so there is no word boundary there. Consider the diagram below.

Figure: Demonstrating Boundary Matching

\B is the opposite of \b. It matches at any position where \b does not.

Example 17: Match word and nonword boundaries

>>> re.search(r'\bis\b', 'hello this is the final one')
<re.Match object; span=(11, 13), match='is'>
>>> re.search(r'\Bis\b', 'hello this is the final one')
<re.Match object; span=(8, 10), match='is'>

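A few things to note here:

  • We used the raw string format by putting an ‘r‘ before the first quote of the string value. In a raw string, Python does not treat the backslash as an escape character, so we can write ‘\b‘ instead of ‘\\b‘. Without the ‘r‘ prefix, ‘\b‘ would be interpreted as a backspace character.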
  • In the first regex, the ‘is’ matches at index 11, and in the second regex, the ‘is’ matches at index 8.

Escaping Metacharacters

Metacharacters are the building blocks of regular expressions. So far, we have covered several kinds of metacharacters: quantifiers, anchors, character classes, and groups. In this section, we shall look at the other metacharacters and see how to escape them.

a) [ ]

The square brackets allow us to define our own character classes. At times, we may want to define a specific pattern that can’t be handled by the built-in character classes (\d, \w, \s, etc.). For example, say we want to match all lowercase vowels and the digits 3, 4, and 6. We could define our class as [aeiou346].

With square brackets, we can do other cool things like specifying a range of letters or numbers using the hyphen.

For example, [a-zA-Z0-9] will match all lowercase and uppercase letters from a to z and digits from 0 to 9. Also, most metacharacters lose their special meaning inside square brackets and don’t need to be escaped. For example, [abc.*] will match the letters a, b, and c, the period, and the star (*) symbol.

Example 18: Define a character class

>>> rex = re.compile(r'[abc.*]')
>>> rex.findall('jab.c*')
['a', 'b', '.', 'c', '*']

b) Conditional matching with the Pipe

A pipe is represented by the | character in Python regex. It is used to match one of several alternative patterns. For example, the regex r'one|two' will match either ‘one’ or ‘two’. If both appear in the text, re.search() returns whichever match occurs first in the text, not necessarily the alternative written first in the pattern.
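
A minimal sketch of alternation with the pipe. The sample texts and the gr(a|e)y pattern are assumptions added for illustration.

```python
import re

# re.search() reports the leftmost match in the text, here 'two'.
first = re.search(r"one|two", "two plus one")

# re.findall() returns every match, from left to right.
every = re.findall(r"one|two", "two plus one")

# Parentheses limit the alternation to part of the pattern.
grouped = re.search(r"gr(a|e)y", "a grey cat")

print(first.group(), every, grouped.group(), grouped.group(1))
```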

Escaping with backslash (\)

We have seen many metacharacters so far. Now, escaping these metacharacters is important for the following reason. Imagine our text has any of these metacharacters and we want to use a regex in order to match them literally.

Example 19: Escaping metacharacters

>>> re.search(r'(237)', '(237) 777777')
<re.Match object; span=(1, 4), match='237'>
>>> re.search(r'\(237\)', '(237) 777777')
<re.Match object; span=(0, 5), match='(237)'>

In our text, we have the parentheses ( ), which are the metacharacters for grouping. The way our regex uses them in the first line of code indicates that we want to match a group containing the digits 237. This won’t match the parentheses as intended.

In the second line of code, we used the backslash(\) to escape these parentheses so that our regex will treat them as normal characters to be matched literally.
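
When the literal text is not known in advance, typing the backslashes by hand is error-prone. The re module provides re.escape() for this purpose. The tutorial itself does not use it, so treat the following as a supplementary sketch.

```python
import re

literal = "(237)"

# re.escape() adds a backslash before each regex metacharacter.
escaped = re.escape(literal)
match = re.search(escaped, "(237) 777777")

print(escaped)
print(match.group(), match.span())
```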

Lookahead and Lookbehind Assertions

Lookahead and lookbehind, collectively called ‘lookaround’, are assertion patterns. They assert whether a match is possible at the current position, but the text they check is not included in the returned match.

Let’s say we have phone numbers with their area codes. We want to select the phone numbers that have a specific area code, but we only want the numbers themselves, without the area code. Lookaround assertions help us achieve just that.

Positive and Negative lookahead

Positive lookahead is indispensable if we want to match a pattern that must be followed by another pattern. It is represented by the regex (?=<lookahead_regex>). For example, the regex a(?=b) matches an a that is followed by a b.

Negative lookahead is the opposite of the Positive lookahead. It matches a pattern that is not followed by another pattern. It is represented by the regex (?!<lookahead_regex>). For example, the regex a(?!b) matches an a that is not followed by a b.

Example 20: Match numbers with a specific extension

Say we have a list of IDs 2359PI, 4539PI, 4383MI, 2008MI. Each number has an extension which is either PI or MI. Now, we are only concerned about the numbers and not the extension, but we want to get all numbers with the extension PI.

>>> import re
>>> ids = "2359PI, 4539PI, 4383MI, 2008MI"
>>> regexId = re.compile(r'\d+(?=PI)')
>>> regexId.findall(ids)
['2359', '4539']
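
Negative lookahead can be sketched on the same list of IDs, this time selecting the numbers that are not followed by PI. The second pattern below is one possible way to write it, not the only one.

```python
import re

ids = "2359PI, 4539PI, 4383MI, 2008MI"

# Pitfall: \d+ can backtrack, so '235' matches because it is
# followed by '9P', which is not 'PI'.
naive = re.findall(r"\d+(?!PI)", ids)

# Also rejecting a following digit forces the whole number to be taken.
not_pi = re.findall(r"\d+(?!\d|PI)", ids)

print(naive)
print(not_pi)
```

The first pattern shows why negative lookahead after a quantifier needs care: the regex engine is free to shorten the match until the assertion succeeds.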

Positive and Negative Lookbehind

Lookbehind works the same way as lookahead but in the opposite direction: it checks the text before the current position rather than after it.

Positive lookbehind matches a pattern that must be preceded by another pattern. It is represented by (?<=<lookbehind_regex>). For example, the regex (?<=e)t matches the character, t in set, bet, but not in bat, cat.

Negative lookbehind is the opposite of positive lookbehind. It matches a pattern that is not preceded by another pattern. It is represented by the regex (?<!<lookbehind_regex>).

Example 21: Match any phone number that is preceded by the country code +1 and follows a US-specific format.

>>> import re
>>> text = "My US number is: +1-202-555-0184 , and my Nigeria number is: +23484757664"
>>> re.search(r'(?<=\+1-)\d{3}-\d{3}-\d{4}', text)
<re.Match object; span=(20, 32), match='202-555-0184'>
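
Positive and negative lookbehind can be compared side by side using the set, bet, bat, cat words mentioned earlier. This is a minimal sketch; re.finditer() is used only to collect the position of each match.

```python
import re

words = "set bet bat cat"

# Positions of every 't' that is preceded by 'e' (in 'set' and 'bet').
after_e = [m.start() for m in re.finditer(r"(?<=e)t", words)]

# Positions of every 't' that is NOT preceded by 'e' (in 'bat' and 'cat').
not_after_e = [m.start() for m in re.finditer(r"(?<!e)t", words)]

print(after_e, not_after_e)
```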

The re Module

In Python, all regex functions live in the re module. We have already seen some of this module’s functions; in this section, we take a closer look at a few of them.

#1) re.compile()

This method takes in a string representing our regex and returns a regex object which can be used for matching using all re module methods like match(), search(), findall(), etc.

It also permits us to save regex objects and use them in other parts of the program.
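
A minimal sketch of compiling a pattern once and reusing it. The constant name PHONE_RE and the sample messages are assumptions made for this example.

```python
import re

# Compiled once, then reused for every message below.
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")

messages = [
    "Call 202-555-0164 after noon",
    "No number in this message",
    "Office: 202-555-0184, fax: 202-555-0199",
]

results = [PHONE_RE.findall(message) for message in messages]
for found in results:
    print(found)
```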

We have seen this in action in example 20 above.

#2) re.search(pattern, string, flags=0)

This function scans through a string and returns a match object for the first location where the regex produces a match. If no match is found, then None is returned.

It can be called from the re module directly or as a method of the regex object returned by re.compile(). If called directly, its first argument should be a regex pattern. When called as a method of a compiled regex object, i.e. search(string[, pos[, endpos]]), it also takes two optional arguments, pos and endpos, which determine the indices where the search is to start and stop respectively.
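
The optional pos and endpos arguments can be sketched with a compiled pattern, whose search() method accepts them. The sample text is made up for illustration.

```python
import re

text = "Good morning, Good evening"
good = re.compile(r"Good")

first = good.search(text)           # scans from index 0
second = good.search(text, 1)       # pos=1: skips the first 'Good'
nothing = good.search(text, 1, 10)  # pos=1, endpos=10: no 'Good' in that range

print(first.span(), second.span(), nothing)
```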

We have seen this in action in almost all the examples above.

#3) re.findall(pattern, string, flags=0)

This function scans a string and returns a list of all non-overlapping matches of the regex pattern. Just like re.search(), it can also be called as a method of a regex object returned by re.compile(), in which case it accepts the optional pos and endpos arguments.
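
What the returned list contains depends on how many groups the pattern has, which explains why Example 3 had to join tuples. A minimal sketch with made-up IDs:

```python
import re

text = "2359PI, 4383MI"

# No groups: a list of whole matches.
no_group = re.findall(r"\d+[A-Z]+", text)

# One group: a list of that group's text only.
one_group = re.findall(r"(\d+)[A-Z]+", text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d+)([A-Z]+)", text)

print(no_group, one_group, two_groups)
```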

We have seen this in action in Examples 3, 4, 18, and 20.

Frequently Asked Questions

Q #1) What is a Regular expression in Python?

Answer: A regular expression, also called a regex, is a special sequence of characters that defines patterns for complex string-matching functionality.

Q #2) What’s the difference between ? and * in a regular expression?

Answer: In regex, the star(*) symbol matches zero or more instances of the preceding expression while the question mark(?) symbol matches zero or one instance of the preceding expression.

The question mark (?) symbol is commonly used when we want to make part of a pattern optional. Both symbols also match zero instances of the preceding expression.

Q #3) What does re mean in Python?

Answer: In Python, re is a built-in module that provides Perl-like regular expression operations. It is fully documented in the official Python documentation.

Q #4) Are Regular Expressions useful?

Answer: Regex is a powerful way to search for and replace substrings that match a pattern. Without regex, these operations would be strenuous and take many lines of code.

Many modern text editors, IDEs, and word processors have find and find-and-replace features that can use regular expressions to search for specific patterns in text.

Q #5) What does findall() return in Python?

Answer: Unlike the re.search() function that returns the first location where a match occurs, the re.findall() function returns a list of all matches to a regex pattern.

Q #6) Can you create your own character class?

Answer: Yes. Python’s built-in character classes are \w, \d, \s and their counterparts \W, \D, and \S respectively.

In Python, we can use the square brackets [ ] metacharacter to create our own customized character class to suit our needs. For example, [0-5a-d] will match only the numbers 0 to 5 and letters a to d.

Conclusion

In this tutorial, we looked at what Regular Expressions or Regex are in Python. We saw examples of finding texts in a string with and without the use of regex and saw how indispensable regex is.

We also looked at different aspects of regex like character classes, grouping with parentheses, quantifiers, anchors, etc. We treated some more advanced aspects like lookahead and lookbehind assertions and finally, we examined a few methods of the re module.
