Python Regular Expressions: Pattern Matching from Basics to Advanced
Imagine you're working at a help desk and need to find every email address buried in thousands of support tickets. You could scan each line character by character, but that would take forever. Regular expressions let you describe the pattern of an email address, and Python finds every match for you.
Regular expressions (often called regex or regexp) are a mini-language for describing text patterns. Python's built-in re module gives you functions to search, extract, replace, and split text using these patterns.
In this tutorial, you'll learn to use re.search(), re.findall(), re.sub(), character classes, quantifiers, groups, and compiled patterns. By the end, you'll be able to validate input, extract data, and clean messy strings with confidence.
Why Do Regex Patterns Use Raw Strings?
Before diving into patterns, there's one important habit: always write regex patterns as raw strings by putting an r before the opening quote, like r'\d+'. Without the r, Python interprets backslashes as escape characters (like \n for newline) before the regex engine ever sees them.
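import re
# Without the r prefix, '\b' is a backspace character, so this pattern never matches
pattern = '\bcat\b'
print(re.search(pattern, 'the cat sat'))  # None

import re
# The r prefix keeps backslashes literal, so \b reaches the regex engine as a word boundary
pattern = r'\bcat\b'
print(re.search(pattern, 'the cat sat'))  # <re.Match object; span=(4, 7), match='cat'>
How Do re.search() and re.match() Work?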
re.search(pattern, string) scans the entire string and returns the first match it finds (or None if there's no match). re.match(pattern, string) only checks at the beginning of the string. Most of the time, re.search() is what you want.
When re.search() finds a match, it returns a Match object. Call .group() to get the full match, or .group(1) to get the first captured group (the part inside parentheses). If there's no match, it returns None, so always check before calling .group().
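The difference between the two functions, and the None check, can be sketched as follows. The sample string and variable names are made up for illustration.

```python
import re

text = "order 42 shipped"  # hypothetical sample input

# re.search() scans the whole string and returns the first match
found = re.search(r"\d+", text)
# re.match() only tries at position 0, so it finds nothing here
at_start = re.match(r"\d+", text)

print(found.group())   # 42
print(at_start)        # None

# Parentheses capture part of the match; group(1) returns that part
m = re.search(r"order (\d+)", text)
order_id = m.group(1) if m else None
print(order_id)        # 42

# Guard against None before calling .group()
missing = re.search(r"refund", text)
status = missing.group() if missing else "no match"
print(status)          # no match
```

Calling .group() directly on a failed search would raise AttributeError, which is why the conditional check matters.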
What Are Character Classes and Quantifiers?
Character classes define which characters to match. Quantifiers define how many of them to match. Together, they form the backbone of every regex pattern.
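As a rough illustration of how classes and quantifiers combine, here is a small sketch using the common quantifiers +, *, ?, and {n}. The sample strings are invented for the example.

```python
import re

text = "Room 7, floor 12, code A1b2"  # hypothetical sample input

# \d matches one digit; + means "one or more"
digit_runs = re.findall(r"\d+", text)
print(digit_runs)     # ['7', '12', '1', '2']

# {2} means "exactly two"
two_digits = re.findall(r"\d{2}", text)
print(two_digits)     # ['12']

# \w+ matches runs of letters, digits, and underscores
words = re.findall(r"\w+", text)
print(words)          # ['Room', '7', 'floor', '12', 'code', 'A1b2']

# ? makes the preceding character optional
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)      # ['color', 'colour']

# * means "zero or more"
ab_runs = re.findall(r"ab*", "a ab abbb")
print(ab_runs)        # ['a', 'ab', 'abbb']

# Square brackets define a custom character class
vowels = re.findall(r"[aeiou]", "regex")
print(vowels)         # ['e', 'e']
```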
Here's a quick reference for the most useful shorthand classes:
\d -- any digit (0-9)
\D -- any non-digit
\w -- any word character (letters, digits, underscore)
\W -- any non-word character
\s -- any whitespace (space, tab, newline)
\S -- any non-whitespace
. -- any character except newline
How Do You Find All Matches or Replace Text?
re.findall() returns a list of all non-overlapping matches. re.sub() replaces every match with a new string. These two functions handle the most common regex tasks: extracting data and cleaning text.
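A minimal sketch of both functions, using an invented sample string. The email pattern here is deliberately loose and is only meant to show the mechanics.

```python
import re

# Hypothetical sample input
text = "Contact ann@example.com or bob@test.org by 2024-01-15"

# No groups: findall() returns each full match
emails = re.findall(r"[\w.-]+@[\w.-]+", text)
print(emails)    # ['ann@example.com', 'bob@test.org']

# One capturing group: findall() returns only the captured part
users = re.findall(r"([\w.-]+)@[\w.-]+", text)
print(users)     # ['ann', 'bob']

# sub() replaces every match with the replacement string
masked = re.sub(r"\d+", "#", text)
print(masked)    # Contact ann@example.com or bob@test.org by #-#-#

# \1, \2, \3 in the replacement refer back to captured groups
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(swapped)   # Contact ann@example.com or bob@test.org by 15/01/2024
```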
Note that when the pattern contains a capturing group (parentheses), findall() returns only the captured part, not the full match; with several groups it returns a list of tuples. Without groups, it returns the full match. This is a common source of confusion.
How Do Groups Let You Extract Specific Parts?
Parentheses () create capturing groups that let you extract specific pieces of a match. Think of them as highlighting the parts you care about inside a larger pattern.
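Here is one way this might look in practice, with numbered groups first and then named groups written as (?P<name>...). The log-style line is a made-up example.

```python
import re

line = "2024-03-09 ERROR Disk full"  # hypothetical sample input

# Numbered groups: each pair of parentheses captures one piece
m = re.search(r"(\d{4})-(\d{2})-(\d{2}) (\w+)", line)
if m:
    full = m.group(0)    # the whole match
    year = m.group(1)    # the first group
    parts = m.groups()   # all groups as a tuple
    print(full)          # 2024-03-09 ERROR
    print(year)          # 2024
    print(parts)         # ('2024', '03', '09', 'ERROR')

# Named groups make the extracted pieces self-describing
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", line)
if named:
    fields = named.groupdict()
    print(named.group("year"))   # 2024
    print(fields)                # {'year': '2024', 'month': '03'}
```

Named groups cost a little extra typing, but they keep working when the pattern is later reordered, whereas group numbers shift.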
How Do You Split Strings with re.split()?
Python's built-in str.split() only splits on a fixed separator string (or on runs of whitespace when called with no arguments). re.split() splits on a pattern, which is much more flexible. Need to split on commas, semicolons, and pipes all at once? Regex makes that easy.
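A short sketch of pattern-based splitting, including a capped split on a log-style line. All sample strings are invented for illustration.

```python
import re

# Split on commas, semicolons, or pipes, swallowing surrounding whitespace
fields = re.split(r"\s*[,;|]\s*", "red, green;blue | yellow")
print(fields)    # ['red', 'green', 'blue', 'yellow']

# Split on any run of whitespace (spaces or tabs)
words = re.split(r"\s+", "one  two\tthree")
print(words)     # ['one', 'two', 'three']

# maxsplit=1 stops after the first split, so later colons are kept
log_line = "ERROR: disk full: /dev/sda1"  # hypothetical log line
level, message = re.split(r":\s*", log_line, maxsplit=1)
print(level)     # ERROR
print(message)   # disk full: /dev/sda1
```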
The maxsplit parameter caps the number of splits, which is handy when you only want to split at the first occurrence. For a log line such as 'ERROR: disk full: /dev/sda1', passing maxsplit=1 splits only at the first colon, so the rest of the message stays intact.
What Are Some Common Regex Patterns?
Here are patterns you'll use over and over. Each one solves a common real-world task:
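The original pattern list is not reproduced here, so the following is only an illustrative set. The pattern names, the PATTERNS dictionary, and the sample string are assumptions made for this sketch, and the patterns are intentionally simplified (the date pattern, for instance, accepts a month of 99). Each one is compiled with re.compile() so it can be reused without re-parsing.

```python
import re

# Simplified, illustrative patterns; real validation usually needs stricter rules
PATTERNS = {
    "iso_date": r"\b\d{4}-\d{2}-\d{2}\b",
    "time_24h": r"\b(?:[01]\d|2[0-3]):[0-5]\d\b",
    "decimal_number": r"-?\d+\.\d+",
    "simple_email": r"[\w.-]+@[\w-]+(?:\.[\w-]+)+",
}

# Compile once, reuse many times
compiled = {name: re.compile(p) for name, p in PATTERNS.items()}

# Hypothetical sample input
sample = "On 2024-06-30 at 14:05 the sensor read -3.75; mail help@example.com"

results = {name: rx.findall(sample) for name, rx in compiled.items()}
for name, matches in results.items():
    print(name, matches)
# iso_date ['2024-06-30']
# time_24h ['14:05']
# decimal_number ['-3.75']
# simple_email ['help@example.com']
```

The (?:...) groups are non-capturing, so findall() still returns the full match rather than just the grouped part.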
Practice Exercises
Write a function find_phone(text) that uses re.search() to find the first phone number in the format XXX-XXX-XXXX in the given text. Return the phone number as a string, or "Not found" if no phone number exists.
Write a function extract_hashtags(text) that returns a list of all hashtags in the text. A hashtag starts with # followed by one or more word characters (letters, digits, or underscores). Return just the tag names without the # symbol.
What will this code print? Think carefully about greedy vs lazy matching and what findall returns with groups.
Write a function censor_cards(text) that replaces all sequences of 4 groups of 4 digits separated by dashes (like 1234-5678-9012-3456) with ****-****-****-XXXX, where XXXX is the last 4 digits. Use re.sub() with a function or groups.
This email validator has two bugs. Find and fix them so it correctly validates that a string looks like an email address (one or more word characters/dots/hyphens, then @, then a domain, then a dot, then 2-4 letters).
Write a function parse_log(entry) that extracts the timestamp, level, and message from a log entry formatted as "[HH:MM:SS] LEVEL: message". Return a dictionary with keys "time", "level", and "message". Use named groups. If the entry doesn't match, return None.
Refactor this messy string-cleaning code to use re.sub(). The function should: (1) replace all runs of multiple spaces/tabs with a single space, and (2) strip leading and trailing whitespace. The current code uses multiple chained .replace() calls -- make it cleaner with one regex substitution plus .strip().