Pandas String and DateTime Operations for Real-World Data
Real-world data is messy. Names arrive in ALL CAPS, dates come as random text strings like "Jan 5, 2024" or "2024/01/05", and phone numbers have dashes in some rows but not others. Before you can analyze anything, you need to clean this stuff up.
Pandas has two accessors for this job: the `.str` accessor for string operations and the `.dt` accessor for datetime operations. Think of them as specialized toolboxes that attach to any Series of strings or dates.
In this tutorial, you'll learn how to clean text data, extract patterns from strings, convert messy dates into proper datetime objects, and pull out components like year, month, and day of the week.
How Do You Clean Text with the .str Accessor?
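When a column contains strings, you can use .str followed by most of Python's built-in string methods, such as lower(), strip(), or title(). The method runs on every value in the column at once -- no loops needed. Missing values pass through as NaN instead of raising an error.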
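As a minimal sketch of the idea, the snippet below applies three .str methods to a small Series. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with inconsistent case and stray whitespace
names = pd.Series(["  ALICE SMITH", "bob jones  ", " Carol White "])

stripped = names.str.strip()   # remove leading and trailing whitespace
lowered = names.str.lower()    # lowercase every value
lengths = names.str.len()      # character count per value, whitespace included

print(stripped.tolist())  # ['ALICE SMITH', 'bob jones', 'Carol White']
print(lowered.tolist())   # ['  alice smith', 'bob jones  ', ' carol white ']
print(lengths.tolist())   # [13, 11, 13]
```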
You can chain .str methods together just like regular Python string methods, as long as you repeat .str before each step. Each step produces a new Series, so the original data stays untouched.
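Here is one way a chain might look, assuming a hypothetical Series of product labels that should become lowercase identifiers. Note that .str appears before every method in the chain.

```python
import pandas as pd

# Hypothetical product labels; the values are invented for this sketch
raw = pd.Series(["  Red Apple", "GREEN pear  ", " Blue Berry "])

# strip whitespace -> lowercase -> replace inner spaces with underscores
clean = raw.str.strip().str.lower().str.replace(" ", "_", regex=False)

print(clean.tolist())  # ['red_apple', 'green_pear', 'blue_berry']
print(raw.tolist())    # the original Series is unchanged
```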
How Do You Search and Replace in String Columns?
The .str.contains() method checks if each string matches a pattern, which it treats as a regular expression by default. It returns a boolean Series that you can use as a filter. For replacing text, .str.replace() works like Python's str.replace() but on the whole column.
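The sketch below shows both methods on an invented DataFrame. The column names and values are assumptions for illustration. Passing na=False is one way to keep missing values out of the filter, and regex=True makes the pattern handling in .str.replace() explicit.

```python
import pandas as pd

# Hypothetical data; one product is missing on purpose
df = pd.DataFrame({
    "product": ["Laptop Pro", "laptop mini", "Desk Lamp", None],
    "phone": ["555-123-4567", "5559876543", "555 222 3333", "555-000-1111"],
})

# case=False ignores case; na=False treats missing values as "no match"
is_laptop = df["product"].str.contains("laptop", case=False, na=False)
print(df.loc[is_laptop, "product"].tolist())  # ['Laptop Pro', 'laptop mini']

# Remove every non-digit character so all phone numbers share one format
df["phone_clean"] = df["phone"].str.replace(r"\D", "", regex=True)
print(df["phone_clean"].tolist())
```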
How Do You Split Strings and Extract Parts?
.str.split() splits each string into a list. You can expand it into separate columns with expand=True. For more precise extraction, .str.extract() uses regex capture groups to pull out specific patterns.
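To make the difference concrete, here is a small sketch with made-up names and order codes. The ORD-year-sequence code format is an assumption chosen for the example. Named capture groups become column names in the result, and rows that do not match come back as NaN.

```python
import pandas as pd

full_names = pd.Series(["Ada Lovelace", "Alan Turing", "Grace Hopper"])

# Without expand: each value becomes a Python list
as_lists = full_names.str.split(" ")
print(as_lists.tolist())

# With expand=True: each piece gets its own column
parts = full_names.str.split(" ", expand=True)
parts.columns = ["first", "last"]
print(parts["last"].tolist())  # ['Lovelace', 'Turing', 'Hopper']

# Hypothetical order codes; the last one does not follow the pattern
codes = pd.Series(["ORD-2024-001", "ORD-2023-117", "bad-code"])
extracted = codes.str.extract(r"ORD-(?P<year>\d{4})-(?P<seq>\d{3})")
print(extracted)
```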
How Do You Convert Strings to Dates?
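The pd.to_datetime() function is your go-to tool for turning messy date strings into proper datetime objects. It can parse many common formats automatically. In pandas 2.0 and later, it infers the format from the first value and applies it to the rest of the column, so a column that mixes several formats needs format="mixed" or a cleanup step first.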
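A short sketch of automatic parsing follows, using invented values. The errors="coerce" option is one way to turn unparseable entries into NaT (pandas' missing-date marker) instead of raising an exception.

```python
import pandas as pd

# Single strings in different styles parse to the same Timestamp
a = pd.to_datetime("Jan 5, 2024")
b = pd.to_datetime("2024/01/05")
print(a == b)  # True

# A whole column in one consistent format
iso = pd.Series(["2024-01-05", "2024-02-14", "2024-12-31"])
dates = pd.to_datetime(iso)
print(dates.dtype)

# errors="coerce" replaces bad values with NaT instead of failing
messy = pd.Series(["2024-01-05", "not a date", "2024-03-10"])
parsed = pd.to_datetime(messy, errors="coerce")
print(parsed.isna().tolist())  # [False, True, False]
```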
When the automatic parser can't figure out your format, you can specify it explicitly using format=. This is also faster because Pandas doesn't have to guess.
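As an illustration, the sketch below parses two hypothetical columns with explicit format strings. The format codes follow Python's strftime conventions: %d is the day, %m the month number, %b the abbreviated month name, and %Y the four-digit year.

```python
import pandas as pd

# Day-first dates: "05/01/2024" means 5 January, not May 1st
day_first = pd.Series(["05/01/2024", "14/02/2024", "31/12/2024"])
parsed = pd.to_datetime(day_first, format="%d/%m/%Y")
print(parsed.dt.month.tolist())  # [1, 2, 12]

# Text dates with an abbreviated English month name
text_dates = pd.Series(["Jan 5, 2024", "Feb 14, 2024"])
parsed_text = pd.to_datetime(text_dates, format="%b %d, %Y")
print(parsed_text.dt.day.tolist())  # [5, 14]
```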
How Do You Extract Date Components?
Once you have a datetime column, the .dt accessor lets you pull out individual components like year, month, day, day of the week, and more. This is incredibly useful for time-based analysis.
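The following sketch pulls several components out of a hypothetical order_date column. The column name and dates are made up for the example. Note that .dt.dayofweek counts from Monday as 0.

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-02-14", "2024-12-31"])
})

df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day"] = df["order_date"].dt.day
df["weekday"] = df["order_date"].dt.day_name()   # full English day name
df["dow"] = df["order_date"].dt.dayofweek        # Monday=0 ... Sunday=6

print(df["weekday"].tolist())  # ['Friday', 'Wednesday', 'Tuesday']
print(df["dow"].tolist())      # [4, 2, 1]
```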
How Do You Do Math with Dates?
You can subtract dates to get a timedelta (duration), add timedeltas to shift dates, and create date ranges with pd.date_range(). This is the foundation of time series analysis.
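Here is a minimal sketch of all three operations, with invented start and end dates. The second pair crosses the end of February in a leap year to show that pandas handles calendar details for you.

```python
import pandas as pd

start = pd.Series(pd.to_datetime(["2024-01-05", "2024-02-27"]))
end = pd.Series(pd.to_datetime(["2024-01-09", "2024-03-01"]))

# Subtracting two datetime Series gives a timedelta Series
elapsed = end - start
print(elapsed.dt.days.tolist())  # [4, 3]

# Adding a Timedelta shifts every date by the same amount
shifted = start + pd.Timedelta(days=7)
print(shifted.dt.strftime("%Y-%m-%d").tolist())  # ['2024-01-12', '2024-03-05']

# Five consecutive calendar days starting on 1 January 2024
week = pd.date_range(start="2024-01-01", periods=5, freq="D")
print(week[-1])
```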
Practice Exercises
Clean the names Series by stripping whitespace and converting to title case. Print the cleaned names as a list.
Filter the DataFrame to keep only rows where the email column contains "gmail". Print the name column of the filtered rows as a list.
What will this code print?
import pandas as pd
s = pd.Series(['hello world', 'foo bar', 'python pandas'])
print(s.str.split(' ').str.get(0).str.upper().tolist())
Convert the date_str column to datetime, then add a month column with the month number. Print the month column as a list.
Add a column days_to_deliver that shows how many days passed between order_date and delivery_date. Print the column as a list of integers.
This code tries to extract area codes from phone numbers, but the regex pattern is wrong. Fix the str.extract() call so it correctly pulls out the 3-digit area code from the parentheses.
Print the extracted area codes as a list.