Build an Automated Data Cleaning Pipeline
Real-world data is messy. Names are spelled differently, fields are missing, dates are in three different formats, and somehow someone entered "N/A" as their age. Before you can analyze data, you need to clean it.
In this project, you'll build a complete data cleaning pipeline step by step. By the end, you'll have a reusable function that takes messy data in and spits clean data out. This is the kind of code data analysts write every single day.
Step 1: Load the Messy Data
Every data cleaning project starts with loading the data and taking a first look. Our dataset is a customer list from a fictional online store. It has all the typical problems you'd see in real data pulled from a database or CSV export.
Take a moment to spot the problems. There are inconsistent name casings, duplicate rows, non-numeric ages, impossible ages, missing values, extra whitespace, and dates in different formats. A typical Tuesday for a data analyst.
Create the messy DataFrame from the provided dictionary and print a summary showing:
1. The number of rows and columns
2. The count of missing values per column
Print the output in this exact format:

```
Rows: 10, Columns: 5
Missing values:
name 2
email 1
age 0
signup_date 1
purchase_total 1
```

Note: the name column contains a None, which counts as missing, and a whitespace-padded " Eve Adams ", which is NOT missing (it has content). Look at the data carefully to see which values pandas actually treats as null.
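If you want to check your approach, here's a minimal sketch of the summary step, assuming `df` has already been built with `pd.DataFrame(...)` from the provided dictionary:

```python
import pandas as pd

# Assumes df was created from the provided dictionary, e.g.:
# df = pd.DataFrame(messy_data)

rows, cols = df.shape  # shape is a (rows, columns) tuple
print(f"Rows: {rows}, Columns: {cols}")

print("Missing values:")
# isna() flags true nulls (None/NaN); whitespace-only strings are not flagged.
# .to_string() prints just the per-column counts, without the dtype footer.
print(df.isna().sum().to_string())
```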
Step 2: Identify and Handle Missing Values
Missing data is the most common problem you'll encounter. There are two main strategies: drop the rows with missing values, or fill them with something sensible. The right choice depends on the column and how much data is missing.
For our dataset, we'll drop rows where the name is missing (we can't identify a customer without a name) and fill missing purchase totals with 0.
Write a function handle_missing(df) that:
1. Drops rows where name is missing (None/NaN)
2. Drops rows where email is missing
3. Fills missing purchase_total values with 0.0
4. Fills missing signup_date values with "Unknown"
5. Returns the cleaned DataFrame
Then call it and print how many rows remain.
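One way the function might look. The exact wording of the final print isn't specified by the exercise, so "Rows remaining:" below is just an illustrative choice:

```python
def handle_missing(df):
    # Rows without a name or email can't identify a customer, so drop them.
    df = df.dropna(subset=['name', 'email'])
    # Fill remaining gaps with sensible defaults instead of dropping more rows.
    df['purchase_total'] = df['purchase_total'].fillna(0.0)
    df['signup_date'] = df['signup_date'].fillna('Unknown')
    return df

df = handle_missing(df)
print(f"Rows remaining: {len(df)}")
```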
Step 3: Remove Duplicate Rows
Duplicate rows inflate your numbers. If Alice appears twice, her purchase total gets counted twice. Pandas makes deduplication easy with drop_duplicates(). You can choose which columns determine whether two rows are "the same."
Write a function remove_duplicates(df) that:
1. Prints how many duplicate rows were found
2. Removes duplicates based on the name and email columns (keep the first occurrence)
3. Resets the index (with drop=True)
4. Returns the cleaned DataFrame
Print format: Duplicates found: X
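A sketch that satisfies all four requirements:

```python
def remove_duplicates(df):
    # duplicated() marks every repeat of an earlier (name, email) pair as True,
    # so the sum is the number of rows that will be dropped.
    dup_count = df.duplicated(subset=['name', 'email']).sum()
    print(f"Duplicates found: {dup_count}")
    df = df.drop_duplicates(subset=['name', 'email'], keep='first')
    return df.reset_index(drop=True)
```

Counting before dropping matters: once drop_duplicates() has run, the information about how many rows were removed is gone unless you captured it first.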
Step 4: Fix Data Types
Sometimes columns that should be numbers contain text values. The age column in our dataset has entries like "thirty-two" mixed in with real numbers. We need to convert these to proper numeric types, handling the invalid entries gracefully.
Write a function fix_data_types(df) that:
1. Converts the age column to numeric (coercing errors to NaN)
2. Converts the purchase_total column to numeric (coercing errors)
3. Fills any NaN values in age with the median of the valid ages
4. Rounds the filled age values to whole numbers (use .round(0)), then converts them to int
5. Returns the cleaned DataFrame
Print the age column after cleaning.
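One possible implementation:

```python
def fix_data_types(df):
    # errors='coerce' turns unparseable entries like "thirty-two" into NaN
    # instead of raising an exception.
    df['age'] = pd.to_numeric(df['age'], errors='coerce')
    df['purchase_total'] = pd.to_numeric(df['purchase_total'], errors='coerce')
    # median() skips NaN by default, so it's computed from the valid ages only.
    df['age'] = df['age'].fillna(df['age'].median())
    df['age'] = df['age'].round(0).astype(int)
    return df

df = fix_data_types(df)
print(df['age'])
```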
Step 5: Standardize Text Fields
Text inconsistency is sneaky. "Alice Smith", "alice smith", and " ALICE SMITH " are all the same person, but your code treats them as different values. Standardizing text means picking one format and applying it everywhere.
Write a function standardize_text(df) that:
1. Strips leading/trailing whitespace from the name column
2. Converts name to title case (first letter of each word capitalized)
3. Strips whitespace from the email column
4. Converts email to lowercase
5. Returns the cleaned DataFrame
Print the name and email columns after cleaning.
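A sketch using pandas' vectorized string methods, which apply to the whole column at once:

```python
def standardize_text(df):
    # Chain the .str accessors: trim the padding first, then normalize casing.
    df['name'] = df['name'].str.strip().str.title()
    df['email'] = df['email'].str.strip().str.lower()
    return df

df = standardize_text(df)
print(df[['name', 'email']])
```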
Step 6: Handle Outliers
Outliers are values that are technically valid numbers but make no sense in context. An age of -5 or 150 is a valid integer, but not a valid human age. We need to define reasonable bounds and deal with values outside them.
There are two common approaches: clip the values to the boundary (so 150 becomes the max valid age), or replace them with NaN and fill later. Clipping is simpler and preserves all your rows.
Write a function handle_outliers(df) that:
1. Clips the age column to be between 0 and 120 (inclusive)
2. Clips the purchase_total column to be between 0 and 10000
3. Prints how many values were clipped in each column
4. Returns the cleaned DataFrame
To count clipped values, compare the original column to the clipped version before overwriting.
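A sketch that does exactly that. The per-column print wording below is an illustrative choice, since the exercise doesn't pin it down:

```python
def handle_outliers(df):
    for col, lo, hi in [('age', 0, 120), ('purchase_total', 0, 10000)]:
        clipped = df[col].clip(lower=lo, upper=hi)
        # Compare before overwriting; exclude nulls, because NaN != NaN is True
        # and would otherwise be counted as a clipped value.
        n_clipped = ((df[col] != clipped) & df[col].notna()).sum()
        print(f"{col}: {n_clipped} values clipped")
        df[col] = clipped
    return df
```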
Step 7: Understanding the Pipeline Flow
Before we build the final pipeline, let's make sure you understand how the steps chain together. Each cleaning function takes a DataFrame and returns a cleaned version. The output of one step becomes the input to the next.
What does this code print? Think about each step carefully.
```python
import pandas as pd

data = {'name': [' alice ', 'BOB', ' alice ', None],
        'score': [10, 20, 10, 30]}
df = pd.DataFrame(data)

# Step 1: Drop missing names
df = df.dropna(subset=['name'])
# Step 2: Standardize names
df['name'] = df['name'].str.strip().str.title()
# Step 3: Remove duplicates
df = df.drop_duplicates(subset=['name'])

print(len(df))
print(df['name'].tolist())
```

Step 8: Build the Complete Pipeline
Now it's time to combine all the cleaning steps into a single reusable function. A good pipeline function takes messy data in, applies each cleaning step in the right order, and returns clean data. The order matters: you should handle missing values before removing duplicates, and standardize text before checking for duplicates.
Write a function clean_data(df) that applies these steps in order:
1. Drop rows where name is missing
2. Standardize name: strip whitespace and convert to title case
3. Standardize email: strip whitespace and convert to lowercase
4. Remove duplicate rows based on name and email
5. Convert age to numeric (coerce errors), fill NaN with median, round and convert to int
6. Clip age between 0 and 120
7. Fill missing purchase_total with 0.0
8. Reset the index (drop=True)
9. Return the cleaned DataFrame
Print the final row count and the cleaned DataFrame.
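Putting it all together, a sketch of the full pipeline that inlines the logic from the earlier steps:

```python
def clean_data(df):
    """Take a messy customer DataFrame and return a cleaned copy."""
    df = df.dropna(subset=['name'])                          # 1. drop unnamed rows
    df['name'] = df['name'].str.strip().str.title()          # 2. standardize name
    df['email'] = df['email'].str.strip().str.lower()        # 3. standardize email
    df = df.drop_duplicates(subset=['name', 'email'])        # 4. dedupe
    df['age'] = pd.to_numeric(df['age'], errors='coerce')    # 5. coerce age...
    df['age'] = df['age'].fillna(df['age'].median()).round(0).astype(int)
    df['age'] = df['age'].clip(lower=0, upper=120)           # 6. clip age
    df['purchase_total'] = df['purchase_total'].fillna(0.0)  # 7. fill totals
    return df.reset_index(drop=True)                         # 8-9. reset and return

cleaned = clean_data(df)
print(f"Final rows: {len(cleaned)}")
print(cleaned)
```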
What You Built
You now have a reusable data cleaning pipeline that handles the most common data quality issues:

| Problem | What the pipeline does | Key pandas tools |
| --- | --- | --- |
| Missing values | Drop or fill gaps | dropna(), fillna() |
| Duplicates | Remove repeated rows | drop_duplicates() |
| Data types | Convert strings to numbers | pd.to_numeric() |
| Text cleanup | Standardize casing and whitespace | .str.strip(), .str.title() |
| Outliers | Clip extreme values | .clip() |
In real projects, you'd extend this pattern by adding validation checks, logging what was cleaned, and handling domain-specific rules. But the core structure stays the same: a function that takes messy data in and returns clean data out.