
Build an Automated Data Cleaning Pipeline

Intermediate · 40 min · 8 exercises · 120 XP

Real-world data is messy. Names are spelled differently, fields are missing, dates are in three different formats, and somehow someone entered "N/A" as their age. Before you can analyze data, you need to clean it.

In this project, you'll build a complete data cleaning pipeline step by step. By the end, you'll have a reusable function that takes messy data in and spits clean data out. This is the kind of code data analysts write every single day.

Step 1: Load the Messy Data

Every data cleaning project starts with loading the data and taking a first look. Our dataset is a customer list from a fictional online store. It has all the typical problems you'd see in real data pulled from a database or CSV export.

Our messy customer dataset
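The exact values load in the editor; as a stand-in, here is a sketch of a dataset with the same shape and the same problems (every value below is illustrative, not the course's actual data):

import pandas as pd

# Illustrative only: 10 rows, 5 columns, with the typical problems described below
data = {
    'name': ['Alice Smith', 'bob jones', '  alice smith  ', None, 'Carol White',
             'Carol White', ' Eve Adams ', None, 'Frank Miller', 'dave brown'],
    'email': ['alice@example.com', 'BOB@Example.com', 'alice@example.com',
              'dan@example.com', 'carol@example.com', 'carol@example.com',
              'eve@example.com ', None, 'frank@example.com', 'dave@example.com'],
    'age': [32, 'thirty-two', 32, 28, 45, 45, 'N/A', 25, 150, -5],
    'signup_date': ['2023-01-15', '15/01/2023', '2023-01-15', 'Jan 15, 2023',
                    '2023-03-22', '2023-03-22', '2023-04-01', None,
                    '2023-05-10', '2023-06-30'],
    'purchase_total': [120.50, 89.99, 120.50, 45.00, 200.00,
                       200.00, None, 60.00, 15000.00, 75.25],
}
df = pd.DataFrame(data)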

Take a moment to spot the problems. There are inconsistent name casings, duplicate rows, non-numeric ages, impossible ages, missing values, extra whitespace, and dates in different formats. A typical Tuesday for a data analyst.

Exercise 1: Load and Inspect the Data
Write Code

Create the messy DataFrame from the provided dictionary and print a summary showing:

1. The number of rows and columns

2. The count of missing values per column

Print the output in this exact format:

Rows: 10, Columns: 5
Missing values:
name              2
email             1
age               0
signup_date       1
purchase_total    1

Note: the name column contains true missing values (None/NaN) alongside messy-but-present entries like the whitespace-padded " Eve Adams ". Padded strings are NOT missing (they have content); only None/NaN entries count toward the missing totals. Look at the data carefully.


Step 2: Identify and Handle Missing Values

Missing data is the most common problem you'll encounter. There are two main strategies: drop the rows with missing values, or fill them with something sensible. The right choice depends on the column and how much data is missing.

For our dataset, we'll drop rows where the name or email is missing (we can't identify or contact a customer without them), fill missing purchase totals with 0, and fill missing signup dates with "Unknown".

Handling missing values with dropna and fillna
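A minimal sketch of both strategies, assuming the column names from Step 1:

# Drop rows missing a critical field: no name means no identifiable customer
df = df.dropna(subset=['name'])

# Fill less critical gaps with a sensible default instead of losing the row
df['purchase_total'] = df['purchase_total'].fillna(0.0)
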
Exercise 2: Handle Missing Values
Write Code

Write a function handle_missing(df) that:

1. Drops rows where name is missing (None/NaN)

2. Drops rows where email is missing

3. Fills missing purchase_total values with 0.0

4. Fills missing signup_date values with "Unknown"

5. Returns the cleaned DataFrame

Then call it and print how many rows remain.


Step 3: Remove Duplicate Rows

Duplicate rows inflate your analysis. If Alice appears twice, her purchase total gets counted twice. Pandas makes deduplication easy with drop_duplicates(). You can choose which columns determine whether two rows are "the same."

Removing duplicates
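A sketch of the pattern; the subset argument controls which columns decide that two rows are "the same":

# Count rows that repeat an earlier name/email pair
print('Duplicates found:', df.duplicated(subset=['name', 'email']).sum())

# Keep the first occurrence of each pair, then renumber the rows 0..n-1
df = df.drop_duplicates(subset=['name', 'email'], keep='first')
df = df.reset_index(drop=True)
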
Exercise 3: Remove Duplicates
Write Code

Write a function remove_duplicates(df) that:

1. Prints how many duplicate rows were found

2. Removes duplicates based on the name and email columns (keep the first occurrence)

3. Resets the index (with drop=True)

4. Returns the cleaned DataFrame

Print format: Duplicates found: X


Step 4: Fix Data Types

Sometimes columns that should be numbers contain text values. The age column in our dataset has entries like "thirty-two" mixed in with real numbers. We need to convert these to proper numeric types, handling the invalid entries gracefully.

Converting to numeric with error handling
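A sketch of the conversion, assuming the age column from Step 1:

# errors='coerce' turns unparseable values like 'thirty-two' into NaN
# instead of raising an exception
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# Fill the NaNs with the median of the ages that did convert
df['age'] = df['age'].fillna(df['age'].median())
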
Exercise 4: Fix Data Types
Write Code

Write a function fix_data_types(df) that:

1. Converts the age column to numeric (coercing errors to NaN)

2. Converts the purchase_total column to numeric (coercing errors)

3. Fills any NaN values in age with the median of the valid ages

4. Rounds the filled age values to whole numbers (use .round(0)), then converts them to int

5. Returns the cleaned DataFrame

Print the age column after cleaning.


Step 5: Standardize Text Fields

Text inconsistency is sneaky. "Alice Smith", "alice smith", and " ALICE SMITH " are all the same person, but your code treats them as different values. Standardizing text means picking one format and applying it everywhere.

Standardizing text with strip and title
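A sketch of the two fixes chained together with pandas string methods:

# '  alice smith  ' -> 'Alice Smith'
df['name'] = df['name'].str.strip().str.title()

# ' BOB@Example.com ' -> 'bob@example.com'
df['email'] = df['email'].str.strip().str.lower()
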
Exercise 5: Standardize Text Fields
Write Code

Write a function standardize_text(df) that:

1. Strips leading/trailing whitespace from the name column

2. Converts name to title case (first letter of each word capitalized)

3. Strips whitespace from the email column

4. Converts email to lowercase

5. Returns the cleaned DataFrame

Print the name and email columns after cleaning.


Step 6: Handle Outliers

Outliers are values that are technically valid numbers but make no sense in context. An age of -5 or 150 is a valid integer, but not a valid human age. We need to define reasonable bounds and deal with values outside them.

There are two common approaches: clip the values to the boundary (so 150 becomes the max valid age), or replace them with NaN and fill later. Clipping is simpler and preserves all your rows.
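
A sketch of the clipping approach, including the counting trick the exercise below relies on:

# Keep a copy so we can count how many values actually changed
original = df['age'].copy()

# Anything below 0 becomes 0; anything above 120 becomes 120
df['age'] = df['age'].clip(lower=0, upper=120)
print('Ages clipped:', (df['age'] != original).sum())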

Exercise 6: Handle Outliers
Write Code

Write a function handle_outliers(df) that:

1. Clips the age column to be between 0 and 120 (inclusive)

2. Clips the purchase_total column to be between 0 and 10000

3. Prints how many values were clipped in each column

4. Returns the cleaned DataFrame

To count clipped values, compare the original column to the clipped version before overwriting.


Step 7: Understanding the Pipeline Flow

Before we build the final pipeline, let's make sure you understand how the steps chain together. Each cleaning function takes a DataFrame and returns a cleaned version. The output of one step becomes the input to the next.
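
A sketch of that chaining, using the step functions you have already written (note that text is standardized before duplicates are removed, so near-duplicates like "alice smith" and "Alice Smith" will match):

def clean_data(df):
    df = handle_missing(df)     # Step 2: drop/fill gaps
    df = standardize_text(df)   # Step 5: normalize names and emails first...
    df = remove_duplicates(df)  # Step 3: ...so near-duplicates now match
    df = fix_data_types(df)     # Step 4: strings to numbers
    df = handle_outliers(df)    # Step 6: clip impossible values
    return df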

Exercise 7: Predict the Pipeline Output
Predict Output

What does this code print? Think about each step carefully.

import pandas as pd

data = {'name': ['  alice  ', 'BOB', '  alice  ', None],
        'score': [10, 20, 10, 30]}
df = pd.DataFrame(data)

# Step 1: Drop missing names
df = df.dropna(subset=['name'])

# Step 2: Standardize names
df['name'] = df['name'].str.strip().str.title()

# Step 3: Remove duplicates
df = df.drop_duplicates(subset=['name'])

print(len(df))
print(df['name'].tolist())

Step 8: Build the Complete Pipeline

Now it's time to combine all the cleaning steps into a single reusable function. A good pipeline function takes messy data in, applies each cleaning step in the right order, and returns clean data. The order matters: you should handle missing values before removing duplicates, and standardize text before checking for duplicates.

Exercise 8: The Complete Pipeline
Write Code

Write a function clean_data(df) that applies these steps in order:

1. Drop rows where name is missing

2. Standardize name: strip whitespace and convert to title case

3. Standardize email: strip whitespace and convert to lowercase

4. Remove duplicate rows based on name and email

5. Convert age to numeric (coerce errors), fill NaN with median, round and convert to int

6. Clip age between 0 and 120

7. Fill missing purchase_total with 0.0

8. Reset the index (drop=True)

9. Return the cleaned DataFrame

Print the final row count and the cleaned DataFrame.


What You Built

You now have a reusable data cleaning pipeline that handles the most common data quality issues:
Problem          What you did                        Tools
Missing values   Drop or fill gaps                   dropna(), fillna()
Duplicates       Remove repeated rows                drop_duplicates()
Data types       Convert strings to numbers          pd.to_numeric()
Text cleanup     Standardize casing and whitespace   .str.strip(), .str.title()
Outliers         Clip extreme values                 .clip()

In real projects, you'd extend this pattern by adding validation checks, logging what was cleaned, and handling domain-specific rules. But the core structure stays the same: a function that takes messy data in and returns clean data out.
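
For example, a post-cleaning validation step might look like this (a sketch; validate is a hypothetical helper, and the right checks depend on your domain):

def validate(df):
    # Fail loudly if the pipeline left any rule violated
    assert df['name'].notna().all(), 'names still missing after cleaning'
    assert df['age'].between(0, 120).all(), 'ages out of range'
    assert (df['purchase_total'] >= 0).all(), 'negative purchase totals'
    print(f'Validation passed: {len(df)} clean rows')
    return df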