Pandas Data Cleaning: Missing Values, Duplicates, and Outliers
You just got a dataset from a survey. Some people skipped questions, one person typed "N/A" instead of leaving a field blank, there are three duplicate entries, and someone claims to be 900 years old. Welcome to real-world data.
It's often said that data scientists spend roughly 80% of their time cleaning data and only 20% actually analyzing it. The good news: pandas has built-in tools for every common cleaning task. In this tutorial, you'll learn how to detect and handle missing values, remove duplicates, fix data types, and deal with outliers.
How Do You Find Missing Values?
In pandas, missing data is represented as NaN (Not a Number). You can also create it explicitly with None or float('nan'). The key methods are isna() (returns True where values are missing) and notna() (returns True where values exist).
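Here's a minimal sketch with a small, made-up DataFrame so you can see both methods in action:

import pandas as pd
import numpy as np

# Hypothetical survey responses with a couple of gaps
df = pd.DataFrame({'age': [25, np.nan, 42], 'city': ['Oslo', 'Lima', None]})

print(df.isna())          # True where a value is missing
print(df.isna().sum())    # count of missing values per column
print(df['age'].notna())  # True where a value is present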
Should You Drop or Fill Missing Values?
dropna() removes rows (or columns) that contain missing values. Use it when missing data is rare and dropping rows won't bias your results.
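A quick sketch on the same kind of toy data — note that subset limits which columns dropna() looks at:

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 42], 'score': [88, 91, np.nan]})

print(df.dropna())                # drop rows with any missing value
print(df.dropna(axis=1))          # drop columns with any missing value
print(df.dropna(subset=['age']))  # only consider the 'age' column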
How Do You Fill in Missing Values?
fillna() replaces missing values with a value you choose. Common strategies: fill with zero, fill with the column mean, or fill with the most recent value (forward fill).
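A short sketch of the three strategies on a made-up 'score' column:

import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [88.0, np.nan, 91.0, np.nan]})

print(df['score'].fillna(0))                   # fill with zero
print(df['score'].fillna(df['score'].mean()))  # fill with the column mean
print(df['score'].ffill())                     # forward fill: carry the last seen value down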
How Do You Handle Duplicate Rows?
Duplicate rows sneak in from data entry errors, merging datasets, or buggy ETL pipelines. duplicated() flags them and drop_duplicates() removes them.
You can also check for duplicates based on specific columns. For example, if a customer name should be unique, you can pass subset=['name'] to check only that column.
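Here's a minimal sketch with hypothetical customer rows — the subset version treats two rows as duplicates whenever the name matches:

import pandas as pd

df = pd.DataFrame({'name': ['Ana', 'Ben', 'Ana'], 'order': [10, 20, 30]})

print(df.duplicated())                      # flags rows that repeat an earlier row exactly (none here)
print(df.drop_duplicates())                 # removes exact duplicate rows
print(df.drop_duplicates(subset=['name']))  # keeps only the first row per name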
How Do You Fix Inconsistent Data?
Real datasets are messy. Someone typed "Yes", someone else typed "yes", and a third person typed "Y". The replace() method standardizes these inconsistencies, and astype() converts columns stored as the wrong data type.
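A small sketch using a made-up 'answer' column and a numeric column stored as strings:

import pandas as pd

df = pd.DataFrame({'answer': ['Yes', 'yes', 'Y', 'No'], 'count': ['1', '2', '3', '4']})

# Map every variation onto one canonical label
df['answer'] = df['answer'].replace({'Yes': 'yes', 'Y': 'yes', 'No': 'no'})

# Numbers that were read in as strings become real integers
df['count'] = df['count'].astype(int)

print(df['answer'].unique())  # ['yes' 'no']
print(df.dtypes)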
How Do You Rename Columns?
Column names from raw data are often ugly — spaces, mixed case, abbreviations. Clean column names make your code readable.
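One sketch, assuming hypothetical raw column names like 'First Name' and 'AGE (years)':

import pandas as pd

df = pd.DataFrame({'First Name': ['Ana'], 'AGE (years)': [34]})

# Rename specific columns with a mapping...
df = df.rename(columns={'First Name': 'first_name', 'AGE (years)': 'age'})

# ...or normalize every name in one pass
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

print(df.columns.tolist())  # ['first_name', 'age']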
How Do You Spot Outliers?
An outlier is a data point far outside the normal range: a 900-year-old person, or a salary of -$500. Two common approaches are to flag values more than 2 or 3 standard deviations from the mean, or to use the IQR (interquartile range) method.
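Here's a sketch of both approaches on made-up ages, including the impossible 900:

import pandas as pd

df = pd.DataFrame({'age': [25, 31, 42, 900, 38, -5]})

# IQR method: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = df['age'].quantile(0.25), df['age'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(df['age'] < lower) | (df['age'] > upper)])  # the outliers: 900 and -5

# Or keep only values within 2 standard deviations of the mean
mean, std = df['age'].mean(), df['age'].std()
print(df[(df['age'] - mean).abs() <= 2 * std])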
How Do You Chain Cleaning Steps Together?
In practice, you rarely apply just one cleaning operation. A typical pipeline drops duplicates, fills missing values, fixes types, and renames columns — all in sequence. Pandas supports method chaining so you can do this in a readable flow.
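A sketch of such a pipeline, assuming a hypothetical raw export named raw:

import pandas as pd
import numpy as np

raw = pd.DataFrame({
    'Name ': ['Ana', 'Ana', 'Ben'],
    'Score': ['88', '88', np.nan],
})

clean = (
    raw
    .drop_duplicates()                            # 1. remove exact duplicates
    .rename(columns=lambda c: c.strip().lower())  # 2. tidy column names
    .assign(score=lambda d: pd.to_numeric(d['score']).fillna(0))  # 3. fix type, fill gaps
)
print(clean)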
Quick Recap: Data Cleaning Checklist
Every time you get a new dataset, run through this checklist:
df.isna().sum() — How much is missing? Drop or fill?
df.duplicated().sum() — Any exact duplicates?
df.dtypes — Are numbers stored as strings?
df['col'].unique() — Are there variations like Yes/yes/Y?
df.describe() — Do the min/max values make sense?

Practice Exercises
Count the total number of missing values across the entire DataFrame and print the result as an integer.
Fill missing values in the 'score' column with the column's mean. Print the score column as a list. The mean should be computed before filling.
Drop only the rows where the 'name' column is missing (don't drop rows that have missing values in other columns). Print the remaining names as a list.
Remove rows with duplicate 'email' values, keeping the first occurrence. Print the remaining emails as a list.
Standardize the 'answer' column so that all variations of yes ('Yes', 'Y', 'y') become 'yes' and all variations of no ('No', 'N', 'n') become 'no'. Print the cleaned answers as a list.
What will this code print?
import pandas as pd
df = pd.DataFrame({'val': ['1', '2', '3']})
df['val'] = df['val'].astype(int)
print(df['val'].sum())

Remove rows where the 'age' column has values less than 0 or greater than 120 (impossible ages). Print the remaining ages as a list.