Pandas Data Cleaning: Missing Values, Duplicates, and Outliers
You just got a dataset from a survey. Some people skipped questions, one person typed "N/A" instead of leaving a field blank, there are three duplicate entries, and someone claims to be 900 years old. Welcome to real-world data.
It's often said that data scientists spend roughly 80% of their time cleaning data and only 20% actually analyzing it. The good news: pandas has built-in tools for every common cleaning task. In this tutorial, you'll learn how to detect and handle missing values, remove duplicates, fix data types, and deal with outliers.
How Do You Find Missing Values?
In pandas, missing data is represented as NaN (Not a Number). You can also create it explicitly with None or float('nan'). The key methods are isna() (returns True where values are missing) and notna() (returns True where values exist).
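Here's a minimal sketch with a small, made-up DataFrame so you can see both methods in action:

import pandas as pd
import numpy as np

# Hypothetical survey responses with a couple of gaps
df = pd.DataFrame({'age': [25, np.nan, 42], 'city': ['Oslo', 'Lima', None]})

print(df.isna())          # True where a value is missing
print(df.isna().sum())    # count of missing values per column
print(df['age'].notna())  # True where a value is present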
Should You Drop or Fill Missing Values?
dropna() removes rows (or columns) that contain missing values. Use it when missing data is rare and dropping rows won't bias your results.
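A quick sketch on the same kind of toy data — note that subset limits which columns dropna() looks at:

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 42], 'score': [88, 91, np.nan]})

print(df.dropna())                # drop rows with any missing value
print(df.dropna(axis=1))          # drop columns with any missing value
print(df.dropna(subset=['age']))  # only consider the 'age' column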
How Do You Fill in Missing Values?
fillna() replaces missing values with a value you choose. Common strategies: fill with zero, fill with the column mean, or fill with the most recent value (forward fill).
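A short sketch of the three strategies on a made-up 'score' column:

import pandas as pd
import numpy as np

df = pd.DataFrame({'score': [88.0, np.nan, 91.0, np.nan]})

print(df['score'].fillna(0))                   # fill with zero
print(df['score'].fillna(df['score'].mean()))  # fill with the column mean
print(df['score'].ffill())                     # forward fill: carry the last seen value down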
How Do You Handle Duplicate Rows?
Duplicate rows sneak in from data entry errors, merging datasets, or buggy ETL pipelines. duplicated() flags them and drop_duplicates() removes them.
You can also check for duplicates based on specific columns. For example, if a customer name should be unique, you can pass subset=['name'] to check only that column.
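Here's a minimal sketch with hypothetical customer rows — the subset version treats two rows as duplicates whenever the name matches:

import pandas as pd

df = pd.DataFrame({'name': ['Ana', 'Ben', 'Ana'], 'order': [10, 20, 30]})

print(df.duplicated())                      # flags rows that repeat an earlier row exactly (none here)
print(df.drop_duplicates())                 # removes exact duplicate rows
print(df.drop_duplicates(subset=['name']))  # keeps only the first row per name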
How Do You Fix Inconsistent Data?
Real datasets are messy. Someone typed "Yes", someone else typed "yes", and a third person typed "Y". The replace() method standardizes these inconsistencies, and astype() converts columns stored as the wrong data type.
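A small sketch using a made-up 'answer' column and a numeric column stored as strings:

import pandas as pd

df = pd.DataFrame({'answer': ['Yes', 'yes', 'Y', 'No'], 'count': ['1', '2', '3', '4']})

# Map every variation onto one canonical label
df['answer'] = df['answer'].replace({'Yes': 'yes', 'Y': 'yes', 'No': 'no'})

# Numbers that were read in as strings become real integers
df['count'] = df['count'].astype(int)

print(df['answer'].unique())  # ['yes' 'no']
print(df.dtypes)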
How Do You Rename Columns?
Column names from raw data are often ugly — spaces, mixed case, abbreviations. Clean column names make your code readable.
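One sketch, assuming hypothetical raw column names like 'First Name' and 'AGE (years)':

import pandas as pd

df = pd.DataFrame({'First Name': ['Ana'], 'AGE (years)': [34]})

# Rename specific columns with a mapping...
df = df.rename(columns={'First Name': 'first_name', 'AGE (years)': 'age'})

# ...or normalize every name in one pass
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

print(df.columns.tolist())  # ['first_name', 'age']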
How Do You Spot Outliers?
An outlier is a data point far outside the normal range: a 900-year-old person, or a salary of -$500. Two common approaches are to flag values more than 2 or 3 standard deviations from the mean, or to use the IQR (interquartile range) method.
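Here's a sketch of both approaches on made-up ages, including the impossible 900:

import pandas as pd

df = pd.DataFrame({'age': [25, 31, 42, 900, 38, -5]})

# IQR method: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = df['age'].quantile(0.25), df['age'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(df['age'] < lower) | (df['age'] > upper)])  # the outliers: 900 and -5

# Or keep only values within 2 standard deviations of the mean
mean, std = df['age'].mean(), df['age'].std()
print(df[(df['age'] - mean).abs() <= 2 * std])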
How Do You Chain Cleaning Steps Together?
In practice, you rarely apply just one cleaning operation. A typical pipeline drops duplicates, fills missing values, fixes types, and renames columns — all in sequence. Pandas supports method chaining so you can do this in a readable flow.
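A sketch of such a pipeline, assuming a hypothetical raw export named raw:

import pandas as pd
import numpy as np

raw = pd.DataFrame({
    'Name ': ['Ana', 'Ana', 'Ben'],
    'Score': ['88', '88', np.nan],
})

clean = (
    raw
    .drop_duplicates()                            # 1. remove exact duplicates
    .rename(columns=lambda c: c.strip().lower())  # 2. tidy column names
    .assign(score=lambda d: pd.to_numeric(d['score']).fillna(0))  # 3. fix type, fill gaps
)
print(clean)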
Quick Recap: Data Cleaning Checklist
Every time you get a new dataset, run through this checklist:
df.isna().sum() — How much is missing? Drop or fill?
df.duplicated().sum() — Any exact duplicates?
df.dtypes — Are numbers stored as strings?
df['col'].unique() — Are there variations like Yes/yes/Y?
df.describe() — Do the min/max values make sense?

Practice Exercises
Count the total number of missing values across the entire DataFrame and print the result as an integer.
Fill missing values in the 'score' column with the column's mean. Print the score column as a list. The mean should be computed before filling.
Drop only the rows where the 'name' column is missing (don't drop rows that have missing values in other columns). Print the remaining names as a list.
Remove rows with duplicate 'email' values, keeping the first occurrence. Print the remaining emails as a list.
Standardize the 'answer' column so that all variations of yes ('Yes', 'Y', 'y') become 'yes' and all variations of no ('No', 'N', 'n') become 'no'. Print the cleaned answers as a list.
What will this code print?
import pandas as pd
df = pd.DataFrame({'val': ['1', '2', '3']})
df['val'] = df['val'].astype(int)
print(df['val'].sum())

Remove rows where the 'age' column has values less than 0 or greater than 120 (impossible ages). Print the remaining ages as a list.