Pandas merge(), join(), concat(): Combine DataFrames Like SQL
You have a table of customers and a separate table of orders. You need to match each order to the customer who placed it. If you've used SQL, you know this as a JOIN. In pandas, the same idea exists — it's called merge().
Combining data from multiple sources is one of the most important skills in data analysis. In this tutorial, you'll learn three ways to combine DataFrames: merge() for SQL-style joins, concat() for stacking DataFrames, and join() for index-based combinations.
How Does pd.merge() Work?
pd.merge() combines two DataFrames based on a shared column — just like a SQL JOIN. The default is an inner join, which keeps only the rows that have a match in both DataFrames.
Notice that Alice appears twice because she has two orders, and Charlie and Diana disappear because they have no orders. That's the inner join — only matches survive.
What Are the Different Join Types?
The how parameter controls which rows to keep. There are four options: inner (default), left, right, and outer.
What If the Column Names Don't Match?
Often the join key has a different name in each table — one calls it 'customer_id', the other calls it 'cust_id'. Use left_on and right_on instead of on.
How Do You Stack DataFrames with concat()?
While merge() combines DataFrames side by side (adding columns), concat() stacks them (adding rows). It's like appending one spreadsheet below another.
You can also concatenate horizontally (side by side) by setting axis=1. This is useful when you have separate DataFrames for different columns of the same observations.
How Do You Debug a Merge?
When a merge gives unexpected results, the indicator=True parameter adds a _merge column that tells you where each row came from: both, left_only, or right_only.
What Happens with Duplicate Column Names?
If both DataFrames have a column with the same name (that isn't the join key), pandas adds suffixes _x and _y to tell them apart. You can customize these with the suffixes parameter.
What Are One-to-Many and Many-to-Many Joins?
A one-to-many join is the most common: one customer has many orders. The customer row gets duplicated for each matching order. A many-to-many join happens when both sides have duplicate keys — and it creates a cartesian product of the matches.
Quick Recap: Choosing the Right Combine Method
Pandas gives you three ways to combine DataFrames. Here's when to use each:
Practice Exercises
Merge the students and grades DataFrames on 'student_id' using an inner join. Print the resulting names and grades as a list of lists using .values.tolist().
Perform a left join of customers and orders on 'customer_id'. Print the number of rows in the result using len().
Merge products and inventory where the key is called 'prod_id' in products and 'product_id' in inventory. Print the product names and their stock levels as a list of lists.
Concatenate q1 and q2 DataFrames vertically (stacking rows) with ignore_index=True. Print the shape of the result.
What will this code print?
import pandas as pd
left = pd.DataFrame({'id': [1, 2], 'a': ['x', 'y']})
right = pd.DataFrame({'id': [2, 3], 'b': ['p', 'q']})
result = pd.merge(left, right, on='id', how='outer')
print(len(result))After merging, this code tries to access a column called 'value_2023' but it doesn't exist because pandas used default suffixes. Fix the merge to use custom suffixes ('_2022', '_2023') and print the 'value_2023' column as a list.