Pandas GroupBy: Split-Apply-Combine for Powerful Aggregations
Imagine you have a giant pile of receipts from a store. You want to know how much each department earned last month. What do you do? You sort the receipts into piles by department, add up each pile, and write down the totals.
That is exactly what groupby() does in Pandas. It splits your data into groups based on some column, applies a calculation to each group, and combines the results back together. This three-step pattern is called split-apply-combine, and it is one of the most powerful ideas in data analysis.
In this tutorial, you'll learn how to group data, aggregate with multiple functions, use named aggregation, transform values within groups, filter entire groups, and work with multi-level grouping.
How Does GroupBy Work?
When you call df.groupby('column'), Pandas does not immediately compute anything. Instead, it creates a GroupBy object -- a lazy container that knows which rows belong to which group. The real work happens when you call an aggregation method like .sum(), .mean(), or .count().
The result is a Series with the group labels as the index. East had 200 + 300 + 180 = 680 in revenue, and West had 150 + 250 + 320 = 720.
How Do You Aggregate with Multiple Functions?
Often you need more than just a sum. You might want the mean, count, and max all at once. The .agg() method lets you pass a list of function names or even a dictionary that maps columns to different functions.
What Is GroupBy Transform?
Sometimes you don't want a summary table. You want to add a column to the original DataFrame that contains the group-level result for every row. For example, you want each row to show the average revenue of its region. That is what .transform() does -- it returns a Series with the same length as the original data.
How Do You Filter Entire Groups?
The .filter() method lets you keep or drop entire groups based on a condition. For example, you might want only regions that had more than two sales, or departments where the average score was above 80.
Class B had an average grade of 57.5, which is below 75, so every row from class B was removed. Classes A and C passed the filter and kept all their rows.
Grouping by Multiple Columns
You can pass a list of columns to groupby() to create subgroups. This is like sorting receipts first by department, then by payment method within each department.
Custom Group Logic with apply()
When built-in aggregations aren't enough, .apply() lets you run any function on each group. The function receives a DataFrame (the group) and can return a scalar, a Series, or a DataFrame.
Practice Exercises
Given a DataFrame df with columns department and salary, group by department and print the total salary for each department. The DataFrame is already created for you.
Print the result of the groupby sum (it should be a Series).
Use named aggregation to group df by region and compute:
total_revenue: sum of revenueavg_units: mean of unitsPrint the resulting DataFrame.
What will this code print? Think about how many rows each group has.
import pandas as pd
df = pd.DataFrame({'team': ['A', 'B', 'A', 'B', 'A'], 'score': [10, 20, 30, 40, 50]})
print(df.groupby('team')['score'].count().tolist())Add a new column called pct_of_dept to df that shows each employee's salary as a percentage of their department's total salary. Round to 1 decimal place.
Print df[['name', 'department', 'pct_of_dept']].to_string(index=False).
Filter df to keep only teams that have more than 2 members. Print the name column of the filtered DataFrame as a list.
Use groupby().filter() with a lambda.
Group df by both region and product, then compute the sum of revenue. Reset the index and print the resulting DataFrame using .to_string(index=False).
This code tries to add a column showing each row's deviation from its group mean, but it has a bug. Fix it so it works correctly.
The output should print the deviation column as a list, rounded to 1 decimal place each.