Grouping and aggregating data in Pandas means splitting a DataFrame into groups based on one or more columns, applying a summary function (like sum, mean, or count) to each group, and combining the results into a new table. The groupby() method is the primary tool for this workflow, enabling you to answer questions like “What is the average salary per department?” or “What are the total monthly sales by region?” in just one or two lines of code.
Introduction: Why Grouping and Aggregation Are Central to Data Analysis
Imagine you have a dataset with one million sales transactions recorded over three years, each row representing a single purchase with columns for date, product category, region, sales amount, and customer ID. Raw transaction-level data like this is almost impossible to interpret on its own. What you really want to know is the total revenue per region, the average order value per product category, the number of transactions per month, or which customer segment generates the most profit.
Answering these kinds of questions requires aggregation — summarizing many individual data points into a smaller number of meaningful statistics. And aggregation almost always follows grouping — partitioning the data into subsets based on one or more variables before computing the summary. This split-apply-combine pattern is one of the most fundamental operations in all of data analysis, from exploratory data analysis to reporting dashboards to feature engineering for machine learning.
Pandas provides an exceptionally powerful and flexible toolset for grouping and aggregating data, centered around the groupby() method. Once you master groupby(), you will find yourself reaching for it constantly — it is the analytical backbone of most real-world Pandas workflows. This article will take you from the basic mechanics of groupby() through advanced techniques including multiple aggregations, custom functions, named aggregations, transformations, filtering, and pivot tables.
1. The Split-Apply-Combine Pattern
Before diving into code, it helps to understand the conceptual model that Pandas groupby() is built on. Statistician Hadley Wickham formalized this pattern in his influential 2011 paper, “The Split-Apply-Combine Strategy for Data Analysis,” and it applies universally across data manipulation tools:
Split: Divide the DataFrame into groups based on the unique values of one or more columns (the grouping keys).
Apply: Apply a function independently to each group. The function might compute a summary statistic (aggregation), transform each value within the group, or filter the group based on some condition.
Combine: Collect the results from each group and combine them into a single output DataFrame or Series.
Here is a visual illustration of the pattern:
Original DataFrame:
Department Salary
0 Engineering 85000
1 Marketing 72000
2 Engineering 78000
3 HR 65000
4 Marketing 68000
5 Engineering 92000
SPLIT by Department:
Engineering: [85000, 78000, 92000]
Marketing: [72000, 68000]
HR: [65000]
APPLY mean():
Engineering: 85000.00
Marketing: 70000.00
HR: 65000.00
COMBINE:
Department Salary
Engineering 85000.00
HR 65000.00
Marketing 70000.00
This three-step mental model applies to every groupby() operation you will ever write, no matter how complex.
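To make the three steps concrete, here is a minimal sketch that performs each step manually on the same six rows as the illustration and then confirms that a single groupby() call gives the same values:
import pandas as pd
# The same six rows as the illustration above
df = pd.DataFrame({
    'Department': ['Engineering', 'Marketing', 'Engineering', 'HR', 'Marketing', 'Engineering'],
    'Salary': [85000, 72000, 78000, 65000, 68000, 92000]
})
# SPLIT: one sub-DataFrame per department (groupby does this lazily for you)
groups = {dept: sub for dept, sub in df.groupby('Department')}
# APPLY: compute the mean salary of each sub-DataFrame
means = {dept: sub['Salary'].mean() for dept, sub in groups.items()}
# COMBINE: collect the per-group results into a single Series
manual_result = pd.Series(means, name='Salary')
print(manual_result)
# The one-liner below performs all three steps at once and gives the same values
print(df.groupby('Department')['Salary'].mean())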
2. Getting Started with groupby()
2.1 Creating Your First GroupBy Object
The groupby() method returns a GroupBy object — not a DataFrame. This object holds the instructions for how to split the data, but it has not yet performed any computation. Computation happens when you call an aggregation method on the GroupBy object.
import pandas as pd
import numpy as np
# Sample employee dataset
employees = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Edward', 'Fiona', 'George'],
'Department': ['Engineering', 'Marketing', 'Engineering', 'HR',
'Marketing', 'Engineering', 'HR'],
'Gender': ['F', 'M', 'M', 'F', 'M', 'F', 'M'],
'Salary': [85000, 72000, 78000, 65000, 68000, 92000, 70000],
'Years_Exp': [4, 7, 3, 8, 5, 6, 9],
'Remote': [True, False, True, True, False, True, False]
})
# Create a GroupBy object
grouped = employees.groupby('Department')
print(type(grouped))
# <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
# The GroupBy object is lazy — call an aggregation to get results
print(grouped['Salary'].mean())
# Department
# Engineering 85000.000000
# HR 67500.000000
# Marketing 70000.000000
# Name: Salary, dtype: float64
Notice that the result is a Series whose index is the unique department names. This is the standard output format for a single-column aggregation after a groupby().
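If you select a list of columns instead of a single column, the aggregation returns a DataFrame with one column per selected field. A quick sketch using the same grouped object:
# Selecting a list of columns yields a DataFrame rather than a Series
print(grouped[['Salary', 'Years_Exp']].mean())
#               Salary  Years_Exp
# Department
# Engineering  85000.0   4.333333
# HR           67500.0   8.500000
# Marketing    70000.0   6.000000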
2.2 Inspecting a GroupBy Object
Before computing aggregations, it can be helpful to inspect the groups themselves:
# See which row indices belong to each group
print(grouped.groups)
# {'Engineering': [0, 2, 5], 'HR': [3, 6], 'Marketing': [1, 4]}
# Count the number of rows in each group
print(grouped.size())
# Department
# Engineering 3
# HR 2
# Marketing 2
# dtype: int64
# Get a specific group as a DataFrame
eng_group = grouped.get_group('Engineering')
print(eng_group)
# Name Department Gender Salary Years_Exp Remote
# 0 Alice Engineering F 85000 4 True
# 2 Charlie Engineering M 78000 3 True
# 5 Fiona Engineering F 92000 6 True
2.3 Common Single-Function Aggregations
Pandas GroupBy objects expose all common aggregation functions directly as methods:
# Mean salary per department
print(employees.groupby('Department')['Salary'].mean())
# Total salary budget per department
print(employees.groupby('Department')['Salary'].sum())
# Maximum salary per department
print(employees.groupby('Department')['Salary'].max())
# Minimum salary per department
print(employees.groupby('Department')['Salary'].min())
# Count of employees per department
print(employees.groupby('Department')['Salary'].count())
# Standard deviation of salary per department
print(employees.groupby('Department')['Salary'].std())
# Median salary per department
print(employees.groupby('Department')['Salary'].median())
Each of these returns a Series indexed by the grouping column. The count() function counts non-null values (useful for detecting missing data within groups), while size() on the GroupBy object counts all rows including nulls.
3. Grouping by Multiple Columns
One of the most powerful features of groupby() is the ability to group by more than one column simultaneously, creating a hierarchical structure of groups.
# Group by both Department and Gender
multi_grouped = employees.groupby(['Department', 'Gender'])
print(multi_grouped['Salary'].mean())
# Department Gender
# Engineering F 88500.0
# M 78000.0
# HR F 65000.0
# M 70000.0
# Marketing M 70000.0
# Name: Salary, dtype: float64
The result has a MultiIndex — a hierarchical index with Department at the outer level and Gender at the inner level. This is a natural representation of grouped data at two levels.
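Selecting from a MultiIndexed result takes a little practice. Here is a short sketch showing how to pull out an outer level or a single (Department, Gender) combination:
dept_gender_mean = multi_grouped['Salary'].mean()
# Select the outer level: all Engineering rows
print(dept_gender_mean['Engineering'])
# Gender
# F    88500.0
# M    78000.0
# Name: Salary, dtype: float64
# Select a single combination with a tuple
print(dept_gender_mean[('HR', 'F')])   # 65000.0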
3.1 Resetting the Index After Groupby
Often you want the grouping columns to be regular columns in the result rather than part of the index. Use reset_index() for this:
result = employees.groupby(['Department', 'Gender'])['Salary'].mean().reset_index()
print(result)
# Department Gender Salary
# 0 Engineering F 88500.0
# 1 Engineering M 78000.0
# 2 HR F 65000.0
# 3 HR M 70000.0
# 4 Marketing M 70000.0
Now the result is a clean, flat DataFrame that is easy to use for further analysis or visualization.
3.2 The as_index Parameter
Alternatively, you can pass as_index=False directly to groupby() to get a flat DataFrame without a subsequent reset_index() call:
result = employees.groupby('Department', as_index=False)['Salary'].mean()
print(result)
# Department Salary
# 0 Engineering 85000.0
# 1 HR 67500.0
# 2 Marketing 70000.0
This is a convenient shorthand when you know in advance that you want a flat result.
4. Multiple Aggregations with agg()
Computing a single summary statistic per group is useful, but real analysis usually requires multiple statistics at once. The agg() method (short for aggregate) lets you apply multiple functions simultaneously.
4.1 Applying Multiple Functions to One Column
# Multiple statistics for salary per department
salary_stats = employees.groupby('Department')['Salary'].agg(
['mean', 'median', 'std', 'min', 'max', 'count']
)
print(salary_stats)
# mean median std min max count
# Department
# Engineering 85000 85000 7000.000000 78000 92000 3
# HR 67500 67500 3535.533906 65000 70000 2
# Marketing 70000 70000 2828.427125 68000 72000 2
4.2 Applying Different Functions to Different Columns
This is where agg() truly shines. You can specify a dictionary mapping each column to its aggregation function(s):
result = employees.groupby('Department').agg({
'Salary': ['mean', 'max'],
'Years_Exp': ['mean', 'min'],
'Remote': 'sum' # Count of remote workers (True = 1)
})
print(result)
# Salary Years_Exp Remote
# mean max mean min sum
# Department
# Engineering 85000 92000 4.333333 3 3
# HR 67500 70000 8.500000 8 1
# Marketing 70000 72000 6.000000 5 0
The result has a MultiIndex column structure — a hierarchy of (column name, function name). This can be useful but is sometimes cumbersome to work with. Named aggregations solve this.
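Alternatively, if you want to keep the dictionary style, one common idiom is to flatten the MultiIndex columns by joining the two levels into single names (a sketch; the joined names are just one possible convention):
# Flatten ('Salary', 'mean') into 'Salary_mean', and so on
result.columns = ['_'.join(col) for col in result.columns]
print(result.columns.tolist())
# ['Salary_mean', 'Salary_max', 'Years_Exp_mean', 'Years_Exp_min', 'Remote_sum']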
4.3 Named Aggregations with agg() — The Modern Approach
Introduced in Pandas 0.25, named aggregations give each result column a custom name directly within the agg() call, producing a clean flat DataFrame:
result = employees.groupby('Department').agg(
avg_salary = ('Salary', 'mean'),
max_salary = ('Salary', 'max'),
avg_exp = ('Years_Exp', 'mean'),
remote_count = ('Remote', 'sum'),
headcount = ('Name', 'count')
)
print(result)
# avg_salary max_salary avg_exp remote_count headcount
# Department
# Engineering 85000.0 92000 4.333333 3 3
# HR 67500.0 70000 8.500000 1 2
# Marketing 70000.0 72000 6.000000 0 2
Named aggregations are the recommended modern approach. The syntax new_column_name=('source_column', 'function') is clean, readable, and produces intuitive column names without needing to rename columns afterward.
4.4 Using Custom Functions in agg()
Beyond built-in aggregation functions like 'mean' and 'sum', you can pass any Python function — including lambda functions or custom-defined functions:
# Custom aggregation: salary range (max - min)
def salary_range(series):
return series.max() - series.min()
result = employees.groupby('Department').agg(
avg_salary = ('Salary', 'mean'),
salary_range = ('Salary', salary_range),
pct_remote = ('Remote', lambda x: x.mean() * 100) # % remote
)
print(result)
# avg_salary salary_range pct_remote
# Department
# Engineering 85000.0 14000.0 100.000000
# HR 67500.0 5000.0 50.000000
# Marketing 70000.0 4000.0 0.000000
Custom functions receive each group’s column data as a Series and should return a single scalar value.
5. The transform() Method: Group-Aware Transformations
While agg() reduces each group to a single summary value, transform() returns a result with the same shape as the original DataFrame — one value per row, computed within each group. This makes transform() ideal for feature engineering and for adding group-level statistics as new columns.
5.1 Adding Group Means Back to the DataFrame
# Add department average salary as a new column
employees['dept_avg_salary'] = employees.groupby('Department')['Salary'].transform('mean')
print(employees[['Name', 'Department', 'Salary', 'dept_avg_salary']])
# Name Department Salary dept_avg_salary
# 0 Alice Engineering 85000 85000.000
# 1 Bob Marketing 72000 70000.000
# 2 Charlie Engineering 78000 85000.000
# 3 Diana HR 65000 67500.000
# 4 Edward Marketing 68000 70000.000
# 5 Fiona Engineering 92000 85000.000
# 6 George HR 70000 67500.000
Each employee’s row now shows their department’s average salary, perfectly aligned with the original index.
5.2 Computing Salary Deviation from Department Mean
This is a classic feature engineering technique — measuring how much each individual deviates from their group’s central tendency:
employees['salary_deviation'] = (
employees['Salary'] - employees.groupby('Department')['Salary'].transform('mean')
)
print(employees[['Name', 'Department', 'Salary', 'salary_deviation']])
# Name Department Salary salary_deviation
# 0 Alice Engineering 85000 0.0
# 1 Bob Marketing 72000 2000.0
# 2 Charlie Engineering 78000 -7000.0
# 3 Diana HR 65000 -2500.0
# 4 Edward Marketing 68000 -2000.0
# 5 Fiona Engineering 92000 7000.0
# 6 George HR 70000 2500.0
Positive values indicate above-average salaries within the department; negative values indicate below-average. This kind of within-group normalization is extremely common in machine learning feature engineering.
5.3 Ranking Within Groups
The rank() method on a GroupBy object behaves like a transformation: it returns one value per original row, aligned with the original index. Here it computes each employee’s salary rank within their department:
employees['dept_salary_rank'] = employees.groupby('Department')['Salary'].rank(
ascending=False, method='dense'
)
print(employees[['Name', 'Department', 'Salary', 'dept_salary_rank']])
# Name Department Salary dept_salary_rank
# 0 Alice Engineering 85000 2.0
# 1 Bob Marketing 72000 1.0
# 2 Charlie Engineering 78000 3.0
# 3 Diana HR 65000 2.0
# 4 Edward Marketing 68000 2.0
# 5 Fiona Engineering 92000 1.0
# 6 George HR 70000 1.0
5.4 Group-Aware Normalization (Z-Score)
Z-score normalization within groups (also called standardization) is a powerful preprocessing technique that puts all features on a comparable scale while accounting for group differences:
from scipy import stats
# Z-score normalization within each department
employees['salary_zscore'] = employees.groupby('Department')['Salary'].transform(
lambda x: stats.zscore(x, ddof=1) if len(x) > 1 else 0
)
print(employees[['Name', 'Department', 'Salary', 'salary_zscore']].round(3))
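If you would rather avoid the SciPy dependency, the same within-group z-score can be computed with plain Pandas. A sketch that relies on std() defaulting to ddof=1, matching the call above (the salary_zscore_pd column name is purely illustrative):
# Plain-Pandas equivalent; groups of size 1 get 0 to avoid dividing by NaN
employees['salary_zscore_pd'] = employees.groupby('Department')['Salary'].transform(
    lambda x: (x - x.mean()) / x.std() if len(x) > 1 else 0
)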
6. The filter() Method: Keeping or Removing Entire Groups
The filter() method allows you to keep or discard entire groups based on a condition applied to each group. Unlike boolean indexing on individual rows, filter() operates at the group level — if the condition is True for a group, all rows in that group are kept; if False, the entire group is dropped.
# Keep only departments with average salary above 75,000
high_paying_depts = employees.groupby('Department').filter(
lambda group: group['Salary'].mean() > 75000
)
print(high_paying_depts[['Name', 'Department', 'Salary']])
# Name Department Salary
# 0 Alice Engineering 85000
# 2 Charlie Engineering 78000
# 5 Fiona Engineering 92000
Only Engineering employees are kept (average salary $85,000), while HR ($67,500) and Marketing ($70,000) are excluded.
# Keep only departments with 3 or more employees
large_depts = employees.groupby('Department').filter(
lambda group: len(group) >= 3
)
print(large_depts)
# Only Engineering rows (3 employees); HR and Marketing each have 2
# Keep departments where at least one employee has 8+ years of experience
exp_depts = employees.groupby('Department').filter(
lambda group: group['Years_Exp'].max() >= 8
)
print(exp_depts[['Name', 'Department', 'Years_Exp']])
# HR (George: 9 years, Diana: 8 years) meets the condition
# Engineering and Marketing do not
The filter() method is particularly useful for removing small or unrepresentative groups from your analysis — for example, filtering out product categories with fewer than 10 orders before computing average order values.
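Here is a small self-contained sketch of that idea, using a hypothetical orders table (the column names, data, and threshold are all illustrative):
import pandas as pd
import numpy as np
# Hypothetical orders table: the 'Books' category has only a handful of orders
rng = np.random.default_rng(0)
orders = pd.DataFrame({
    'category': ['Electronics'] * 30 + ['Clothing'] * 25 + ['Books'] * 4,
    'order_value': rng.normal(60, 15, 59).round(2)
})
# Drop categories with fewer than 10 orders, then compute average order value
popular = orders.groupby('category').filter(lambda g: len(g) >= 10)
print(popular.groupby('category')['order_value'].mean().round(2))
# 'Books' is excluded; only Electronics and Clothing remain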
7. Iterating Over Groups
Sometimes you need to process each group individually — for example, to apply a complex transformation, generate a separate report per group, or train a separate model per segment. The GroupBy object is iterable, yielding (group_name, group_dataframe) tuples:
for dept_name, dept_df in employees.groupby('Department'):
avg = dept_df['Salary'].mean()
top_earner = dept_df.loc[dept_df['Salary'].idxmax(), 'Name']
print(f"\n{dept_name}:")
print(f" Employees: {len(dept_df)}")
print(f" Avg Salary: ${avg:,.0f}")
print(f" Top Earner: {top_earner}")
# Engineering:
# Employees: 3
# Avg Salary: $85,000
# Top Earner: Fiona
# HR:
# Employees: 2
# Avg Salary: $67,500
# Top Earner: George
# Marketing:
# Employees: 2
# Avg Salary: $70,000
# Top Earner: Bob
While iteration is flexible, it is generally slower than vectorized agg() or transform() operations. Reserve it for cases where you genuinely need to process each group as an independent DataFrame.
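One task that genuinely calls for iteration is writing each group out as a separate file. A brief sketch (the file names are illustrative):
# Export one CSV per department
for dept_name, dept_df in employees.groupby('Department'):
    dept_df.to_csv(f"report_{dept_name.lower()}.csv", index=False)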
8. Grouping with Time-Based Data
Grouping time series data is one of the most common real-world applications of Pandas groupby(). You can group by components of a datetime column (year, month, day, weekday) or use the resample() method for time-based aggregation.
8.1 Grouping by Date Components
import pandas as pd
import numpy as np
# Simulate daily sales data over 2 years
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2024-12-31', freq='D')
sales_df = pd.DataFrame({
'date': dates,
'sales': np.random.normal(1000, 200, len(dates)).round(2),
'region': np.random.choice(['North', 'South', 'East', 'West'], len(dates)),
'category': np.random.choice(['Electronics', 'Clothing', 'Food'], len(dates))
})
# Extract date components
sales_df['year'] = sales_df['date'].dt.year
sales_df['month'] = sales_df['date'].dt.month
sales_df['weekday'] = sales_df['date'].dt.day_name()
sales_df['quarter'] = sales_df['date'].dt.quarter
# Monthly average sales
monthly_avg = sales_df.groupby(['year', 'month'])['sales'].mean().round(2)
print(monthly_avg.head(6))
# year month
# 2023 1 1005.47
# 2 997.83
# 3 1001.22
# ...
# Sales by day of week (which day generates most revenue?)
dow_sales = sales_df.groupby('weekday')['sales'].agg(
avg_sales='mean',
total_sales='sum',
count='count'
).round(2)
print(dow_sales.sort_values('avg_sales', ascending=False))
8.2 Using resample() for Time-Based Grouping
When your DataFrame has a DatetimeIndex, resample() is a more elegant alternative to groupby() for time-based aggregation:
# Set date as index
sales_ts = sales_df.set_index('date')['sales']
# Resample to monthly total
monthly_total = sales_ts.resample('ME').sum()
print(monthly_total.head())
# date
# 2023-01-31 30726.84
# 2023-02-28 27938.64
# 2023-03-31 31042.18
# ...
# Resample to quarterly mean
quarterly_mean = sales_ts.resample('QE').mean()
print(quarterly_mean)
# Resample to weekly and compute multiple stats
weekly_stats = sales_ts.resample('W').agg(['mean', 'sum', 'count'])
print(weekly_stats.head())
resample() uses time-frequency aliases: 'D' for daily, 'W' for weekly, 'ME' for month-end, 'QE' for quarter-end, 'YE' for year-end. It is the go-to tool for time series aggregation.
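When you need to mix time-based grouping with an ordinary column, pd.Grouper lets you do both inside a single groupby() call. A short sketch using the sales_df built above:
# Monthly total sales per region: a time grouper combined with a regular column
monthly_by_region = sales_df.groupby(
    [pd.Grouper(key='date', freq='ME'), 'region']
)['sales'].sum().round(2)
print(monthly_by_region.head(8))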
9. Pivot Tables: A Different View of Grouped Data
A pivot table is a two-dimensional summary table where the rows represent one grouping variable, the columns represent another grouping variable, and the cell values contain an aggregated statistic. Pivot tables are what most business analysts mean when they talk about “cross-tabulation” or “cross-tab” analysis.
9.1 Creating Pivot Tables with pivot_table()
import pandas as pd
import numpy as np
# Sales by region and category
np.random.seed(42)
n = 500
sales = pd.DataFrame({
'region': np.random.choice(['North', 'South', 'East', 'West'], n),
'category': np.random.choice(['Electronics', 'Clothing', 'Food'], n),
'sales': np.random.normal(500, 150, n).round(2),
'units': np.random.randint(1, 20, n)
})
# Pivot: rows=region, columns=category, values=mean sales
pivot = sales.pivot_table(
values='sales',
index='region',
columns='category',
aggfunc='mean'
).round(2)
print(pivot)
# category Clothing Electronics Food
# region
# East 499.61 498.34 500.12
# North 507.23 495.18 502.44
# South 493.87 510.76 498.93
# West 501.45 503.22 497.56
9.2 Pivot Tables with Multiple Values and Margins
# Multiple aggregation functions and adding row/column totals
pivot_full = sales.pivot_table(
values=['sales', 'units'],
index='region',
columns='category',
aggfunc={'sales': 'sum', 'units': 'mean'},
margins=True, # Add row and column totals/averages
margins_name='Total' # Label for the totals row/column
).round(2)
print(pivot_full)
The margins=True option adds a “Total” row and column that aggregates across all groups, similar to the “Grand Total” in a spreadsheet pivot table.
9.3 Cross-Tabulation with crosstab()
pd.crosstab() is a specialized function for counting the frequency of combinations of categorical values — essentially a pivot table with aggfunc='count':
# Count employees by department and gender
crosstab = pd.crosstab(employees['Department'], employees['Gender'])
print(crosstab)
# Gender F M
# Department
# Engineering 2 1
# HR 1 1
# Marketing 0 2
# With row and column percentages
crosstab_pct = pd.crosstab(
employees['Department'],
employees['Gender'],
normalize='index' # Normalize each row to sum to 1
).round(3)
print(crosstab_pct)
# Gender F M
# Department
# Engineering 0.667 0.333
# HR 0.500 0.500
# Marketing 0.000 1.000
10. Practical Comparison Table: GroupBy Methods
Understanding when to use each method is crucial. Here is a comprehensive comparison:
| Method | Returns | Shape | Primary Use Case | Example |
|---|---|---|---|---|
| groupby().agg() | DataFrame or Series | Reduced (one row per group) | Summary statistics per group | Total sales per region |
| groupby().transform() | Series or DataFrame | Same as original | Add group stats as new columns | Add group mean to each row |
| groupby().filter() | DataFrame | Subset of original rows | Remove entire groups by condition | Keep only large departments |
| groupby().apply() | Flexible | Depends on function | Complex custom group-level logic | Fit model per group |
| pivot_table() | DataFrame | 2D summary table | Cross-tabular summaries | Sales by region and category |
| crosstab() | DataFrame | 2D frequency table | Frequency of categorical combinations | Count by gender and department |
| resample() | Series or DataFrame | Time-resampled | Time series aggregation | Monthly average from daily data |
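The one method in the table that has not been demonstrated yet is groupby().apply(), which passes each group's sub-DataFrame to an arbitrary function and stitches the results back together. A minimal sketch using the employees DataFrame from earlier (recent Pandas versions may warn that the grouping columns are passed to the function):
# apply(): run an arbitrary per-group function that returns a Series of results
def top_earner_summary(g):
    g = g.sort_values('Salary', ascending=False)
    gap = g['Salary'].iloc[0] - (g['Salary'].iloc[1] if len(g) > 1 else 0)
    return pd.Series({'top_earner': g['Name'].iloc[0], 'salary_gap': gap})
print(employees.groupby('Department').apply(top_earner_summary))
#             top_earner  salary_gap
# Department
# Engineering      Fiona        7000
# HR              George        5000
# Marketing          Bob        4000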
11. Real-World Example: Retail Sales Analysis
Let us apply everything from this article to a realistic end-to-end analysis of a retail sales dataset:
import pandas as pd
import numpy as np
# Simulate retail transactions
np.random.seed(123)
n = 2000
df = pd.DataFrame({
'transaction_id': range(1, n + 1),
'date': pd.date_range('2024-01-01', periods=n, freq='3h'),
'customer_id': np.random.randint(1001, 1201, n),
'region': np.random.choice(['North', 'South', 'East', 'West'], n,
p=[0.30, 0.25, 0.25, 0.20]),
'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Sports'], n,
p=[0.25, 0.30, 0.30, 0.15]),
'units_sold': np.random.randint(1, 15, n),
'unit_price': np.random.choice([9.99, 24.99, 49.99, 99.99, 199.99], n),
'discount_pct': np.random.choice([0, 0.05, 0.10, 0.20], n, p=[0.5, 0.2, 0.2, 0.1])
})
# Feature engineering
df['revenue'] = (df['units_sold'] * df['unit_price'] * (1 - df['discount_pct'])).round(2)
df['month'] = df['date'].dt.to_period('M')
df['quarter'] = df['date'].dt.to_period('Q')
df['day_of_week']= df['date'].dt.day_name()
print(f"Dataset: {df.shape[0]:,} transactions | {df['revenue'].sum():,.2f} total revenue")
print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}")
# ── Analysis 1: Revenue summary by region ────────────────────────────────────
print("\n=== Revenue by Region ===")
region_summary = df.groupby('region').agg(
transactions = ('transaction_id', 'count'),
total_revenue = ('revenue', 'sum'),
avg_revenue = ('revenue', 'mean'),
avg_discount = ('discount_pct', 'mean')
).round(2)
region_summary['revenue_share_%'] = (
region_summary['total_revenue'] / region_summary['total_revenue'].sum() * 100
).round(1)
print(region_summary.sort_values('total_revenue', ascending=False))
# ── Analysis 2: Category performance ────────────────────────────────────────
print("\n=== Category Performance ===")
category_stats = df.groupby('category').agg(
total_units = ('units_sold', 'sum'),
total_revenue = ('revenue', 'sum'),
avg_price = ('unit_price', 'mean'),
transactions = ('transaction_id', 'count')
).round(2)
category_stats['revenue_per_unit'] = (
category_stats['total_revenue'] / category_stats['total_units']
).round(2)
print(category_stats.sort_values('total_revenue', ascending=False))
# ── Analysis 3: Region × Category cross-tab ─────────────────────────────────
print("\n=== Revenue by Region and Category (Pivot Table) ===")
revenue_pivot = df.pivot_table(
values='revenue',
index='region',
columns='category',
aggfunc='sum',
margins=True,
margins_name='Total'
).round(0)
print(revenue_pivot)
# ── Analysis 4: Customer analysis with transform ─────────────────────────────
print("\n=== Customer Spending Analysis ===")
df['customer_total_spend'] = df.groupby('customer_id')['revenue'].transform('sum')
df['customer_avg_spend'] = df.groupby('customer_id')['revenue'].transform('mean')
df['customer_txn_count'] = df.groupby('customer_id')['revenue'].transform('count')
# Classify customers by total spend
spend_quantiles = df.groupby('customer_id')['revenue'].sum().quantile([0.33, 0.67])
low_thresh, high_thresh = spend_quantiles[0.33], spend_quantiles[0.67]
customer_summary = df.groupby('customer_id').agg(
total_spend = ('revenue', 'sum'),
avg_order = ('revenue', 'mean'),
num_orders = ('transaction_id', 'count'),
top_category = ('category', lambda x: x.mode()[0])
).round(2)
customer_summary['segment'] = pd.cut(
customer_summary['total_spend'],
bins=[0, low_thresh, high_thresh, float('inf')],
labels=['Low', 'Medium', 'High']
)
print(customer_summary.head(10))
print(f"\nCustomer segments:\n{customer_summary['segment'].value_counts()}")
# ── Analysis 5: Day-of-week revenue pattern ──────────────────────────────────
print("\n=== Revenue by Day of Week ===")
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_analysis = df.groupby('day_of_week').agg(
avg_revenue = ('revenue', 'mean'),
total_revenue = ('revenue', 'sum'),
transactions = ('transaction_id', 'count')
).round(2)
# Reorder by day of week
dow_analysis = dow_analysis.reindex(
[d for d in day_order if d in dow_analysis.index]
)
print(dow_analysis)
# ── Analysis 6: Filter — focus on high-value categories ──────────────────────
print("\n=== High-Value Categories (avg revenue > 100) ===")
high_value = df.groupby('category').filter(
lambda g: g['revenue'].mean() > 100
)
print(f"Transactions in high-value categories: {len(high_value):,}")
print(f"Categories kept: {high_value['category'].unique()}")This comprehensive analysis demonstrates the full power of Pandas grouping and aggregation: regional breakdowns, category performance rankings, customer segmentation, pivot table cross-tabulations, day-of-week pattern analysis, and group filtering — all flowing naturally from a single transaction-level DataFrame.
12. Performance Tips for Large Datasets
When working with DataFrames of millions of rows, the efficiency of your groupby() operations becomes important.
12.1 Use Categorical Dtype for Grouping Columns
If your grouping columns have a small number of unique values (like region, category, department), converting them to category dtype dramatically reduces memory usage and speeds up groupby() operations:
df['region'] = df['region'].astype('category')
df['category'] = df['category'].astype('category')
# GroupBy on categorical columns is significantly faster
result = df.groupby(['region', 'category'])['revenue'].sum()
12.2 Use observed=True for Categorical GroupBy
When grouping by categorical columns, Pandas by default includes all possible categories even if some combinations do not appear in the data. Use observed=True to include only categories that actually appear:
result = df.groupby(['region', 'category'], observed=True)['revenue'].sum()
12.3 Prefer Built-in Functions Over Custom Lambdas
Built-in aggregation functions like 'mean', 'sum', 'count' are implemented in optimized Cython/C code and are significantly faster than equivalent Python lambda functions. Reserve lambdas for computations that genuinely cannot be expressed with built-in functions.
# Faster: built-in
df.groupby('region')['revenue'].mean()
# Slower: equivalent lambda
df.groupby('region')['revenue'].agg(lambda x: x.mean())
12.4 Avoid iterrows() in Group Processing
If you find yourself iterating over groups with a for loop and then using iterrows() inside the loop, step back and ask whether the same result can be achieved with transform() or agg(). Nested iteration is almost always a sign that a vectorized solution exists and should be used instead.
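As a concrete illustration of the kind of rewrite to look for, here is a sketch that flags above-department-average earners first with nested loops and then with a single transform() call, using the employees DataFrame from earlier:
# Anti-pattern: nested iteration over groups and rows
flags = {}
for _, dept_df in employees.groupby('Department'):
    dept_avg = dept_df['Salary'].mean()
    for idx, row in dept_df.iterrows():
        flags[idx] = row['Salary'] > dept_avg
above_avg_slow = pd.Series(flags).sort_index()
# Vectorized equivalent: one transform() call, no Python-level loops
above_avg_fast = employees['Salary'] > employees.groupby('Department')['Salary'].transform('mean')
print((above_avg_slow == above_avg_fast).all())  # True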
13. Common Mistakes and How to Avoid Them
13.1 Forgetting to Call an Aggregation Function
A very common beginner mistake is printing the GroupBy object itself and wondering why it does not show data:
# WRONG — this just shows the GroupBy object description
grouped = employees.groupby('Department')
print(grouped) # <DataFrameGroupBy object at 0x...>
# CORRECT — call an aggregation method
print(grouped['Salary'].mean())
13.2 Confusing count() and size()
These two methods seem similar but behave differently:
df_with_nan = pd.DataFrame({
'dept': ['Eng', 'Eng', 'HR', 'HR'],
'sal': [80000, np.nan, 65000, 70000]
})
# count() counts NON-NULL values per column
print(df_with_nan.groupby('dept')['sal'].count())
# dept
# Eng 1 ← only 1 non-null value
# HR 2
# size() counts ALL rows including NaN
print(df_with_nan.groupby('dept').size())
# dept
# Eng 2 ← 2 total rows
# HR 2
Use count() when you want to know how many valid (non-missing) values exist per group. Use size() when you want the total number of rows per group regardless of nulls.
13.3 The MultiIndex Column Problem After agg()
When you use agg() with a dictionary and multiple functions per column, the result has a MultiIndex column structure that can be annoying to work with. Named aggregations (agg(new_name=('col', 'func'))) are almost always a cleaner solution:
# AWKWARD: MultiIndex columns
result = df.groupby('region').agg({'revenue': ['mean', 'sum']})
# Accessing: result[('revenue', 'mean')] — verbose and confusing
# CLEAN: Named aggregations
result = df.groupby('region').agg(
avg_revenue=('revenue', 'mean'),
total_revenue=('revenue', 'sum')
)
# Accessing: result['avg_revenue'] — clean and readable
13.4 Applying transform() When You Need agg()
Remember that agg() reduces each group to one row, while transform() preserves the original shape. Using the wrong one produces confusing results:
# agg() — one row per group (use for summary tables)
dept_mean_agg = employees.groupby('Department')['Salary'].agg('mean')
print(dept_mean_agg.shape) # (3,) — one value per department
# transform() — one value per original row (use to add group stats to rows)
dept_mean_tr = employees.groupby('Department')['Salary'].transform('mean')
print(dept_mean_tr.shape) # (7,) — one value per employee row
Conclusion: GroupBy Is Your Most Powerful Analytical Tool
The groupby() method is arguably the single most powerful tool in the Pandas library. It transforms raw transaction-level data into meaningful summaries, enables sophisticated feature engineering through transform(), supports complex group-level filtering through filter(), and integrates seamlessly with pivot tables for multi-dimensional analysis.
In this article, you learned the split-apply-combine mental model that underlies all groupby operations, how to group by single and multiple columns, how to apply single and multiple aggregation functions using agg(), the modern named aggregation syntax that produces clean, readable results, how transform() enables group-aware feature engineering while preserving the DataFrame’s shape, how filter() removes entire groups based on conditions, how to group time series data using date components and resample(), how to build pivot tables and cross-tabulations for two-dimensional summaries, and practical performance tips for large datasets.
You also worked through a complete retail sales analysis that demonstrates how these techniques combine in realistic data science workflows. The patterns you learned here — group by category, aggregate multiple metrics, add group statistics back to rows, filter by group conditions — appear in virtually every real-world data science project.
In the next article, you will learn about merging and joining DataFrames in Pandas — the technique for combining data from multiple tables, which is the natural next step after you have learned to summarize individual tables with groupby.
Key Takeaways
- groupby() implements the split-apply-combine pattern: split data into groups, apply a function to each group, and combine the results.
- Calling groupby() returns a lazy GroupBy object; the computation only happens when you call an aggregation method like .mean(), .sum(), or .agg().
- Named aggregations with agg(new_name=('column', 'function')) produce clean, flat DataFrames with intuitive column names — prefer this over dictionary-style agg().
- transform() returns a result with the same shape as the original DataFrame, making it perfect for adding group-level statistics as new columns for feature engineering.
- filter() keeps or removes entire groups based on a group-level condition, unlike boolean indexing which operates on individual rows.
- Use resample() for time-based aggregation when your DataFrame has a DatetimeIndex.
- pivot_table() creates two-dimensional summary tables with rows and columns representing different grouping variables.
- count() counts non-null values per group; size() counts all rows including nulls — choose carefully.
- Convert grouping columns to category dtype and use observed=True for faster groupby operations on large datasets.
- Prefer built-in aggregation functions ('mean', 'sum', 'count') over lambda functions for performance.