Grouping and Aggregating Data with Pandas

Master Pandas groupby to group, summarize, and aggregate data. Learn agg(), transform(), filter(), and pivot tables with real-world Python examples.

Grouping and aggregating data in Pandas means splitting a DataFrame into groups based on one or more columns, applying a summary function (like sum, mean, or count) to each group, and combining the results into a new table. The groupby() method is the primary tool for this workflow, enabling you to answer questions like “What is the average salary per department?” or “What are the total monthly sales by region?” in just one or two lines of code.

Introduction: Why Grouping and Aggregation Are Central to Data Analysis

Imagine you have a dataset with one million sales transactions recorded over three years, each row representing a single purchase with columns for date, product category, region, sales amount, and customer ID. Raw transaction-level data like this is almost impossible to interpret on its own. What you really want to know is the total revenue per region, the average order value per product category, the number of transactions per month, or which customer segment generates the most profit.

Answering these kinds of questions requires aggregation — summarizing many individual data points into a smaller number of meaningful statistics. And aggregation almost always follows grouping — partitioning the data into subsets based on one or more variables before computing the summary. This split-apply-combine pattern is one of the most fundamental operations in all of data analysis, from exploratory data analysis to reporting dashboards to feature engineering for machine learning.

Pandas provides an exceptionally powerful and flexible toolset for grouping and aggregating data, centered around the groupby() method. Once you master groupby(), you will find yourself reaching for it constantly — it is the analytical backbone of most real-world Pandas workflows. This article will take you from the basic mechanics of groupby() through advanced techniques including multiple aggregations, custom functions, named aggregations, transformations, filtering, and pivot tables.

1. The Split-Apply-Combine Pattern

Before diving into code, it helps to understand the conceptual model that Pandas groupby() is built on. Statistician Hadley Wickham formalized this pattern in his influential 2011 paper, "The Split-Apply-Combine Strategy for Data Analysis," and it applies universally across data manipulation tools:

Split: Divide the DataFrame into groups based on the unique values of one or more columns (the grouping keys).

Apply: Apply a function independently to each group. The function might compute a summary statistic (aggregation), transform each value within the group, or filter the group based on some condition.

Combine: Collect the results from each group and combine them into a single output DataFrame or Series.

Here is a visual illustration of the pattern:

Plaintext
Original DataFrame:
   Department   Salary
0  Engineering  85000
1  Marketing    72000
2  Engineering  78000
3  HR           65000
4  Marketing    68000
5  Engineering  92000

SPLIT by Department:
  Engineering: [85000, 78000, 92000]
  Marketing:   [72000, 68000]
  HR:          [65000]

APPLY mean():
  Engineering: 85000.00
  Marketing:   70000.00
  HR:          65000.00

COMBINE:
  Department    Salary
  Engineering   85000.00
  HR            65000.00
  Marketing     70000.00

This three-step mental model applies to every groupby() operation you will ever write, no matter how complex.
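
Here is the same illustration reproduced in code, as a quick sanity check (a minimal sketch using the numbers from the diagram above):

Python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Engineering', 'Marketing', 'Engineering', 'HR', 'Marketing', 'Engineering'],
    'Salary':     [85000, 72000, 78000, 65000, 68000, 92000]
})

# SPLIT by Department, APPLY mean(), COMBINE into one Series
print(df.groupby('Department')['Salary'].mean())
# Department
# Engineering    85000.0
# HR             65000.0
# Marketing      70000.0
# Name: Salary, dtype: float64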

2. Getting Started with groupby()

2.1 Creating Your First GroupBy Object

The groupby() method returns a GroupBy object — not a DataFrame. This object holds the instructions for how to split the data, but it has not yet performed any computation. Computation happens when you call an aggregation method on the GroupBy object.

Python
import pandas as pd
import numpy as np

# Sample employee dataset
employees = pd.DataFrame({
    'Name':       ['Alice', 'Bob', 'Charlie', 'Diana', 'Edward', 'Fiona', 'George'],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'HR',
                   'Marketing', 'Engineering', 'HR'],
    'Gender':     ['F', 'M', 'M', 'F', 'M', 'F', 'M'],
    'Salary':     [85000, 72000, 78000, 65000, 68000, 92000, 70000],
    'Years_Exp':  [4, 7, 3, 8, 5, 6, 9],
    'Remote':     [True, False, True, True, False, True, False]
})

# Create a GroupBy object
grouped = employees.groupby('Department')
print(type(grouped))
# <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

# The GroupBy object is lazy — call an aggregation to get results
print(grouped['Salary'].mean())
# Department
# Engineering    85000.000000
# HR             67500.000000
# Marketing      70000.000000
# Name: Salary, dtype: float64

Notice that the result is a Series whose index is the unique department names. This is the standard output format for a single-column aggregation after a groupby().

2.2 Inspecting a GroupBy Object

Before computing aggregations, it can be helpful to inspect the groups themselves:

Python
# See which row indices belong to each group
print(grouped.groups)
# {'Engineering': [0, 2, 5], 'HR': [3, 6], 'Marketing': [1, 4]}

# Count the number of rows in each group
print(grouped.size())
# Department
# Engineering    3
# HR             2
# Marketing      2
# dtype: int64

# Get a specific group as a DataFrame
eng_group = grouped.get_group('Engineering')
print(eng_group)
#       Name   Department Gender  Salary  Years_Exp  Remote
# 0    Alice  Engineering      F   85000          4    True
# 2  Charlie  Engineering      M   78000          3    True
# 5    Fiona  Engineering      F   92000          6    True

2.3 Common Single-Function Aggregations

Pandas GroupBy objects expose all common aggregation functions directly as methods:

Python
# Mean salary per department
print(employees.groupby('Department')['Salary'].mean())

# Total salary budget per department
print(employees.groupby('Department')['Salary'].sum())

# Maximum salary per department
print(employees.groupby('Department')['Salary'].max())

# Minimum salary per department
print(employees.groupby('Department')['Salary'].min())

# Count of employees per department
print(employees.groupby('Department')['Salary'].count())

# Standard deviation of salary per department
print(employees.groupby('Department')['Salary'].std())

# Median salary per department
print(employees.groupby('Department')['Salary'].median())

Each of these returns a Series indexed by the grouping column. The count() function counts non-null values (useful for detecting missing data within groups), while size() on the GroupBy object counts all rows including nulls.

3. Grouping by Multiple Columns

One of the most powerful features of groupby() is the ability to group by more than one column simultaneously, creating a hierarchical structure of groups.

Python
# Group by both Department and Gender
multi_grouped = employees.groupby(['Department', 'Gender'])
print(multi_grouped['Salary'].mean())

# Department   Gender
# Engineering  F         88500.0
#              M         78000.0
# HR           F         65000.0
#              M         70000.0
# Marketing    M         70000.0
# Name: Salary, dtype: float64

The result has a MultiIndex — a hierarchical index with Department at the outer level and Gender at the inner level. This is a natural representation of grouped data at two levels.
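
A few access patterns make MultiIndex results easy to work with (a minimal sketch, continuing from the multi_grouped object above):

Python
dept_gender_salary = multi_grouped['Salary'].mean()

# Select a single group with a tuple key
print(dept_gender_salary[('Engineering', 'F')])   # 88500.0

# Select all rows for one outer-level value
print(dept_gender_salary['Engineering'])

# Pivot the inner level into columns for a 2D view
print(dept_gender_salary.unstack())
# Gender             F        M
# Department
# Engineering  88500.0  78000.0
# HR           65000.0  70000.0
# Marketing        NaN  70000.0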

3.1 Resetting the Index After Groupby

Often you want the grouping columns to be regular columns in the result rather than part of the index. Use reset_index() for this:

Python
result = employees.groupby(['Department', 'Gender'])['Salary'].mean().reset_index()
print(result)
#    Department Gender    Salary
# 0  Engineering      F   88500.0
# 1  Engineering      M   78000.0
# 2           HR      F   65000.0
# 3           HR      M   70000.0
# 4    Marketing      M   70000.0

Now the result is a clean, flat DataFrame that is easy to use for further analysis or visualization.

3.2 The as_index Parameter

Alternatively, you can pass as_index=False directly to groupby() to get a flat DataFrame without a subsequent reset_index() call:

Python
result = employees.groupby('Department', as_index=False)['Salary'].mean()
print(result)
#    Department    Salary
# 0  Engineering  85000.0
# 1           HR  67500.0
# 2    Marketing  70000.0

This is a convenient shorthand when you know in advance that you want a flat result.

4. Multiple Aggregations with agg()

Computing a single summary statistic per group is useful, but real analysis usually requires multiple statistics at once. The agg() method (short for aggregate) lets you apply multiple functions simultaneously.

4.1 Applying Multiple Functions to One Column

Python
# Multiple statistics for salary per department
salary_stats = employees.groupby('Department')['Salary'].agg(
    ['mean', 'median', 'std', 'min', 'max', 'count']
)
print(salary_stats)

#               mean  median          std    min    max  count
# Department
# Engineering  85000  85000  7000.000000  78000  92000      3
# HR           67500  67500  3535.533906  65000  70000      2
# Marketing    70000  70000  2828.427125  68000  72000      2

4.2 Applying Different Functions to Different Columns

This is where agg() truly shines. You can specify a dictionary mapping each column to its aggregation function(s):

Python
result = employees.groupby('Department').agg({
    'Salary':    ['mean', 'max'],
    'Years_Exp': ['mean', 'min'],
    'Remote':    'sum'    # Count of remote workers (True = 1)
})
print(result)

#              Salary          Years_Exp       Remote
#                mean    max      mean  min      sum
# Department
# Engineering  85000  92000  4.333333    3        3
# HR           67500  70000  8.500000    8        1
# Marketing    70000  72000  6.000000    5        0

The result has a MultiIndex column structure — a hierarchy of (column name, function name). This can be useful but is sometimes cumbersome to work with. Named aggregations solve this.
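
If you do end up with MultiIndex columns (for example, from code you cannot change), one common workaround is to flatten them by joining the two levels (a sketch, continuing from the result above):

Python
# Flatten ('Salary', 'mean') -> 'Salary_mean', ('Remote', 'sum') -> 'Remote_sum', etc.
result.columns = ['_'.join(col) for col in result.columns]
print(result.columns.tolist())
# ['Salary_mean', 'Salary_max', 'Years_Exp_mean', 'Years_Exp_min', 'Remote_sum']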

4.3 Named Aggregations with agg() — The Modern Approach

Introduced in Pandas 0.25, named aggregations give each result column a custom name directly within the agg() call, producing a clean flat DataFrame:

Python
result = employees.groupby('Department').agg(
    avg_salary    = ('Salary',    'mean'),
    max_salary    = ('Salary',    'max'),
    avg_exp       = ('Years_Exp', 'mean'),
    remote_count  = ('Remote',    'sum'),
    headcount     = ('Name',      'count')
)
print(result)

#              avg_salary  max_salary   avg_exp  remote_count  headcount
# Department
# Engineering     85000.0       92000  4.333333             3          3
# HR              67500.0       70000  8.500000             1          2
# Marketing       70000.0       72000  6.000000             0          2

Named aggregations are the recommended modern approach. The syntax new_column_name=('source_column', 'function') is clean, readable, and produces intuitive column names without needing to rename columns afterward.

4.4 Using Custom Functions in agg()

Beyond built-in aggregation functions like 'mean' and 'sum', you can pass any Python function — including lambda functions or custom-defined functions:

Python
# Custom aggregation: salary range (max - min)
def salary_range(series):
    return series.max() - series.min()

result = employees.groupby('Department').agg(
    avg_salary    = ('Salary', 'mean'),
    salary_range  = ('Salary', salary_range),
    pct_remote    = ('Remote', lambda x: x.mean() * 100)  # % remote
)
print(result)

#              avg_salary  salary_range  pct_remote
# Department
# Engineering     85000.0       14000.0  100.000000
# HR              67500.0        5000.0   50.000000
# Marketing       70000.0        4000.0    0.000000

Custom functions receive each group’s column data as a Series and should return a single scalar value.

5. The transform() Method: Group-Aware Transformations

While agg() reduces each group to a single summary value, transform() returns a result with the same shape as the original DataFrame — one value per row, computed within each group. This makes transform() ideal for feature engineering and for adding group-level statistics as new columns.

5.1 Adding Group Means Back to the DataFrame

Python
# Add department average salary as a new column
employees['dept_avg_salary'] = employees.groupby('Department')['Salary'].transform('mean')

print(employees[['Name', 'Department', 'Salary', 'dept_avg_salary']])
#       Name   Department  Salary  dept_avg_salary
# 0    Alice  Engineering   85000         85000.000
# 1      Bob    Marketing   72000         70000.000
# 2  Charlie  Engineering   78000         85000.000
# 3    Diana           HR   65000         67500.000
# 4   Edward    Marketing   68000         70000.000
# 5    Fiona  Engineering   92000         85000.000
# 6   George           HR   70000         67500.000

Each employee’s row now shows their department’s average salary, perfectly aligned with the original index.

5.2 Computing Salary Deviation from Department Mean

This is a classic feature engineering technique — measuring how much each individual deviates from their group’s central tendency:

Python
employees['salary_deviation'] = (
    employees['Salary'] - employees.groupby('Department')['Salary'].transform('mean')
)

print(employees[['Name', 'Department', 'Salary', 'salary_deviation']])
#       Name   Department  Salary  salary_deviation
# 0    Alice  Engineering   85000               0.0
# 1      Bob    Marketing   72000            2000.0
# 2  Charlie  Engineering   78000           -7000.0
# 3    Diana           HR   65000           -2500.0
# 4   Edward    Marketing   68000           -2000.0
# 5    Fiona  Engineering   92000            7000.0
# 6   George           HR   70000            2500.0

Positive values indicate above-average salaries within the department; negative values indicate below-average. This kind of within-group normalization is extremely common in machine learning feature engineering.

5.3 Ranking Within Groups

GroupBy objects also expose rank(), a transformation-style method that returns one value per original row, letting you compute each employee's salary rank within their department:

Python
employees['dept_salary_rank'] = employees.groupby('Department')['Salary'].rank(
    ascending=False, method='dense'
)
print(employees[['Name', 'Department', 'Salary', 'dept_salary_rank']])
#       Name   Department  Salary  dept_salary_rank
# 0    Alice  Engineering   85000               2.0
# 1      Bob    Marketing   72000               1.0
# 2  Charlie  Engineering   78000               3.0
# 3    Diana           HR   65000               2.0
# 4   Edward    Marketing   68000               2.0
# 5    Fiona  Engineering   92000               1.0
# 6   George           HR   70000               1.0

5.4 Group-Aware Normalization (Z-Score)

Z-score normalization within groups (also called standardization) is a powerful preprocessing technique that puts values on a comparable scale while accounting for differences between groups:

Python
from scipy import stats

# Z-score normalization within each department
employees['salary_zscore'] = employees.groupby('Department')['Salary'].transform(
    lambda x: stats.zscore(x, ddof=1) if len(x) > 1 else 0
)

print(employees[['Name', 'Department', 'Salary', 'salary_zscore']].round(3))
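
If you would rather avoid the SciPy dependency, the same within-group z-score can be computed with plain pandas (a minimal sketch; Series.std() uses ddof=1 by default, matching the call above):

Python
employees['salary_zscore'] = employees.groupby('Department')['Salary'].transform(
    lambda x: (x - x.mean()) / x.std() if len(x) > 1 else 0
)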

6. The filter() Method: Keeping or Removing Entire Groups

The filter() method allows you to keep or discard entire groups based on a condition applied to each group. Unlike boolean indexing on individual rows, filter() operates at the group level — if the condition is True for a group, all rows in that group are kept; if False, the entire group is dropped.

Python
# Keep only departments with average salary above 75,000
high_paying_depts = employees.groupby('Department').filter(
    lambda group: group['Salary'].mean() > 75000
)
print(high_paying_depts[['Name', 'Department', 'Salary']])
#       Name   Department  Salary
# 0    Alice  Engineering   85000
# 2  Charlie  Engineering   78000
# 5    Fiona  Engineering   92000

Only Engineering employees are kept (average salary $85,000), while HR ($67,500) and Marketing ($70,000) are excluded.

Python
# Keep only departments with 3 or more employees
large_depts = employees.groupby('Department').filter(
    lambda group: len(group) >= 3
)
print(large_depts)
# Only Engineering rows (3 employees); HR and Marketing each have 2

Python
# Keep departments where at least one employee has 8+ years of experience
exp_depts = employees.groupby('Department').filter(
    lambda group: group['Years_Exp'].max() >= 8
)
print(exp_depts[['Name', 'Department', 'Years_Exp']])
# HR (George: 9 years, Diana: 8 years) meets the condition
# Engineering and Marketing do not

The filter() method is particularly useful for removing small or unrepresentative groups from your analysis — for example, filtering out product categories with fewer than 10 orders before computing average order values.
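
That scenario might look like this (a sketch with a small made-up orders DataFrame):

Python
orders = pd.DataFrame({
    'category':    ['A'] * 12 + ['B'] * 4 + ['C'] * 15,
    'order_value': [50.0 + i for i in range(31)]
})

# Drop categories with fewer than 10 orders, then compute average order value
frequent = orders.groupby('category').filter(lambda g: len(g) >= 10)
print(frequent.groupby('category')['order_value'].mean())
# Category 'B' (only 4 orders) is excluded from the averages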

7. Iterating Over Groups

Sometimes you need to process each group individually — for example, to apply a complex transformation, generate a separate report per group, or train a separate model per segment. The GroupBy object is iterable, yielding (group_name, group_dataframe) tuples:

Python
for dept_name, dept_df in employees.groupby('Department'):
    avg = dept_df['Salary'].mean()
    top_earner = dept_df.loc[dept_df['Salary'].idxmax(), 'Name']
    print(f"\n{dept_name}:")
    print(f"  Employees: {len(dept_df)}")
    print(f"  Avg Salary: ${avg:,.0f}")
    print(f"  Top Earner: {top_earner}")

# Engineering:
#   Employees: 3
#   Avg Salary: $85,000
#   Top Earner: Fiona

# HR:
#   Employees: 2
#   Avg Salary: $67,500
#   Top Earner: George

# Marketing:
#   Employees: 2
#   Avg Salary: $70,000
#   Top Earner: Bob

While iteration is flexible, it is generally slower than vectorized agg() or transform() operations. Reserve it for cases where you genuinely need to process each group as an independent DataFrame.
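
A common legitimate use is generating one output file per group, as mentioned above (a minimal sketch; the file names are illustrative):

Python
# Write one CSV report per department
for dept_name, dept_df in employees.groupby('Department'):
    dept_df.to_csv(f"report_{dept_name.lower()}.csv", index=False)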

8. Grouping with Time-Based Data

Grouping time series data is one of the most common real-world applications of Pandas groupby(). You can group by components of a datetime column (year, month, day, weekday) or use the resample() method for time-based aggregation.

8.1 Grouping by Date Components

Python
import pandas as pd
import numpy as np

# Simulate daily sales data over 2 years
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2024-12-31', freq='D')
sales_df = pd.DataFrame({
    'date':     dates,
    'sales':    np.random.normal(1000, 200, len(dates)).round(2),
    'region':   np.random.choice(['North', 'South', 'East', 'West'], len(dates)),
    'category': np.random.choice(['Electronics', 'Clothing', 'Food'], len(dates))
})

# Extract date components
sales_df['year']    = sales_df['date'].dt.year
sales_df['month']   = sales_df['date'].dt.month
sales_df['weekday'] = sales_df['date'].dt.day_name()
sales_df['quarter'] = sales_df['date'].dt.quarter

# Monthly average sales
monthly_avg = sales_df.groupby(['year', 'month'])['sales'].mean().round(2)
print(monthly_avg.head(6))
# year  month
# 2023  1        1005.47
#       2         997.83
#       3        1001.22
# ...

# Sales by day of week (which day generates most revenue?)
dow_sales = sales_df.groupby('weekday')['sales'].agg(
    avg_sales='mean',
    total_sales='sum',
    count='count'
).round(2)
print(dow_sales.sort_values('avg_sales', ascending=False))

8.2 Using resample() for Time-Based Grouping

When your DataFrame has a DatetimeIndex, resample() is a more elegant alternative to groupby() for time-based aggregation:

Python
# Set date as index
sales_ts = sales_df.set_index('date')['sales']

# Resample to monthly total
monthly_total = sales_ts.resample('ME').sum()
print(monthly_total.head())
# date
# 2023-01-31    30726.84
# 2023-02-28    27938.64
# 2023-03-31    31042.18
# ...

# Resample to quarterly mean
quarterly_mean = sales_ts.resample('QE').mean()
print(quarterly_mean)

# Resample to weekly and compute multiple stats
weekly_stats = sales_ts.resample('W').agg(['mean', 'sum', 'count'])
print(weekly_stats.head())

resample() uses time-frequency aliases: 'D' for daily, 'W' for weekly, 'ME' for month-end, 'QE' for quarter-end, 'YE' for year-end. It is the go-to tool for time series aggregation.

9. Pivot Tables: A Different View of Grouped Data

A pivot table is a two-dimensional summary table where the rows represent one grouping variable, the columns represent another grouping variable, and the cell values contain an aggregated statistic. Pivot tables are what most business analysts mean when they talk about “cross-tabulation” or “cross-tab” analysis.

9.1 Creating Pivot Tables with pivot_table()

Python
import pandas as pd
import numpy as np

# Sales by region and category
np.random.seed(42)
n = 500
sales = pd.DataFrame({
    'region':   np.random.choice(['North', 'South', 'East', 'West'], n),
    'category': np.random.choice(['Electronics', 'Clothing', 'Food'], n),
    'sales':    np.random.normal(500, 150, n).round(2),
    'units':    np.random.randint(1, 20, n)
})

# Pivot: rows=region, columns=category, values=mean sales
pivot = sales.pivot_table(
    values='sales',
    index='region',
    columns='category',
    aggfunc='mean'
).round(2)

print(pivot)
# category  Clothing  Electronics    Food
# region
# East        499.61       498.34  500.12
# North       507.23       495.18  502.44
# South       493.87       510.76  498.93
# West        501.45       503.22  497.56

9.2 Pivot Tables with Multiple Values and Margins

Python
# Multiple aggregation functions and adding row/column totals
pivot_full = sales.pivot_table(
    values=['sales', 'units'],
    index='region',
    columns='category',
    aggfunc={'sales': 'sum', 'units': 'mean'},
    margins=True,          # Add row and column totals/averages
    margins_name='Total'   # Label for the totals row/column
).round(2)

print(pivot_full)

The margins=True option adds a “Total” row and column that aggregates across all groups, similar to the “Grand Total” in a spreadsheet pivot table.

9.3 Cross-Tabulation with crosstab()

pd.crosstab() is a specialized function for counting the frequency of combinations of categorical values — essentially a pivot table with aggfunc='count':

Python
# Count employees by department and gender
crosstab = pd.crosstab(employees['Department'], employees['Gender'])
print(crosstab)
# Gender       F  M
# Department
# Engineering  2  1
# HR           1  1
# Marketing    0  2

# With row and column percentages
crosstab_pct = pd.crosstab(
    employees['Department'],
    employees['Gender'],
    normalize='index'    # Normalize each row to sum to 1
).round(3)
print(crosstab_pct)
# Gender         F      M
# Department
# Engineering  0.667  0.333
# HR           0.500  0.500
# Marketing    0.000  1.000
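
To see the pivot-table equivalence mentioned above, the same counts can be reproduced with pivot_table() (a sketch; any non-grouping column can serve as values when counting):

Python
counts = employees.pivot_table(
    index='Department',
    columns='Gender',
    values='Name',
    aggfunc='count',
    fill_value=0
)
print(counts)
# Gender       F  M
# Department
# Engineering  2  1
# HR           1  1
# Marketing    0  2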

10. Practical Comparison Table: GroupBy Methods

Understanding when to use each method is crucial. Here is a comprehensive comparison:

Method                  | Returns              | Shape                        | Primary Use Case                       | Example
groupby().agg()         | DataFrame or Series  | Reduced (one row per group)  | Summary statistics per group           | Total sales per region
groupby().transform()   | Series or DataFrame  | Same as original             | Add group stats as new columns         | Add group mean to each row
groupby().filter()      | DataFrame            | Subset of original rows      | Remove entire groups by condition      | Keep only large departments
groupby().apply()       | Flexible             | Depends on function          | Complex custom group-level logic       | Fit model per group
pivot_table()           | DataFrame            | 2D summary table             | Cross-tabular summaries                | Sales by region and category
crosstab()              | DataFrame            | 2D frequency table           | Frequency of categorical combinations  | Count by gender and department
resample()              | Series or DataFrame  | Time-resampled               | Time series aggregation                | Monthly average from daily data

11. Real-World Example: Retail Sales Analysis

Let us apply everything from this article to a realistic end-to-end analysis of a retail sales dataset:

Python
import pandas as pd
import numpy as np

# Simulate retail transactions
np.random.seed(123)
n = 2000

df = pd.DataFrame({
    'transaction_id': range(1, n + 1),
    'date':           pd.date_range('2024-01-01', periods=n, freq='3h'),
    'customer_id':    np.random.randint(1001, 1201, n),
    'region':         np.random.choice(['North', 'South', 'East', 'West'], n,
                                        p=[0.30, 0.25, 0.25, 0.20]),
    'category':       np.random.choice(['Electronics', 'Clothing', 'Food', 'Sports'], n,
                                        p=[0.25, 0.30, 0.30, 0.15]),
    'units_sold':     np.random.randint(1, 15, n),
    'unit_price':     np.random.choice([9.99, 24.99, 49.99, 99.99, 199.99], n),
    'discount_pct':   np.random.choice([0, 0.05, 0.10, 0.20], n, p=[0.5, 0.2, 0.2, 0.1])
})

# Feature engineering
df['revenue']    = (df['units_sold'] * df['unit_price'] * (1 - df['discount_pct'])).round(2)
df['month']      = df['date'].dt.to_period('M')
df['quarter']    = df['date'].dt.to_period('Q')
df['day_of_week'] = df['date'].dt.day_name()

print(f"Dataset: {df.shape[0]:,} transactions | {df['revenue'].sum():,.2f} total revenue")
print(f"Date range: {df['date'].min().date()} to {df['date'].max().date()}")

# ── Analysis 1: Revenue summary by region ────────────────────────────────────
print("\n=== Revenue by Region ===")
region_summary = df.groupby('region').agg(
    transactions  = ('transaction_id', 'count'),
    total_revenue = ('revenue',        'sum'),
    avg_revenue   = ('revenue',        'mean'),
    avg_discount  = ('discount_pct',   'mean')
).round(2)
region_summary['revenue_share_%'] = (
    region_summary['total_revenue'] / region_summary['total_revenue'].sum() * 100
).round(1)
print(region_summary.sort_values('total_revenue', ascending=False))

# ── Analysis 2: Category performance ────────────────────────────────────────
print("\n=== Category Performance ===")
category_stats = df.groupby('category').agg(
    total_units   = ('units_sold',  'sum'),
    total_revenue = ('revenue',     'sum'),
    avg_price     = ('unit_price',  'mean'),
    transactions  = ('transaction_id', 'count')
).round(2)
category_stats['revenue_per_unit'] = (
    category_stats['total_revenue'] / category_stats['total_units']
).round(2)
print(category_stats.sort_values('total_revenue', ascending=False))

# ── Analysis 3: Region × Category cross-tab ─────────────────────────────────
print("\n=== Revenue by Region and Category (Pivot Table) ===")
revenue_pivot = df.pivot_table(
    values='revenue',
    index='region',
    columns='category',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
).round(0)
print(revenue_pivot)

# ── Analysis 4: Customer analysis with transform ─────────────────────────────
print("\n=== Customer Spending Analysis ===")
df['customer_total_spend'] = df.groupby('customer_id')['revenue'].transform('sum')
df['customer_avg_spend']   = df.groupby('customer_id')['revenue'].transform('mean')
df['customer_txn_count']   = df.groupby('customer_id')['revenue'].transform('count')

# Classify customers by total spend
spend_quantiles = df.groupby('customer_id')['revenue'].sum().quantile([0.33, 0.67])
low_thresh, high_thresh = spend_quantiles[0.33], spend_quantiles[0.67]

customer_summary = df.groupby('customer_id').agg(
    total_spend   = ('revenue', 'sum'),
    avg_order     = ('revenue', 'mean'),
    num_orders    = ('transaction_id', 'count'),
    top_category  = ('category', lambda x: x.mode()[0])
).round(2)

customer_summary['segment'] = pd.cut(
    customer_summary['total_spend'],
    bins=[0, low_thresh, high_thresh, float('inf')],
    labels=['Low', 'Medium', 'High']
)
print(customer_summary.head(10))
print(f"\nCustomer segments:\n{customer_summary['segment'].value_counts()}")

# ── Analysis 5: Day-of-week revenue pattern ──────────────────────────────────
print("\n=== Revenue by Day of Week ===")
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
dow_analysis = df.groupby('day_of_week').agg(
    avg_revenue     = ('revenue', 'mean'),
    total_revenue   = ('revenue', 'sum'),
    transactions    = ('transaction_id', 'count')
).round(2)
# Reorder by day of week
dow_analysis = dow_analysis.reindex(
    [d for d in day_order if d in dow_analysis.index]
)
print(dow_analysis)

# ── Analysis 6: Filter — focus on high-value categories ──────────────────────
print("\n=== High-Value Categories (avg revenue > 100) ===")
high_value = df.groupby('category').filter(
    lambda g: g['revenue'].mean() > 100
)
print(f"Transactions in high-value categories: {len(high_value):,}")
print(f"Categories kept: {high_value['category'].unique()}")

This comprehensive analysis demonstrates the full power of Pandas grouping and aggregation: regional breakdowns, category performance rankings, customer segmentation, pivot table cross-tabulations, day-of-week pattern analysis, and group filtering — all flowing naturally from a single transaction-level DataFrame.

12. Performance Tips for Large Datasets

When working with DataFrames of millions of rows, the efficiency of your groupby() operations becomes important.

12.1 Use Categorical Dtype for Grouping Columns

If your grouping columns have a small number of unique values (like region, category, department), converting them to category dtype dramatically reduces memory usage and speeds up groupby() operations:

Python
df['region']   = df['region'].astype('category')
df['category'] = df['category'].astype('category')

# GroupBy on categorical columns is significantly faster
result = df.groupby(['region', 'category'])['revenue'].sum()

12.2 Use observed=True for Categorical GroupBy

When grouping by categorical columns, Pandas by default includes all possible categories even if some combinations do not appear in the data. Use observed=True to include only categories that actually appear:

Python
result = df.groupby(['region', 'category'], observed=True)['revenue'].sum()

12.3 Prefer Built-in Functions Over Custom Lambdas

Built-in aggregation functions like 'mean', 'sum', 'count' are implemented in optimized Cython/C code and are significantly faster than equivalent Python lambda functions. Reserve lambdas for computations that genuinely cannot be expressed with built-in functions.

Python
# Faster: built-in
df.groupby('region')['revenue'].mean()

# Slower: equivalent lambda
df.groupby('region')['revenue'].agg(lambda x: x.mean())

12.4 Avoid iterrows() in Group Processing

If you find yourself iterating over groups with a for loop and then using iterrows() inside the loop, step back and ask whether the same result can be achieved with transform() or agg(). Nested iteration is almost always a sign that a vectorized solution exists and should be used instead.
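
For example, a nested loop that assigns each row its group mean can be replaced by a single transform() call (a sketch; both approaches produce the same column):

Python
# Slow: nested iteration with iterrows() (avoid this pattern)
region_avg = pd.Series(index=df.index, dtype=float)
for region, grp in df.groupby('region'):
    grp_mean = grp['revenue'].mean()
    for idx, _ in grp.iterrows():
        region_avg[idx] = grp_mean
df['region_avg_slow'] = region_avg

# Fast: one vectorized, group-aware call
df['region_avg_fast'] = df.groupby('region')['revenue'].transform('mean')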

13. Common Mistakes and How to Avoid Them

13.1 Forgetting to Call an Aggregation Function

A very common beginner mistake is printing the GroupBy object itself and wondering why it does not show data:

Python
# WRONG — this just shows the GroupBy object description
grouped = employees.groupby('Department')
print(grouped)  # <DataFrameGroupBy object at 0x...>

# CORRECT — call an aggregation method
print(grouped['Salary'].mean())

13.2 Confusing count() and size()

These two methods seem similar but behave differently:

Python
df_with_nan = pd.DataFrame({
    'dept': ['Eng', 'Eng', 'HR', 'HR'],
    'sal':  [80000, np.nan, 65000, 70000]
})

# count() counts NON-NULL values per column
print(df_with_nan.groupby('dept')['sal'].count())
# dept
# Eng    1   ← only 1 non-null value
# HR     2

# size() counts ALL rows including NaN
print(df_with_nan.groupby('dept').size())
# dept
# Eng    2   ← 2 total rows
# HR     2

Use count() when you want to know how many valid (non-missing) values exist per group. Use size() when you want the total number of rows per group regardless of nulls.

13.3 The MultiIndex Column Problem After agg()

When you use agg() with a dictionary and multiple functions per column, the result has a MultiIndex column structure that can be annoying to work with. Named aggregations (agg(new_name=('col', 'func'))) are almost always a cleaner solution:

Python
# AWKWARD: MultiIndex columns
result = df.groupby('region').agg({'revenue': ['mean', 'sum']})
# Accessing: result[('revenue', 'mean')]  — verbose and confusing

# CLEAN: Named aggregations
result = df.groupby('region').agg(
    avg_revenue=('revenue', 'mean'),
    total_revenue=('revenue', 'sum')
)
# Accessing: result['avg_revenue']  — clean and readable

13.4 Applying transform() When You Need agg()

Remember that agg() reduces each group to one row, while transform() preserves the original shape. Using the wrong one produces confusing results:

Python
# agg() — one row per group (use for summary tables)
dept_mean_agg = employees.groupby('Department')['Salary'].agg('mean')
print(dept_mean_agg.shape)   # (3,) — one value per department

# transform() — one value per original row (use to add group stats to rows)
dept_mean_tr = employees.groupby('Department')['Salary'].transform('mean')
print(dept_mean_tr.shape)    # (7,) — one value per employee row

Conclusion: GroupBy Is Your Most Powerful Analytical Tool

The groupby() method is arguably the single most powerful tool in the Pandas library. It transforms raw transaction-level data into meaningful summaries, enables sophisticated feature engineering through transform(), supports complex group-level filtering through filter(), and integrates seamlessly with pivot tables for multi-dimensional analysis.

In this article, you learned the split-apply-combine mental model that underlies all groupby operations, how to group by single and multiple columns, how to apply single and multiple aggregation functions using agg(), the modern named aggregation syntax that produces clean, readable results, how transform() enables group-aware feature engineering while preserving the DataFrame’s shape, how filter() removes entire groups based on conditions, how to group time series data using date components and resample(), how to build pivot tables and cross-tabulations for two-dimensional summaries, and practical performance tips for large datasets.

You also worked through a complete retail sales analysis that demonstrates how these techniques combine in realistic data science workflows. The patterns you learned here — group by category, aggregate multiple metrics, add group statistics back to rows, filter by group conditions — appear in virtually every real-world data science project.

In the next article, you will learn about merging and joining DataFrames in Pandas — the technique for combining data from multiple tables, which is the natural next step after you have learned to summarize individual tables with groupby.

Key Takeaways

  • groupby() implements the split-apply-combine pattern: split data into groups, apply a function to each group, and combine the results.
  • Calling groupby() returns a lazy GroupBy object; the computation only happens when you call an aggregation method like .mean(), .sum(), or .agg().
  • Named aggregations with agg(new_name=('column', 'function')) produce clean, flat DataFrames with intuitive column names — prefer this over dictionary-style agg().
  • transform() returns a result with the same shape as the original DataFrame, making it perfect for adding group-level statistics as new columns for feature engineering.
  • filter() keeps or removes entire groups based on a group-level condition, unlike boolean indexing which operates on individual rows.
  • Use resample() for time-based aggregation when your DataFrame has a DatetimeIndex.
  • pivot_table() creates two-dimensional summary tables with rows and columns representing different grouping variables.
  • count() counts non-null values per group; size() counts all rows including nulls — choose carefully.
  • Convert grouping columns to category dtype and use observed=True for faster groupby operations on large datasets.
  • Prefer built-in aggregation functions ('mean', 'sum', 'count') over lambda functions for performance.