pandas reads Excel files with pd.read_excel("file.xlsx") and writes them with df.to_excel("output.xlsx", index=False). For multiple sheets, sheet_name=None returns all sheets as a dictionary of DataFrames, and pd.ExcelWriter writes multiple DataFrames to separate sheets in a single file. The openpyxl engine (for .xlsx) and xlrd engine (for older .xls files) power Excel I/O in pandas, with openpyxl also enabling cell-level formatting, formulas, and styling beyond what pandas alone provides.
Introduction
Excel is the world’s most widely used data tool. Virtually every organization stores and shares data in Excel files — financial reports, survey results, sales records, experiment logs, client lists, configuration tables. As a data scientist, you’ll encounter Excel files constantly: receiving them from colleagues, clients, and data providers; needing to clean and analyze their contents; and often expected to deliver your results back in Excel format.
pandas has robust Excel support built in. Whether you’re reading a single clean table, extracting data from one of a dozen sheets, dealing with headers buried three rows down, or writing a polished multi-sheet report with formatted summaries, pandas — and its underlying library openpyxl — handles it all.
This article covers the complete spectrum of Excel work in pandas: reading single and multiple sheets with all the important options, handling messy real-world Excel files with non-standard layouts, writing DataFrames back to Excel with precise control, creating multi-sheet workbooks, adding basic formatting, and navigating the common problems that arise when working with Excel files programmatically.
Setting Up: Required Libraries
pip install pandas openpyxl xlrdpandas: The primary interface for reading and writing Excel filesopenpyxl: The engine pandas uses for.xlsxfiles (Excel 2007 and newer) — also used directly for advanced formattingxlrd: Required for reading older.xlsformat files (Excel 97-2003)
Check your versions:
import pandas as pd
import openpyxl
print(f"pandas: {pd.__version__}")
print(f"openpyxl: {openpyxl.__version__}")Reading Excel Files: pd.read_excel()
The Basic Pattern
import pandas as pd
# Read the first (default) sheet of an Excel file
df = pd.read_excel("data/sales_report.xlsx")
print(df.shape)
print(df.dtypes)
print(df.head())pd.read_excel() infers data types automatically — numbers become int64 or float64, text becomes object, and dates become datetime64[ns]. Always inspect df.dtypes after reading to verify types were inferred correctly.
Specifying the Sheet
# By sheet name (string)
df = pd.read_excel("data/workbook.xlsx", sheet_name="Q3 Sales")
# By index (0-based integer — 0 is the first sheet)
df = pd.read_excel("data/workbook.xlsx", sheet_name=0) # First sheet
df = pd.read_excel("data/workbook.xlsx", sheet_name=2) # Third sheet
# Read ALL sheets — returns dict of {sheet_name: DataFrame}
all_sheets = pd.read_excel("data/workbook.xlsx", sheet_name=None)
print(list(all_sheets.keys())) # ['Q1 Sales', 'Q2 Sales', 'Q3 Sales', 'Summary']
# Access individual sheets
q3_df = all_sheets["Q3 Sales"]
# Read specific named sheets as a dict
selected = pd.read_excel(
"data/workbook.xlsx",
sheet_name=["Q1 Sales", "Q3 Sales"]
)Controlling Which Rows and Columns to Read
Real Excel files are rarely perfectly structured. Headers might start on row 3, footnotes at the bottom need skipping, and only a subset of columns are relevant:
# Skip the first 2 rows (e.g., a title row and blank row before the header)
df = pd.read_excel("data/report.xlsx", skiprows=2)
# Skip specific rows by index (0-based)
df = pd.read_excel("data/report.xlsx", skiprows=[0, 1, 3]) # Skip rows 0, 1, 3
# Header is on row 3 (0-indexed row 2)
df = pd.read_excel("data/report.xlsx", header=2)
# Skip N rows from the bottom (footnotes, totals, etc.)
df = pd.read_excel("data/report.xlsx", skipfooter=3)
# Read only the first N rows of data (after header)
df = pd.read_excel("data/report.xlsx", nrows=100)
# Read specific columns by letter range or column index
df = pd.read_excel("data/report.xlsx", usecols="A:E") # Columns A through E
df = pd.read_excel("data/report.xlsx", usecols="A,C,E:G") # A, C, and E through G
df = pd.read_excel("data/report.xlsx", usecols=[0, 2, 4]) # Columns by index
df = pd.read_excel("data/report.xlsx", usecols=["Name", "Revenue", "Region"]) # By nameControlling Data Types and Parsing
# Explicitly specify column data types to avoid misinterpretation
df = pd.read_excel(
"data/customers.xlsx",
dtype={
"customer_id": str, # Keep as string, not integer (avoids: '00123' → 123)
"phone": str, # Phone numbers as strings
"zip_code": str, # ZIP codes as strings
"amount": float,
"is_active": bool
}
)
# Parse date columns automatically
df = pd.read_excel(
"data/orders.xlsx",
parse_dates=["order_date", "delivery_date"]
)
# Handle Excel's numeric date storage
# Excel stores dates as integers (days since Jan 1, 1900)
# pandas usually handles this automatically, but sometimes needs help:
df = pd.read_excel(
"data/legacy_export.xlsx",
parse_dates=["date_column"],
date_format="%Y-%m-%d" # Specify format if needed
)Handling Multiple Header Rows (MultiIndex)
Some Excel files use two or more header rows to create hierarchical column labels:
Jan 2024 Feb 2024
Region Revenue Units Revenue Units
East 45000 142 48000 158
West 38000 119 41000 126# header=[0, 1] tells pandas to use the first two rows as a multi-level header
df = pd.read_excel("data/multi_header_report.xlsx", header=[0, 1])
print(df.columns)
# MultiIndex([('Jan 2024', 'Revenue'), ('Jan 2024', 'Units'),
# ('Feb 2024', 'Revenue'), ('Feb 2024', 'Units')],)
# Access columns with tuple notation
print(df[("Jan 2024", "Revenue")])
# Flatten multi-level headers to single-level
df.columns = ["_".join(col).strip() for col in df.columns]
print(df.columns.tolist())
# ['Jan 2024_Revenue', 'Jan 2024_Units', 'Feb 2024_Revenue', 'Feb 2024_Units']Handling Missing Values
Excel files use various representations for missing data — blank cells, the string “N/A”, “n/a”, “NULL”, “-“, etc. Tell pandas to treat them all as NaN:
df = pd.read_excel(
"data/survey_results.xlsx",
na_values=["N/A", "n/a", "NULL", "null", "-", "--", "none", "None", "NA", ""]
)
# Check result
print(df.isnull().sum())A Complete read_excel() Parameter Reference
The most useful pd.read_excel() parameters in a single view:
| Parameter | Type | Description | Example |
|---|---|---|---|
io | str/Path | File path or URL | "data/sales.xlsx" |
sheet_name | str/int/list/None | Sheet to read | "Sales", 0, None (all) |
header | int/list | Row(s) for column names | 0 (default), 2, [0,1] |
skiprows | int/list | Rows to skip before header | 2, [0, 1, 3] |
skipfooter | int | Rows to skip at bottom | 3 |
nrows | int | Max rows to read | 1000 |
usecols | str/list | Columns to read | "A:E", ["Name","Rev"] |
dtype | dict | Column data types | {"id": str, "amt": float} |
parse_dates | list | Columns to parse as dates | ["order_date"] |
na_values | list/str | Additional NA strings | ["N/A", "-", "NULL"] |
index_col | int/str | Column to use as row index | 0, "customer_id" |
engine | str | Parsing engine to use | "openpyxl" (xlsx), "xlrd" (xls) |
Reading Real-World Messy Excel Files
Clean Excel files with one table starting at row 1 are the exception. Real files are messy. Here are patterns for common situations.
Pattern 1: Title and Metadata Before the Data
Many reports have a company logo, report title, generation date, and other metadata in the first several rows:
Row 1: Acme Corp — Sales Report
Row 2: Generated: 2024-09-15
Row 3: (blank)
Row 4: Region | Product | Q1 Revenue | Q2 Revenue | ...
Row 5: East | Widget A | 45000 | 48000 | ...# Preview to find the header row
df_raw = pd.read_excel("data/sales_report.xlsx", header=None, nrows=10)
print(df_raw)
# Skip the 3 title/metadata rows, use row 4 (index 3) as header
df = pd.read_excel("data/sales_report.xlsx", skiprows=3)
# pandas uses row 4 as the header automatically after skipping 3 rowsPattern 2: Multiple Tables on One Sheet
Sometimes analysts put multiple separate tables on a single sheet:
Row 1-6: Summary table
Row 7: (blank)
Row 8-20: Detail table# Read each table separately by specifying row ranges
summary_df = pd.read_excel("data/report.xlsx", nrows=5) # Rows 2-6 (5 rows)
detail_df = pd.read_excel("data/report.xlsx", skiprows=7) # Rows 8 onward
# Or read the full sheet and split by blank rows
df_full = pd.read_excel("data/report.xlsx", header=None)
blank_rows = df_full[df_full.isnull().all(axis=1)].index.tolist()
print(f"Blank rows (separators): {blank_rows}") # [6]
table1 = df_full.iloc[:blank_rows[0]].dropna(how="all")
table2 = df_full.iloc[blank_rows[0]+1:].dropna(how="all")
# Set first row as header for each table
table1.columns = table1.iloc[0]
table1 = table1.iloc[1:].reset_index(drop=True)Pattern 3: Merged Cells
Merged cells are a common pain point — pandas unmerges them, filling continuation rows with NaN:
Excel display: What pandas reads:
Region | Sales Region | Sales
East | 100 East | 100
| 200 NaN | 200 ← NaN where merged cell continues
| 150 NaN | 150
West | 80 West | 80
| 90 NaN | 90df = pd.read_excel("data/merged_regions.xlsx")
print(df)
# Region Sales
# East 100
# NaN 200
# NaN 150
# West 80
# NaN 90
# Fix: forward-fill the NaN values created by merged cells
df["Region"] = df["Region"].fillna(method="ffill")
# Or equivalently:
df["Region"] = df["Region"].ffill()
print(df)
# Region Sales
# East 100
# East 200
# East 150
# West 80
# West 90Pattern 4: Reading All Sheets and Combining
When a workbook has one sheet per time period (monthly, quarterly, yearly):
import pandas as pd
def read_and_combine_sheets(filepath: str,
add_sheet_column: bool = True,
sheet_column_name: str = "sheet") -> pd.DataFrame:
"""
Read all sheets from an Excel file and combine into one DataFrame.
Parameters
----------
filepath : str
Path to the Excel workbook.
add_sheet_column : bool
If True, add a column with the originating sheet name.
sheet_column_name : str
Name for the sheet name column.
Returns
-------
pd.DataFrame
All sheets combined vertically.
"""
all_sheets = pd.read_excel(filepath, sheet_name=None)
dfs = []
for sheet_name, df in all_sheets.items():
if df.empty:
continue
if add_sheet_column:
df[sheet_column_name] = sheet_name
dfs.append(df)
if not dfs:
raise ValueError("No non-empty sheets found")
combined = pd.concat(dfs, ignore_index=True)
print(f"Combined {len(dfs)} sheets: {combined.shape}")
return combined
# Combine all monthly sales sheets into one DataFrame
df_all = read_and_combine_sheets("data/monthly_sales_2024.xlsx",
sheet_column_name="month")
print(df_all["month"].value_counts())Writing Excel Files: to_excel() and ExcelWriter
Basic Writing
import pandas as pd
df = pd.DataFrame({
"customer_id": ["CUST_001", "CUST_002", "CUST_003"],
"name": ["Jane Smith", "Bob Johnson", "Alice Williams"],
"total_spend": [384.96, 139.98, 269.97],
"signup_date": pd.to_datetime(["2022-03-14", "2021-11-28", "2023-01-05"]),
"is_premium": [True, False, True]
})
# Simple write — always set index=False to avoid writing the row numbers as a column
df.to_excel("output/customers.xlsx", index=False)
# Write to a specific sheet with a custom name
df.to_excel(
"output/customers.xlsx",
sheet_name="Customer Data",
index=False
)
# Freeze the header row (stays visible when scrolling)
# This requires ExcelWriter — see next sectionWriting Multiple Sheets with ExcelWriter
pd.ExcelWriter is the tool for writing multiple DataFrames to separate sheets in a single Excel file:
import pandas as pd
# Sample DataFrames
summary_df = pd.DataFrame({
"metric": ["Total Customers", "Total Revenue", "Avg Order Value", "Premium %"],
"value": [8, 1029.84, 102.98, 50.0]
})
customers_df = pd.DataFrame({
"customer_id": ["CUST_001", "CUST_002", "CUST_003"],
"name": ["Jane Smith", "Bob Johnson", "Alice Williams"],
"city": ["Austin", "Seattle", "Chicago"],
"total_spend": [384.96, 139.98, 269.97]
})
orders_df = pd.DataFrame({
"order_id": ["ORD_1001", "ORD_1002", "ORD_1003"],
"customer_id":["CUST_001", "CUST_002", "CUST_001"],
"amount": [199.98, 49.99, 149.99],
"status": ["delivered", "delivered", "delivered"]
})
# Write all three DataFrames to separate sheets
with pd.ExcelWriter("output/full_report.xlsx", engine="openpyxl") as writer:
summary_df.to_excel(writer, sheet_name="Summary", index=False)
customers_df.to_excel(writer, sheet_name="Customers", index=False)
orders_df.to_excel(writer, sheet_name="Orders", index=False)
print("Multi-sheet workbook written successfully.")The with block ensures the file is properly saved and closed even if an error occurs during writing.
Appending to an Existing Excel File
# Open an existing file and add a new sheet without touching existing sheets
with pd.ExcelWriter(
"output/full_report.xlsx",
engine="openpyxl",
mode="a" # "a" = append mode (don't overwrite)
) as writer:
new_predictions_df.to_excel(writer, sheet_name="Churn Predictions", index=False)
print("New sheet added to existing workbook.")Important: If you open an existing Excel file with mode="a" and write to a sheet name that already exists, it raises an error. Use if_sheet_exists="replace" to overwrite:
with pd.ExcelWriter(
"output/report.xlsx",
engine="openpyxl",
mode="a",
if_sheet_exists="replace" # Overwrite if sheet already exists
) as writer:
updated_summary.to_excel(writer, sheet_name="Summary", index=False)Writing to a Specific Position (startrow, startcol)
Write a DataFrame at a specific row and column offset — useful when adding a title row or multiple tables on one sheet:
with pd.ExcelWriter("output/positioned_report.xlsx", engine="openpyxl") as writer:
# Write a title manually (we'll add it via openpyxl formatting)
# Leave row 0 for the title, start data at row 1 (Excel row 2)
df.to_excel(
writer,
sheet_name="Report",
startrow=1, # Start data at row index 1 (Excel row 2)
startcol=0,
index=False
)
# Write a second table to the right of the first
summary_df.to_excel(
writer,
sheet_name="Report",
startrow=1,
startcol=df.shape[1] + 2, # Two columns gap to the right
index=False
)Formatting Excel Output with openpyxl
pd.to_excel() writes clean data but produces plain, unformatted spreadsheets. For professional-looking output with headers styled, columns auto-sized, and number formats applied, use openpyxl directly after writing with ExcelWriter.
Accessing the Workbook After Writing
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
# Step 1: Write data with pandas
df.to_excel("output/formatted_report.xlsx", sheet_name="Data", index=False)
# Step 2: Open with openpyxl for formatting
wb = load_workbook("output/formatted_report.xlsx")
ws = wb["Data"]Styling Headers
# Style the header row
header_font = Font(bold=True, color="FFFFFF", size=11)
header_fill = PatternFill(fill_type="solid", start_color="2F75B6") # Blue background
header_alignment = Alignment(horizontal="center", vertical="center")
for cell in ws[1]: # Row 1 is the header row (1-indexed in openpyxl)
cell.font = header_font
cell.fill = header_fill
cell.alignment = header_alignment
# Set header row height
ws.row_dimensions[1].height = 20Auto-Sizing Column Widths
openpyxl doesn’t auto-fit columns natively, but this function calculates appropriate widths:
def auto_size_columns(worksheet, min_width: int = 10, max_width: int = 40):
"""
Auto-size all columns in a worksheet based on cell content length.
"""
for column_cells in worksheet.columns:
max_length = 0
col_letter = get_column_letter(column_cells[0].column)
for cell in column_cells:
if cell.value is not None:
try:
cell_length = len(str(cell.value))
max_length = max(max_length, cell_length)
except Exception:
pass
adjusted_width = max(min_width, min(max_length + 2, max_width))
worksheet.column_dimensions[col_letter].width = adjusted_width
auto_size_columns(ws)Number Formatting
# Apply number formats to specific columns
# openpyxl uses Excel's built-in number format codes
for row in ws.iter_rows(min_row=2, min_col=3, max_col=3): # Column C, all data rows
for cell in row:
cell.number_format = "$#,##0.00" # Currency with 2 decimal places
for row in ws.iter_rows(min_row=2, min_col=4, max_col=4): # Column D — percentages
for cell in row:
cell.number_format = "0.0%"
for row in ws.iter_rows(min_row=2, min_col=5, max_col=5): # Column E — dates
for cell in row:
cell.number_format = "YYYY-MM-DD"Adding Freeze Panes
Freeze the header row (and optionally first column) so they stay visible when scrolling:
# Freeze row 1 (header stays visible when scrolling down)
ws.freeze_panes = "A2"
# Freeze both header row AND first column
ws.freeze_panes = "B2"
# Freeze only the first column
ws.freeze_panes = "B1"Adding Alternating Row Colors (Zebra Stripes)
from openpyxl.styles import PatternFill
light_fill = PatternFill(fill_type="solid", start_color="EBF5FB") # Light blue
white_fill = PatternFill(fill_type="solid", start_color="FFFFFF") # White
for row_idx in range(2, ws.max_row + 1): # Start from row 2 (skip header)
fill = light_fill if row_idx % 2 == 0 else white_fill
for cell in ws[row_idx]:
cell.fill = fillComplete Formatting Example
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
def write_formatted_excel(df: pd.DataFrame, filepath: str,
sheet_name: str = "Data",
title: str = None) -> None:
"""
Write a DataFrame to a professionally formatted Excel file.
Parameters
----------
df : pd.DataFrame
Data to write.
filepath : str
Output file path.
sheet_name : str
Name for the sheet.
title : str, optional
Title to display above the table.
"""
# Write data with pandas
start_row = 2 if title else 0 # Leave row 1 for title if provided
df.to_excel(filepath, sheet_name=sheet_name,
startrow=start_row, index=False)
# Open for formatting
wb = load_workbook(filepath)
ws = wb[sheet_name]
header_row = start_row + 1 # Excel rows are 1-indexed
# Add title if provided
if title:
ws["A1"] = title
ws["A1"].font = Font(bold=True, size=14)
ws["A1"].alignment = Alignment(horizontal="left")
ws.row_dimensions[1].height = 22
# Style header row
header_font = Font(bold=True, color="FFFFFF", size=11)
header_fill = PatternFill(fill_type="solid", start_color="2F75B6")
header_align = Alignment(horizontal="center", vertical="center")
for cell in ws[header_row]:
cell.font = header_font
cell.fill = header_fill
cell.alignment = header_align
ws.row_dimensions[header_row].height = 18
# Alternating row colors for data rows
light_fill = PatternFill(fill_type="solid", start_color="DEEAF1")
for row_idx in range(header_row + 1, ws.max_row + 1):
fill = light_fill if (row_idx - header_row) % 2 == 0 else None
if fill:
for cell in ws[row_idx]:
cell.fill = fill
# Freeze the header row
ws.freeze_panes = f"A{header_row + 1}"
# Auto-size columns
for column_cells in ws.columns:
max_len = 0
col_letter = get_column_letter(column_cells[0].column)
for cell in column_cells:
if cell.value:
max_len = max(max_len, len(str(cell.value)))
ws.column_dimensions[col_letter].width = min(max(max_len + 2, 10), 40)
wb.save(filepath)
print(f"Formatted Excel written: {filepath}")
# Usage
df_customers = pd.DataFrame({
"Customer ID": ["CUST_001", "CUST_002", "CUST_003"],
"Name": ["Jane Smith", "Bob Johnson", "Alice Williams"],
"City": ["Austin", "Seattle", "Chicago"],
"Total Spend": [384.96, 139.98, 269.97],
"Premium": [True, False, True],
"Last Order": pd.to_datetime(["2024-09-10", "2024-09-01", "2024-09-15"])
})
write_formatted_excel(
df_customers,
"output/customer_report_formatted.xlsx",
title="Customer Summary — September 2024"
)Common Problems and Solutions
Problem 1: Mixed Data Types in a Column
Excel allows mixed types within a column (some cells text, others numbers). pandas infers a type per column, which can cause unexpected results:
# Column "zip_code" inferred as int64 — drops leading zeros
# CUST_001: 02134 → 2134 (Boston ZIP incorrectly truncated)
# Fix: Force string type for columns that look numeric but aren't
df = pd.read_excel("data/customers.xlsx", dtype={"zip_code": str})
# Result: "02134" preserved as stringProblem 2: Dates Stored as Numbers
Excel sometimes exports dates as serial numbers (days since 1900-01-01) rather than date strings:
# Column reads as 45183 instead of 2023-09-15
# Fix 1: Parse dates automatically
df = pd.read_excel("data/orders.xlsx", parse_dates=["order_date"])
# Fix 2: If auto-parsing fails, convert the Excel date serial manually
from datetime import datetime, timedelta
def excel_date_to_datetime(excel_serial: int) -> datetime:
"""Convert Excel date serial number to Python datetime."""
# Excel serial 1 = Jan 1, 1900; but Excel incorrectly treats 1900 as a leap year
return datetime(1899, 12, 30) + timedelta(days=int(excel_serial))
df["order_date"] = df["order_date"].apply(excel_date_to_datetime)Problem 3: Formula Cells Return NaN
When an Excel file contains formulas, openpyxl reads the cached formula string, not the calculated value, when opening normally:
from openpyxl import load_workbook
# Reading a file with formulas
wb = load_workbook("data/formula_report.xlsx")
ws = wb.active
print(ws["B5"].value) # Returns "=SUM(B2:B4)" — the formula string!
# Fix: Use data_only=True to read the cached calculated values
wb = load_workbook("data/formula_report.xlsx", data_only=True)
ws = wb.active
print(ws["B5"].value) # Returns 157.98 — the cached calculated value
# Note: cached values are from the last time the file was opened in ExcelFor data science purposes, use data_only=True when you want the calculated values rather than the formulas.
Problem 4: File Already Open in Excel (Windows)
On Windows, if the Excel file is open in the Excel application, Python will fail to write to it:
try:
df.to_excel("output/report.xlsx", index=False)
except PermissionError:
print("ERROR: The file is open in Excel. Close it and try again.")
# Or write to a differently named file:
df.to_excel("output/report_new.xlsx", index=False)Problem 5: Unnamed Columns from Merged Headers
When a workbook has merged header cells, the merged region typically shows the value in the first cell only, leaving the subsequent cells empty — which pandas reads as Unnamed: 1, Unnamed: 2, etc.:
# Read the file first to see the actual column names
df_check = pd.read_excel("data/merged_header_report.xlsx", nrows=2)
print(df_check.columns.tolist())
# ['Region', 'Unnamed: 1', 'Q1 Revenue', 'Q2 Revenue', ...]
# Rename columns manually after reading
df = pd.read_excel("data/merged_header_report.xlsx", skiprows=1)
df.columns = ["Region", "Product", "Q1_Revenue", "Q2_Revenue", "Q3_Revenue"]
# Or use a forward-fill approach on the header row
df_raw = pd.read_excel("data/merged_header_report.xlsx", header=None, nrows=2)
headers = df_raw.iloc[0].fillna(method="ffill") # Forward-fill the merged cells
print(headers.tolist())Problem 6: Large Excel Files Running Slowly
Excel files with hundreds of thousands of rows load slowly because Excel format is not designed for performance. Solutions:
# Option 1: Read only the columns you need
df = pd.read_excel("data/large_report.xlsx", usecols=["date", "revenue", "region"])
# Option 2: Convert to CSV or Parquet first (much faster for large data)
# If you have to work with a huge Excel file repeatedly:
import pandas as pd
# Do this once
df = pd.read_excel("data/massive_file.xlsx")
df.to_parquet("data/massive_file.parquet") # Save as Parquet
# All subsequent reads are 10-100x faster
df = pd.read_parquet("data/massive_file.parquet")
# Option 3: Use read_only mode in openpyxl for very large files
from openpyxl import load_workbook
wb = load_workbook("data/massive_file.xlsx", read_only=True)
ws = wb.active
# Iterate rows without loading whole file into memory
for row in ws.iter_rows(values_only=True):
process_row(row)
wb.close()Working with Excel Tables (ListObjects)
Excel has a native “Table” feature (Insert → Table) that creates structured table objects with named ranges and automatic filtering. openpyxl can work with these:
from openpyxl import load_workbook
from openpyxl.worksheet.table import Table, TableStyleInfo
# Read data from a named Excel table
wb = load_workbook("data/workbook_with_tables.xlsx")
ws = wb.active
# List all tables in the sheet
for table_name, table in ws.tables.items():
print(f"Table: {table_name}, Range: {table.ref}")
# Create an Excel Table when writing
with pd.ExcelWriter("output/table_report.xlsx", engine="openpyxl") as writer:
df.to_excel(writer, sheet_name="Sales", index=False)
# Open the file to add Table formatting
wb = load_workbook("output/table_report.xlsx")
ws = wb["Sales"]
# Add an Excel Table over the data range
tab = Table(
displayName="SalesTable",
ref=f"A1:{get_column_letter(ws.max_column)}{ws.max_row}"
)
# Apply a built-in table style
style = TableStyleInfo(
name="TableStyleMedium9", # Blue medium style
showFirstColumn=False,
showLastColumn=False,
showRowStripes=True,
showColumnStripes=False
)
tab.tableStyleInfo = style
ws.add_table(tab)
ws.freeze_panes = "A2"
wb.save("output/table_report.xlsx")A Complete Pipeline Example
A realistic end-to-end example: read messy Excel input, clean it, analyze it, and write formatted output:
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.utils import get_column_letter
from pathlib import Path
def process_sales_excel(input_path: str, output_path: str) -> dict:
"""
Complete pipeline: read → clean → analyze → write formatted output.
Reads a messy monthly sales Excel file (title row, merged regions,
multiple sheets), produces a clean summary workbook.
"""
# ── Step 1: Read all sheets ────────────────────────────────────
print(f"Reading {input_path}...")
all_sheets = pd.read_excel(input_path, sheet_name=None)
print(f"Sheets found: {list(all_sheets.keys())}")
# ── Step 2: Combine monthly sheets ────────────────────────────
monthly_dfs = []
for sheet_name, df in all_sheets.items():
if df.empty or sheet_name.lower() == "summary":
continue
# Each sheet has 2 title rows — skip them
df = pd.read_excel(input_path, sheet_name=sheet_name,
skiprows=2, dtype={"product_id": str})
df["month"] = sheet_name
monthly_dfs.append(df)
combined = pd.concat(monthly_dfs, ignore_index=True)
print(f"Combined: {combined.shape}")
# ── Step 3: Clean ──────────────────────────────────────────────
combined.columns = combined.columns.str.strip().str.lower().str.replace(" ", "_")
combined = combined.dropna(subset=["product_id", "revenue"])
combined["revenue"] = pd.to_numeric(combined["revenue"], errors="coerce")
combined["units"] = pd.to_numeric(combined["units"], errors="coerce")
combined["date"] = pd.to_datetime(combined["date"], errors="coerce")
# ── Step 4: Analyze ────────────────────────────────────────────
by_region = (
combined
.groupby("region")
.agg(
total_revenue=("revenue", "sum"),
total_units=("units", "sum"),
avg_order=("revenue", "mean"),
num_transactions=("product_id", "count")
)
.round(2)
.sort_values("total_revenue", ascending=False)
.reset_index()
)
by_product = (
combined
.groupby(["product_id", "product_name"])
.agg(
total_revenue=("revenue", "sum"),
total_units=("units", "sum")
)
.round(2)
.sort_values("total_revenue", ascending=False)
.reset_index()
)
# ── Step 5: Write formatted output ────────────────────────────
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
combined.to_excel(writer, sheet_name="Raw Data", index=False)
by_region.to_excel(writer, sheet_name="By Region", index=False)
by_product.to_excel(writer, sheet_name="By Product", index=False)
# Apply formatting
wb = load_workbook(output_path)
for sheet_name in wb.sheetnames:
ws = wb[sheet_name]
# Style header row
for cell in ws[1]:
cell.font = Font(bold=True, color="FFFFFF")
cell.fill = PatternFill("solid", start_color="2F75B6")
cell.alignment = Alignment(horizontal="center")
# Freeze header
ws.freeze_panes = "A2"
# Auto-size columns
for col in ws.columns:
col_letter = get_column_letter(col[0].column)
max_len = max((len(str(c.value)) for c in col if c.value), default=8)
ws.column_dimensions[col_letter].width = min(max_len + 2, 40)
wb.save(output_path)
print(f"Output written: {output_path}")
return {
"total_revenue": combined["revenue"].sum(),
"total_units": combined["units"].sum(),
"num_transactions": len(combined),
"num_regions": combined["region"].nunique()
}
# Run the pipeline
# metrics = process_sales_excel(
# "data/raw/sales_2024.xlsx",
# "output/sales_analysis_2024.xlsx"
# )
# print(f"Total revenue: ${metrics['total_revenue']:,.2f}")Summary
Excel file handling in pandas combines the powerful data manipulation capabilities of pandas with openpyxl’s Excel-specific features. pd.read_excel() handles the full range of real-world Excel files — from clean single-table files to messy multi-header, merged-cell, multi-table workbooks — through its extensive parameter set. pd.ExcelWriter writes clean, multi-sheet workbooks, and openpyxl adds the professional formatting layer (styled headers, auto-sized columns, freeze panes, number formats) that makes output presentable to non-technical stakeholders.
The practical pattern for most data science Excel work is: read with pd.read_excel() using the appropriate parameters for the file’s structure, clean and transform in pandas, write with ExcelWriter, and optionally apply formatting with openpyxl. The common problems — mixed types, Excel serial dates, formula cells, merged headers — each have straightforward fixes that become automatic with experience.
Key Takeaways
pd.read_excel("file.xlsx")reads the first sheet; usesheet_name="Name"for a specific sheet, orsheet_name=Noneto read all sheets as a dict of DataFrames- Always specify
dtype={"id": str}for columns like customer IDs, ZIP codes, and phone numbers — without it, pandas may infer them as integers, truncating leading zeros and breaking joins skiprows,header,usecols,nrows, andskipfooterhandle virtually any non-standard Excel layout — inspect the file withnrows=10, header=Nonefirst to understand the structure- Forward-fill (
df["col"].ffill()) fixes NaN values created by merged cells that pandas unmerges on read pd.ExcelWriterwithengine="openpyxl"writes multiple DataFrames to separate sheets; usemode="a"to append new sheets to an existing file without disturbing existing onesopenpyxlpost-processing adds professional formatting:Font,PatternFill,Alignmentfor header styling;freeze_panesto keep headers visible while scrolling; auto-sized column widths; and number formats for currency, percentages, and datesload_workbook("file.xlsx", data_only=True)reads cached formula results rather than formula strings — use this when you need the calculated values from a formula-heavy workbook- For large Excel files, convert to Parquet once (
df.to_parquet()) and read Parquet for all subsequent operations — the speed difference is 10–100×








