Working with Excel Files in Pandas

Learn to read and write Excel files in pandas. Master read_excel, multiple sheets, ExcelWriter, formatting with openpyxl, common problems, and best practices for data science.

Working with Excel Files in Pandas

pandas reads Excel files with pd.read_excel("file.xlsx") and writes them with df.to_excel("output.xlsx", index=False). For multiple sheets, sheet_name=None returns all sheets as a dictionary of DataFrames, and pd.ExcelWriter writes multiple DataFrames to separate sheets in a single file. The openpyxl engine (for .xlsx) and xlrd engine (for older .xls files) power Excel I/O in pandas, with openpyxl also enabling cell-level formatting, formulas, and styling beyond what pandas alone provides.

Introduction

Excel is the world’s most widely used data tool. Virtually every organization stores and shares data in Excel files — financial reports, survey results, sales records, experiment logs, client lists, configuration tables. As a data scientist, you’ll encounter Excel files constantly: receiving them from colleagues, clients, and data providers; needing to clean and analyze their contents; and often expected to deliver your results back in Excel format.

pandas has robust Excel support built in. Whether you’re reading a single clean table, extracting data from one of a dozen sheets, dealing with headers buried three rows down, or writing a polished multi-sheet report with formatted summaries, pandas — and its underlying library openpyxl — handles it all.

This article covers the complete spectrum of Excel work in pandas: reading single and multiple sheets with all the important options, handling messy real-world Excel files with non-standard layouts, writing DataFrames back to Excel with precise control, creating multi-sheet workbooks, adding basic formatting, and navigating the common problems that arise when working with Excel files programmatically.

Setting Up: Required Libraries

Python
pip install pandas openpyxl xlrd
  • pandas: The primary interface for reading and writing Excel files
  • openpyxl: The engine pandas uses for .xlsx files (Excel 2007 and newer) — also used directly for advanced formatting
  • xlrd: Required for reading older .xls format files (Excel 97-2003)

Check your versions:

Python
import pandas as pd
import openpyxl

print(f"pandas: {pd.__version__}")
print(f"openpyxl: {openpyxl.__version__}")

Reading Excel Files: pd.read_excel()

The Basic Pattern

Python
import pandas as pd

# Read the first (default) sheet of an Excel file
df = pd.read_excel("data/sales_report.xlsx")

print(df.shape)
print(df.dtypes)
print(df.head())

pd.read_excel() infers data types automatically — numbers become int64 or float64, text becomes object, and dates become datetime64[ns]. Always inspect df.dtypes after reading to verify types were inferred correctly.

Specifying the Sheet

Python
# By sheet name (string)
df = pd.read_excel("data/workbook.xlsx", sheet_name="Q3 Sales")

# By index (0-based integer — 0 is the first sheet)
df = pd.read_excel("data/workbook.xlsx", sheet_name=0)   # First sheet
df = pd.read_excel("data/workbook.xlsx", sheet_name=2)   # Third sheet

# Read ALL sheets — returns dict of {sheet_name: DataFrame}
all_sheets = pd.read_excel("data/workbook.xlsx", sheet_name=None)
print(list(all_sheets.keys()))  # ['Q1 Sales', 'Q2 Sales', 'Q3 Sales', 'Summary']

# Access individual sheets
q3_df = all_sheets["Q3 Sales"]

# Read specific named sheets as a dict
selected = pd.read_excel(
    "data/workbook.xlsx",
    sheet_name=["Q1 Sales", "Q3 Sales"]
)

Controlling Which Rows and Columns to Read

Real Excel files are rarely perfectly structured. Headers might start on row 3, footnotes at the bottom need skipping, and only a subset of columns are relevant:

Python
# Skip the first 2 rows (e.g., a title row and blank row before the header)
df = pd.read_excel("data/report.xlsx", skiprows=2)

# Skip specific rows by index (0-based)
df = pd.read_excel("data/report.xlsx", skiprows=[0, 1, 3])  # Skip rows 0, 1, 3

# Header is on row 3 (0-indexed row 2)
df = pd.read_excel("data/report.xlsx", header=2)

# Skip N rows from the bottom (footnotes, totals, etc.)
df = pd.read_excel("data/report.xlsx", skipfooter=3)

# Read only the first N rows of data (after header)
df = pd.read_excel("data/report.xlsx", nrows=100)

# Read specific columns by letter range or column index
df = pd.read_excel("data/report.xlsx", usecols="A:E")       # Columns A through E
df = pd.read_excel("data/report.xlsx", usecols="A,C,E:G")   # A, C, and E through G
df = pd.read_excel("data/report.xlsx", usecols=[0, 2, 4])   # Columns by index
df = pd.read_excel("data/report.xlsx", usecols=["Name", "Revenue", "Region"])  # By name

Controlling Data Types and Parsing

Python
# Explicitly specify column data types to avoid misinterpretation
df = pd.read_excel(
    "data/customers.xlsx",
    dtype={
        "customer_id":  str,     # Keep as string, not integer (avoids: '00123' → 123)
        "phone":        str,     # Phone numbers as strings
        "zip_code":     str,     # ZIP codes as strings
        "amount":       float,
        "is_active":    bool
    }
)

# Parse date columns automatically
df = pd.read_excel(
    "data/orders.xlsx",
    parse_dates=["order_date", "delivery_date"]
)

# Handle Excel's numeric date storage
# Excel stores dates as integers (days since Jan 1, 1900)
# pandas usually handles this automatically, but sometimes needs help:
df = pd.read_excel(
    "data/legacy_export.xlsx",
    parse_dates=["date_column"],
    date_format="%Y-%m-%d"     # Specify format if needed
)

Handling Multiple Header Rows (MultiIndex)

Some Excel files use two or more header rows to create hierarchical column labels:

Plaintext
              Jan 2024          Feb 2024
Region   Revenue   Units    Revenue   Units
East      45000    142       48000    158
West      38000    119       41000    126
Python
# header=[0, 1] tells pandas to use the first two rows as a multi-level header
df = pd.read_excel("data/multi_header_report.xlsx", header=[0, 1])

print(df.columns)
# MultiIndex([('Jan 2024', 'Revenue'), ('Jan 2024', 'Units'),
#             ('Feb 2024', 'Revenue'), ('Feb 2024', 'Units')],)

# Access columns with tuple notation
print(df[("Jan 2024", "Revenue")])

# Flatten multi-level headers to single-level
df.columns = ["_".join(col).strip() for col in df.columns]
print(df.columns.tolist())
# ['Jan 2024_Revenue', 'Jan 2024_Units', 'Feb 2024_Revenue', 'Feb 2024_Units']

Handling Missing Values

Excel files use various representations for missing data — blank cells, the string “N/A”, “n/a”, “NULL”, “-“, etc. Tell pandas to treat them all as NaN:

Python
df = pd.read_excel(
    "data/survey_results.xlsx",
    na_values=["N/A", "n/a", "NULL", "null", "-", "--", "none", "None", "NA", ""]
)

# Check result
print(df.isnull().sum())

A Complete read_excel() Parameter Reference

The most useful pd.read_excel() parameters in a single view:

ParameterTypeDescriptionExample
iostr/PathFile path or URL"data/sales.xlsx"
sheet_namestr/int/list/NoneSheet to read"Sales", 0, None (all)
headerint/listRow(s) for column names0 (default), 2, [0,1]
skiprowsint/listRows to skip before header2, [0, 1, 3]
skipfooterintRows to skip at bottom3
nrowsintMax rows to read1000
usecolsstr/listColumns to read"A:E", ["Name","Rev"]
dtypedictColumn data types{"id": str, "amt": float}
parse_dateslistColumns to parse as dates["order_date"]
na_valueslist/strAdditional NA strings["N/A", "-", "NULL"]
index_colint/strColumn to use as row index0, "customer_id"
enginestrParsing engine to use"openpyxl" (xlsx), "xlrd" (xls)

Reading Real-World Messy Excel Files

Clean Excel files with one table starting at row 1 are the exception. Real files are messy. Here are patterns for common situations.

Pattern 1: Title and Metadata Before the Data

Many reports have a company logo, report title, generation date, and other metadata in the first several rows:

Plaintext
Row 1: Acme Corp — Sales Report
Row 2: Generated: 2024-09-15
Row 3: (blank)
Row 4: Region | Product | Q1 Revenue | Q2 Revenue | ...
Row 5: East   | Widget A | 45000      | 48000      | ...
Python
# Preview to find the header row
df_raw = pd.read_excel("data/sales_report.xlsx", header=None, nrows=10)
print(df_raw)

# Skip the 3 title/metadata rows, use row 4 (index 3) as header
df = pd.read_excel("data/sales_report.xlsx", skiprows=3)
# pandas uses row 4 as the header automatically after skipping 3 rows

Pattern 2: Multiple Tables on One Sheet

Sometimes analysts put multiple separate tables on a single sheet:

Plaintext
Row 1-6:   Summary table
Row 7:     (blank)
Row 8-20:  Detail table
Python
# Read each table separately by specifying row ranges
summary_df = pd.read_excel("data/report.xlsx", nrows=5)           # Rows 2-6 (5 rows)
detail_df   = pd.read_excel("data/report.xlsx", skiprows=7)       # Rows 8 onward

# Or read the full sheet and split by blank rows
df_full = pd.read_excel("data/report.xlsx", header=None)
blank_rows = df_full[df_full.isnull().all(axis=1)].index.tolist()
print(f"Blank rows (separators): {blank_rows}")  # [6]

table1 = df_full.iloc[:blank_rows[0]].dropna(how="all")
table2 = df_full.iloc[blank_rows[0]+1:].dropna(how="all")
# Set first row as header for each table
table1.columns = table1.iloc[0]
table1 = table1.iloc[1:].reset_index(drop=True)

Pattern 3: Merged Cells

Merged cells are a common pain point — pandas unmerges them, filling continuation rows with NaN:

Plaintext
Excel display:          What pandas reads:
Region  | Sales         Region  | Sales
East    | 100           East    | 100
        | 200           NaN     | 200   ← NaN where merged cell continues
        | 150           NaN     | 150
West    | 80            West    | 80
        | 90            NaN     | 90
Python
df = pd.read_excel("data/merged_regions.xlsx")
print(df)
# Region  Sales
# East    100
# NaN     200
# NaN     150
# West    80
# NaN     90

# Fix: forward-fill the NaN values created by merged cells
df["Region"] = df["Region"].fillna(method="ffill")
# Or equivalently:
df["Region"] = df["Region"].ffill()

print(df)
# Region  Sales
# East    100
# East    200
# East    150
# West    80
# West    90

Pattern 4: Reading All Sheets and Combining

When a workbook has one sheet per time period (monthly, quarterly, yearly):

Python
import pandas as pd

def read_and_combine_sheets(filepath: str,
                            add_sheet_column: bool = True,
                            sheet_column_name: str = "sheet") -> pd.DataFrame:
    """
    Read all sheets from an Excel file and combine into one DataFrame.

    Parameters
    ----------
    filepath : str
        Path to the Excel workbook.
    add_sheet_column : bool
        If True, add a column with the originating sheet name.
    sheet_column_name : str
        Name for the sheet name column.

    Returns
    -------
    pd.DataFrame
        All sheets combined vertically.
    """
    all_sheets = pd.read_excel(filepath, sheet_name=None)
    dfs = []

    for sheet_name, df in all_sheets.items():
        if df.empty:
            continue
        if add_sheet_column:
            df[sheet_column_name] = sheet_name
        dfs.append(df)

    if not dfs:
        raise ValueError("No non-empty sheets found")

    combined = pd.concat(dfs, ignore_index=True)
    print(f"Combined {len(dfs)} sheets: {combined.shape}")
    return combined

# Combine all monthly sales sheets into one DataFrame
df_all = read_and_combine_sheets("data/monthly_sales_2024.xlsx",
                                  sheet_column_name="month")
print(df_all["month"].value_counts())

Writing Excel Files: to_excel() and ExcelWriter

Basic Writing

Python
import pandas as pd

df = pd.DataFrame({
    "customer_id":  ["CUST_001", "CUST_002", "CUST_003"],
    "name":         ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "total_spend":  [384.96, 139.98, 269.97],
    "signup_date":  pd.to_datetime(["2022-03-14", "2021-11-28", "2023-01-05"]),
    "is_premium":   [True, False, True]
})

# Simple write — always set index=False to avoid writing the row numbers as a column
df.to_excel("output/customers.xlsx", index=False)

# Write to a specific sheet with a custom name
df.to_excel(
    "output/customers.xlsx",
    sheet_name="Customer Data",
    index=False
)

# Freeze the header row (stays visible when scrolling)
# This requires ExcelWriter — see next section

Writing Multiple Sheets with ExcelWriter

pd.ExcelWriter is the tool for writing multiple DataFrames to separate sheets in a single Excel file:

Python
import pandas as pd

# Sample DataFrames
summary_df = pd.DataFrame({
    "metric": ["Total Customers", "Total Revenue", "Avg Order Value", "Premium %"],
    "value":  [8, 1029.84, 102.98, 50.0]
})

customers_df = pd.DataFrame({
    "customer_id": ["CUST_001", "CUST_002", "CUST_003"],
    "name":        ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "city":        ["Austin", "Seattle", "Chicago"],
    "total_spend": [384.96, 139.98, 269.97]
})

orders_df = pd.DataFrame({
    "order_id":   ["ORD_1001", "ORD_1002", "ORD_1003"],
    "customer_id":["CUST_001", "CUST_002", "CUST_001"],
    "amount":     [199.98, 49.99, 149.99],
    "status":     ["delivered", "delivered", "delivered"]
})

# Write all three DataFrames to separate sheets
with pd.ExcelWriter("output/full_report.xlsx", engine="openpyxl") as writer:
    summary_df.to_excel(writer, sheet_name="Summary",   index=False)
    customers_df.to_excel(writer, sheet_name="Customers", index=False)
    orders_df.to_excel(writer, sheet_name="Orders",     index=False)

print("Multi-sheet workbook written successfully.")

The with block ensures the file is properly saved and closed even if an error occurs during writing.

Appending to an Existing Excel File

Python
# Open an existing file and add a new sheet without touching existing sheets
with pd.ExcelWriter(
    "output/full_report.xlsx",
    engine="openpyxl",
    mode="a"                    # "a" = append mode (don't overwrite)
) as writer:
    new_predictions_df.to_excel(writer, sheet_name="Churn Predictions", index=False)

print("New sheet added to existing workbook.")

Important: If you open an existing Excel file with mode="a" and write to a sheet name that already exists, it raises an error. Use if_sheet_exists="replace" to overwrite:

Python
with pd.ExcelWriter(
    "output/report.xlsx",
    engine="openpyxl",
    mode="a",
    if_sheet_exists="replace"  # Overwrite if sheet already exists
) as writer:
    updated_summary.to_excel(writer, sheet_name="Summary", index=False)

Writing to a Specific Position (startrow, startcol)

Write a DataFrame at a specific row and column offset — useful when adding a title row or multiple tables on one sheet:

Python
with pd.ExcelWriter("output/positioned_report.xlsx", engine="openpyxl") as writer:
    # Write a title manually (we'll add it via openpyxl formatting)
    # Leave row 0 for the title, start data at row 1 (Excel row 2)
    df.to_excel(
        writer,
        sheet_name="Report",
        startrow=1,      # Start data at row index 1 (Excel row 2)
        startcol=0,
        index=False
    )

    # Write a second table to the right of the first
    summary_df.to_excel(
        writer,
        sheet_name="Report",
        startrow=1,
        startcol=df.shape[1] + 2,   # Two columns gap to the right
        index=False
    )

Formatting Excel Output with openpyxl

pd.to_excel() writes clean data but produces plain, unformatted spreadsheets. For professional-looking output with headers styled, columns auto-sized, and number formats applied, use openpyxl directly after writing with ExcelWriter.

Accessing the Workbook After Writing

Python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

# Step 1: Write data with pandas
df.to_excel("output/formatted_report.xlsx", sheet_name="Data", index=False)

# Step 2: Open with openpyxl for formatting
wb = load_workbook("output/formatted_report.xlsx")
ws = wb["Data"]

Styling Headers

Python
# Style the header row
header_font = Font(bold=True, color="FFFFFF", size=11)
header_fill = PatternFill(fill_type="solid", start_color="2F75B6")  # Blue background
header_alignment = Alignment(horizontal="center", vertical="center")

for cell in ws[1]:   # Row 1 is the header row (1-indexed in openpyxl)
    cell.font = header_font
    cell.fill = header_fill
    cell.alignment = header_alignment

# Set header row height
ws.row_dimensions[1].height = 20

Auto-Sizing Column Widths

openpyxl doesn’t auto-fit columns natively, but this function calculates appropriate widths:

Python
def auto_size_columns(worksheet, min_width: int = 10, max_width: int = 40):
    """
    Auto-size all columns in a worksheet based on cell content length.
    """
    for column_cells in worksheet.columns:
        max_length = 0
        col_letter = get_column_letter(column_cells[0].column)

        for cell in column_cells:
            if cell.value is not None:
                try:
                    cell_length = len(str(cell.value))
                    max_length = max(max_length, cell_length)
                except Exception:
                    pass

        adjusted_width = max(min_width, min(max_length + 2, max_width))
        worksheet.column_dimensions[col_letter].width = adjusted_width

auto_size_columns(ws)

Number Formatting

Python
# Apply number formats to specific columns
# openpyxl uses Excel's built-in number format codes

for row in ws.iter_rows(min_row=2, min_col=3, max_col=3):  # Column C, all data rows
    for cell in row:
        cell.number_format = "$#,##0.00"    # Currency with 2 decimal places

for row in ws.iter_rows(min_row=2, min_col=4, max_col=4):  # Column D — percentages
    for cell in row:
        cell.number_format = "0.0%"

for row in ws.iter_rows(min_row=2, min_col=5, max_col=5):  # Column E — dates
    for cell in row:
        cell.number_format = "YYYY-MM-DD"

Adding Freeze Panes

Freeze the header row (and optionally first column) so they stay visible when scrolling:

Python
# Freeze row 1 (header stays visible when scrolling down)
ws.freeze_panes = "A2"

# Freeze both header row AND first column
ws.freeze_panes = "B2"

# Freeze only the first column
ws.freeze_panes = "B1"

Adding Alternating Row Colors (Zebra Stripes)

Python
from openpyxl.styles import PatternFill

light_fill = PatternFill(fill_type="solid", start_color="EBF5FB")  # Light blue
white_fill = PatternFill(fill_type="solid", start_color="FFFFFF")  # White

for row_idx in range(2, ws.max_row + 1):   # Start from row 2 (skip header)
    fill = light_fill if row_idx % 2 == 0 else white_fill
    for cell in ws[row_idx]:
        cell.fill = fill

Complete Formatting Example

Python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

def write_formatted_excel(df: pd.DataFrame, filepath: str,
                           sheet_name: str = "Data",
                           title: str = None) -> None:
    """
    Write a DataFrame to a professionally formatted Excel file.

    Parameters
    ----------
    df : pd.DataFrame
        Data to write.
    filepath : str
        Output file path.
    sheet_name : str
        Name for the sheet.
    title : str, optional
        Title to display above the table.
    """
    # Write data with pandas
    start_row = 2 if title else 0   # Leave row 1 for title if provided
    df.to_excel(filepath, sheet_name=sheet_name,
                startrow=start_row, index=False)

    # Open for formatting
    wb = load_workbook(filepath)
    ws = wb[sheet_name]
    header_row = start_row + 1   # Excel rows are 1-indexed

    # Add title if provided
    if title:
        ws["A1"] = title
        ws["A1"].font = Font(bold=True, size=14)
        ws["A1"].alignment = Alignment(horizontal="left")
        ws.row_dimensions[1].height = 22

    # Style header row
    header_font  = Font(bold=True, color="FFFFFF", size=11)
    header_fill  = PatternFill(fill_type="solid", start_color="2F75B6")
    header_align = Alignment(horizontal="center", vertical="center")

    for cell in ws[header_row]:
        cell.font      = header_font
        cell.fill      = header_fill
        cell.alignment = header_align
    ws.row_dimensions[header_row].height = 18

    # Alternating row colors for data rows
    light_fill = PatternFill(fill_type="solid", start_color="DEEAF1")
    for row_idx in range(header_row + 1, ws.max_row + 1):
        fill = light_fill if (row_idx - header_row) % 2 == 0 else None
        if fill:
            for cell in ws[row_idx]:
                cell.fill = fill

    # Freeze the header row
    ws.freeze_panes = f"A{header_row + 1}"

    # Auto-size columns
    for column_cells in ws.columns:
        max_len = 0
        col_letter = get_column_letter(column_cells[0].column)
        for cell in column_cells:
            if cell.value:
                max_len = max(max_len, len(str(cell.value)))
        ws.column_dimensions[col_letter].width = min(max(max_len + 2, 10), 40)

    wb.save(filepath)
    print(f"Formatted Excel written: {filepath}")


# Usage
df_customers = pd.DataFrame({
    "Customer ID":  ["CUST_001", "CUST_002", "CUST_003"],
    "Name":         ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "City":         ["Austin", "Seattle", "Chicago"],
    "Total Spend":  [384.96, 139.98, 269.97],
    "Premium":      [True, False, True],
    "Last Order":   pd.to_datetime(["2024-09-10", "2024-09-01", "2024-09-15"])
})

write_formatted_excel(
    df_customers,
    "output/customer_report_formatted.xlsx",
    title="Customer Summary — September 2024"
)

Common Problems and Solutions

Problem 1: Mixed Data Types in a Column

Excel allows mixed types within a column (some cells text, others numbers). pandas infers a type per column, which can cause unexpected results:

Python
# Column "zip_code" inferred as int64 — drops leading zeros
# CUST_001: 02134 → 2134 (Boston ZIP incorrectly truncated)

# Fix: Force string type for columns that look numeric but aren't
df = pd.read_excel("data/customers.xlsx", dtype={"zip_code": str})

# Result: "02134" preserved as string

Problem 2: Dates Stored as Numbers

Excel sometimes exports dates as serial numbers (days since 1900-01-01) rather than date strings:

Python
# Column reads as 45183 instead of 2023-09-15

# Fix 1: Parse dates automatically
df = pd.read_excel("data/orders.xlsx", parse_dates=["order_date"])

# Fix 2: If auto-parsing fails, convert the Excel date serial manually
from datetime import datetime, timedelta

def excel_date_to_datetime(excel_serial: int) -> datetime:
    """Convert Excel date serial number to Python datetime."""
    # Excel serial 1 = Jan 1, 1900; but Excel incorrectly treats 1900 as a leap year
    return datetime(1899, 12, 30) + timedelta(days=int(excel_serial))

df["order_date"] = df["order_date"].apply(excel_date_to_datetime)

Problem 3: Formula Cells Return NaN

When an Excel file contains formulas, openpyxl reads the cached formula string, not the calculated value, when opening normally:

Python
from openpyxl import load_workbook

# Reading a file with formulas
wb = load_workbook("data/formula_report.xlsx")
ws = wb.active
print(ws["B5"].value)  # Returns "=SUM(B2:B4)" — the formula string!

# Fix: Use data_only=True to read the cached calculated values
wb = load_workbook("data/formula_report.xlsx", data_only=True)
ws = wb.active
print(ws["B5"].value)  # Returns 157.98 — the cached calculated value
# Note: cached values are from the last time the file was opened in Excel

For data science purposes, use data_only=True when you want the calculated values rather than the formulas.

Problem 4: File Already Open in Excel (Windows)

On Windows, if the Excel file is open in the Excel application, Python will fail to write to it:

Python
try:
    df.to_excel("output/report.xlsx", index=False)
except PermissionError:
    print("ERROR: The file is open in Excel. Close it and try again.")
    # Or write to a differently named file:
    df.to_excel("output/report_new.xlsx", index=False)

Problem 5: Unnamed Columns from Merged Headers

When a workbook has merged header cells, the merged region typically shows the value in the first cell only, leaving the subsequent cells empty — which pandas reads as Unnamed: 1, Unnamed: 2, etc.:

Python
# Read the file first to see the actual column names
df_check = pd.read_excel("data/merged_header_report.xlsx", nrows=2)
print(df_check.columns.tolist())
# ['Region', 'Unnamed: 1', 'Q1 Revenue', 'Q2 Revenue', ...]

# Rename columns manually after reading
df = pd.read_excel("data/merged_header_report.xlsx", skiprows=1)
df.columns = ["Region", "Product", "Q1_Revenue", "Q2_Revenue", "Q3_Revenue"]

# Or use a forward-fill approach on the header row
df_raw = pd.read_excel("data/merged_header_report.xlsx", header=None, nrows=2)
headers = df_raw.iloc[0].fillna(method="ffill")  # Forward-fill the merged cells
print(headers.tolist())

Problem 6: Large Excel Files Running Slowly

Excel files with hundreds of thousands of rows load slowly because Excel format is not designed for performance. Solutions:

Python
# Option 1: Read only the columns you need
df = pd.read_excel("data/large_report.xlsx", usecols=["date", "revenue", "region"])

# Option 2: Convert to CSV or Parquet first (much faster for large data)
# If you have to work with a huge Excel file repeatedly:
import pandas as pd

# Do this once
df = pd.read_excel("data/massive_file.xlsx")
df.to_parquet("data/massive_file.parquet")  # Save as Parquet

# All subsequent reads are 10-100x faster
df = pd.read_parquet("data/massive_file.parquet")

# Option 3: Use read_only mode in openpyxl for very large files
from openpyxl import load_workbook

wb = load_workbook("data/massive_file.xlsx", read_only=True)
ws = wb.active

# Iterate rows without loading whole file into memory
for row in ws.iter_rows(values_only=True):
    process_row(row)

wb.close()

Working with Excel Tables (ListObjects)

Excel has a native “Table” feature (Insert → Table) that creates structured table objects with named ranges and automatic filtering. openpyxl can work with these:

Python
from openpyxl import load_workbook
from openpyxl.worksheet.table import Table, TableStyleInfo

# Read data from a named Excel table
wb = load_workbook("data/workbook_with_tables.xlsx")
ws = wb.active

# List all tables in the sheet
for table_name, table in ws.tables.items():
    print(f"Table: {table_name}, Range: {table.ref}")

# Create an Excel Table when writing
with pd.ExcelWriter("output/table_report.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sales", index=False)

# Open the file to add Table formatting
wb = load_workbook("output/table_report.xlsx")
ws = wb["Sales"]

# Add an Excel Table over the data range
tab = Table(
    displayName="SalesTable",
    ref=f"A1:{get_column_letter(ws.max_column)}{ws.max_row}"
)

# Apply a built-in table style
style = TableStyleInfo(
    name="TableStyleMedium9",  # Blue medium style
    showFirstColumn=False,
    showLastColumn=False,
    showRowStripes=True,
    showColumnStripes=False
)
tab.tableStyleInfo = style
ws.add_table(tab)
ws.freeze_panes = "A2"

wb.save("output/table_report.xlsx")

A Complete Pipeline Example

A realistic end-to-end example: read messy Excel input, clean it, analyze it, and write formatted output:

Python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.utils import get_column_letter
from pathlib import Path

def process_sales_excel(input_path: str, output_path: str) -> dict:
    """
    Complete pipeline: read → clean → analyze → write formatted output.

    Reads a messy monthly sales Excel file (title row, merged regions,
    multiple sheets), produces a clean summary workbook.
    """

    # ── Step 1: Read all sheets ────────────────────────────────────
    print(f"Reading {input_path}...")
    all_sheets = pd.read_excel(input_path, sheet_name=None)
    print(f"Sheets found: {list(all_sheets.keys())}")

    # ── Step 2: Combine monthly sheets ────────────────────────────
    monthly_dfs = []
    for sheet_name, df in all_sheets.items():
        if df.empty or sheet_name.lower() == "summary":
            continue
        # Each sheet has 2 title rows — skip them
        df = pd.read_excel(input_path, sheet_name=sheet_name,
                           skiprows=2, dtype={"product_id": str})
        df["month"] = sheet_name
        monthly_dfs.append(df)

    combined = pd.concat(monthly_dfs, ignore_index=True)
    print(f"Combined: {combined.shape}")

    # ── Step 3: Clean ──────────────────────────────────────────────
    combined.columns = combined.columns.str.strip().str.lower().str.replace(" ", "_")
    combined = combined.dropna(subset=["product_id", "revenue"])
    combined["revenue"] = pd.to_numeric(combined["revenue"], errors="coerce")
    combined["units"]   = pd.to_numeric(combined["units"],   errors="coerce")
    combined["date"]    = pd.to_datetime(combined["date"],   errors="coerce")

    # ── Step 4: Analyze ────────────────────────────────────────────
    by_region = (
        combined
        .groupby("region")
        .agg(
            total_revenue=("revenue", "sum"),
            total_units=("units", "sum"),
            avg_order=("revenue", "mean"),
            num_transactions=("product_id", "count")
        )
        .round(2)
        .sort_values("total_revenue", ascending=False)
        .reset_index()
    )

    by_product = (
        combined
        .groupby(["product_id", "product_name"])
        .agg(
            total_revenue=("revenue", "sum"),
            total_units=("units", "sum")
        )
        .round(2)
        .sort_values("total_revenue", ascending=False)
        .reset_index()
    )

    # ── Step 5: Write formatted output ────────────────────────────
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        combined.to_excel(writer,    sheet_name="Raw Data",        index=False)
        by_region.to_excel(writer,   sheet_name="By Region",       index=False)
        by_product.to_excel(writer,  sheet_name="By Product",      index=False)

    # Apply formatting
    wb = load_workbook(output_path)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        # Style header row
        for cell in ws[1]:
            cell.font  = Font(bold=True, color="FFFFFF")
            cell.fill  = PatternFill("solid", start_color="2F75B6")
            cell.alignment = Alignment(horizontal="center")
        # Freeze header
        ws.freeze_panes = "A2"
        # Auto-size columns
        for col in ws.columns:
            col_letter = get_column_letter(col[0].column)
            max_len = max((len(str(c.value)) for c in col if c.value), default=8)
            ws.column_dimensions[col_letter].width = min(max_len + 2, 40)

    wb.save(output_path)
    print(f"Output written: {output_path}")

    return {
        "total_revenue":    combined["revenue"].sum(),
        "total_units":      combined["units"].sum(),
        "num_transactions": len(combined),
        "num_regions":      combined["region"].nunique()
    }

# Run the pipeline
# metrics = process_sales_excel(
#     "data/raw/sales_2024.xlsx",
#     "output/sales_analysis_2024.xlsx"
# )
# print(f"Total revenue: ${metrics['total_revenue']:,.2f}")

Summary

Excel file handling in pandas combines the powerful data manipulation capabilities of pandas with openpyxl’s Excel-specific features. pd.read_excel() handles the full range of real-world Excel files — from clean single-table files to messy multi-header, merged-cell, multi-table workbooks — through its extensive parameter set. pd.ExcelWriter writes clean, multi-sheet workbooks, and openpyxl adds the professional formatting layer (styled headers, auto-sized columns, freeze panes, number formats) that makes output presentable to non-technical stakeholders.

The practical pattern for most data science Excel work is: read with pd.read_excel() using the appropriate parameters for the file’s structure, clean and transform in pandas, write with ExcelWriter, and optionally apply formatting with openpyxl. The common problems — mixed types, Excel serial dates, formula cells, merged headers — each have straightforward fixes that become automatic with experience.

Key Takeaways

  • pd.read_excel("file.xlsx") reads the first sheet; use sheet_name="Name" for a specific sheet, or sheet_name=None to read all sheets as a dict of DataFrames
  • Always specify dtype={"id": str} for columns like customer IDs, ZIP codes, and phone numbers — without it, pandas may infer them as integers, truncating leading zeros and breaking joins
  • skiprows, header, usecols, nrows, and skipfooter handle virtually any non-standard Excel layout — inspect the file with nrows=10, header=None first to understand the structure
  • Forward-fill (df["col"].ffill()) fixes NaN values created by merged cells that pandas unmerges on read
  • pd.ExcelWriter with engine="openpyxl" writes multiple DataFrames to separate sheets; use mode="a" to append new sheets to an existing file without disturbing existing ones
  • openpyxl post-processing adds professional formatting: Font, PatternFill, Alignment for header styling; freeze_panes to keep headers visible while scrolling; auto-sized column widths; and number formats for currency, percentages, and dates
  • load_workbook("file.xlsx", data_only=True) reads cached formula results rather than formula strings — use this when you need the calculated values from a formula-heavy workbook
  • For large Excel files, convert to Parquet once (df.to_parquet()) and read Parquet for all subsequent operations — the speed difference is 10–100×
Share:
Subscribe
Notify of
0 Comments
Inline Feedbacks
View all comments

Discover More

Introduction to Model Evaluation Metrics

Master machine learning evaluation metrics including accuracy, precision, recall, F1-score, ROC-AUC, RMSE, and more with…

Probability Theory Fundamentals for Machine Learning

Master probability theory fundamentals essential for machine learning. Learn probability distributions, conditional probability, Bayes’ theorem,…

Parallel Circuits Demystified: Creating Multiple Paths for Current

Master parallel circuits with this comprehensive guide. Learn how components connect in parallel, calculate current…

Samsung Announces Massive AI Expansion Targeting 800 Million Mobile Devices in 2026

Samsung announces aggressive AI strategy to double Galaxy AI-enabled devices to 800 million by 2026.…

JavaScript Functions: Declaration, Invocation and Parameters

Learn about JavaScript functions, including declarations, invocations, parameters, and handling asynchronous tasks with callbacks, promises…

What is Continuity Testing and Why is it Your Best Debugging Friend?

Learn what continuity testing is, how to use it with a multimeter, and why it’s…

Click For More
0
Would love your thoughts, please comment.x
()
x