Working with Excel Files in Pandas

Learn to read and write Excel files in pandas. Master read_excel, multiple sheets, ExcelWriter, formatting with openpyxl, common problems, and best practices for data science.

Working with Excel Files in Pandas

pandas reads Excel files with pd.read_excel("file.xlsx") and writes them with df.to_excel("output.xlsx", index=False). For multiple sheets, sheet_name=None returns all sheets as a dictionary of DataFrames, and pd.ExcelWriter writes multiple DataFrames to separate sheets in a single file. The openpyxl engine (for .xlsx) and xlrd engine (for older .xls files) power Excel I/O in pandas, with openpyxl also enabling cell-level formatting, formulas, and styling beyond what pandas alone provides.

Introduction

Excel is the world’s most widely used data tool. Virtually every organization stores and shares data in Excel files — financial reports, survey results, sales records, experiment logs, client lists, configuration tables. As a data scientist, you’ll encounter Excel files constantly: receiving them from colleagues, clients, and data providers; needing to clean and analyze their contents; and often expected to deliver your results back in Excel format.

pandas has robust Excel support built in. Whether you’re reading a single clean table, extracting data from one of a dozen sheets, dealing with headers buried three rows down, or writing a polished multi-sheet report with formatted summaries, pandas — and its underlying library openpyxl — handles it all.

This article covers the complete spectrum of Excel work in pandas: reading single and multiple sheets with all the important options, handling messy real-world Excel files with non-standard layouts, writing DataFrames back to Excel with precise control, creating multi-sheet workbooks, adding basic formatting, and navigating the common problems that arise when working with Excel files programmatically.

Setting Up: Required Libraries

Python
pip install pandas openpyxl xlrd
  • pandas: The primary interface for reading and writing Excel files
  • openpyxl: The engine pandas uses for .xlsx files (Excel 2007 and newer) — also used directly for advanced formatting
  • xlrd: Required for reading older .xls format files (Excel 97-2003)

Check your versions:

Python
import pandas as pd
import openpyxl

print(f"pandas: {pd.__version__}")
print(f"openpyxl: {openpyxl.__version__}")

Reading Excel Files: pd.read_excel()

The Basic Pattern

Python
import pandas as pd

# Read the first (default) sheet of an Excel file
df = pd.read_excel("data/sales_report.xlsx")

print(df.shape)
print(df.dtypes)
print(df.head())

pd.read_excel() infers data types automatically — numbers become int64 or float64, text becomes object, and dates become datetime64[ns]. Always inspect df.dtypes after reading to verify types were inferred correctly.

Specifying the Sheet

Python
# By sheet name (string)
df = pd.read_excel("data/workbook.xlsx", sheet_name="Q3 Sales")

# By index (0-based integer — 0 is the first sheet)
df = pd.read_excel("data/workbook.xlsx", sheet_name=0)   # First sheet
df = pd.read_excel("data/workbook.xlsx", sheet_name=2)   # Third sheet

# Read ALL sheets — returns dict of {sheet_name: DataFrame}
all_sheets = pd.read_excel("data/workbook.xlsx", sheet_name=None)
print(list(all_sheets.keys()))  # ['Q1 Sales', 'Q2 Sales', 'Q3 Sales', 'Summary']

# Access individual sheets
q3_df = all_sheets["Q3 Sales"]

# Read specific named sheets as a dict
selected = pd.read_excel(
    "data/workbook.xlsx",
    sheet_name=["Q1 Sales", "Q3 Sales"]
)

Controlling Which Rows and Columns to Read

Real Excel files are rarely perfectly structured. Headers might start on row 3, footnotes at the bottom need skipping, and only a subset of columns are relevant:

Python
# Skip the first 2 rows (e.g., a title row and blank row before the header)
df = pd.read_excel("data/report.xlsx", skiprows=2)

# Skip specific rows by index (0-based)
df = pd.read_excel("data/report.xlsx", skiprows=[0, 1, 3])  # Skip rows 0, 1, 3

# Header is on row 3 (0-indexed row 2)
df = pd.read_excel("data/report.xlsx", header=2)

# Skip N rows from the bottom (footnotes, totals, etc.)
df = pd.read_excel("data/report.xlsx", skipfooter=3)

# Read only the first N rows of data (after header)
df = pd.read_excel("data/report.xlsx", nrows=100)

# Read specific columns by letter range or column index
df = pd.read_excel("data/report.xlsx", usecols="A:E")       # Columns A through E
df = pd.read_excel("data/report.xlsx", usecols="A,C,E:G")   # A, C, and E through G
df = pd.read_excel("data/report.xlsx", usecols=[0, 2, 4])   # Columns by index
df = pd.read_excel("data/report.xlsx", usecols=["Name", "Revenue", "Region"])  # By name

Controlling Data Types and Parsing

Python
# Explicitly specify column data types to avoid misinterpretation
df = pd.read_excel(
    "data/customers.xlsx",
    dtype={
        "customer_id":  str,     # Keep as string, not integer (avoids: '00123' → 123)
        "phone":        str,     # Phone numbers as strings
        "zip_code":     str,     # ZIP codes as strings
        "amount":       float,
        "is_active":    bool
    }
)

# Parse date columns automatically
df = pd.read_excel(
    "data/orders.xlsx",
    parse_dates=["order_date", "delivery_date"]
)

# Handle Excel's numeric date storage
# Excel stores dates as integers (days since Jan 1, 1900)
# pandas usually handles this automatically, but sometimes needs help:
df = pd.read_excel(
    "data/legacy_export.xlsx",
    parse_dates=["date_column"],
    date_format="%Y-%m-%d"     # Specify format if needed
)

Handling Multiple Header Rows (MultiIndex)

Some Excel files use two or more header rows to create hierarchical column labels:

Plaintext
              Jan 2024          Feb 2024
Region   Revenue   Units    Revenue   Units
East      45000    142       48000    158
West      38000    119       41000    126
Python
# header=[0, 1] tells pandas to use the first two rows as a multi-level header
df = pd.read_excel("data/multi_header_report.xlsx", header=[0, 1])

print(df.columns)
# MultiIndex([('Jan 2024', 'Revenue'), ('Jan 2024', 'Units'),
#             ('Feb 2024', 'Revenue'), ('Feb 2024', 'Units')],)

# Access columns with tuple notation
print(df[("Jan 2024", "Revenue")])

# Flatten multi-level headers to single-level
df.columns = ["_".join(col).strip() for col in df.columns]
print(df.columns.tolist())
# ['Jan 2024_Revenue', 'Jan 2024_Units', 'Feb 2024_Revenue', 'Feb 2024_Units']

Handling Missing Values

Excel files use various representations for missing data — blank cells, the string “N/A”, “n/a”, “NULL”, “-“, etc. Tell pandas to treat them all as NaN:

Python
df = pd.read_excel(
    "data/survey_results.xlsx",
    na_values=["N/A", "n/a", "NULL", "null", "-", "--", "none", "None", "NA", ""]
)

# Check result
print(df.isnull().sum())

A Complete read_excel() Parameter Reference

The most useful pd.read_excel() parameters in a single view:

ParameterTypeDescriptionExample
iostr/PathFile path or URL"data/sales.xlsx"
sheet_namestr/int/list/NoneSheet to read"Sales", 0, None (all)
headerint/listRow(s) for column names0 (default), 2, [0,1]
skiprowsint/listRows to skip before header2, [0, 1, 3]
skipfooterintRows to skip at bottom3
nrowsintMax rows to read1000
usecolsstr/listColumns to read"A:E", ["Name","Rev"]
dtypedictColumn data types{"id": str, "amt": float}
parse_dateslistColumns to parse as dates["order_date"]
na_valueslist/strAdditional NA strings["N/A", "-", "NULL"]
index_colint/strColumn to use as row index0, "customer_id"
enginestrParsing engine to use"openpyxl" (xlsx), "xlrd" (xls)

Reading Real-World Messy Excel Files

Clean Excel files with one table starting at row 1 are the exception. Real files are messy. Here are patterns for common situations.

Pattern 1: Title and Metadata Before the Data

Many reports have a company logo, report title, generation date, and other metadata in the first several rows:

Plaintext
Row 1: Acme Corp — Sales Report
Row 2: Generated: 2024-09-15
Row 3: (blank)
Row 4: Region | Product | Q1 Revenue | Q2 Revenue | ...
Row 5: East   | Widget A | 45000      | 48000      | ...
Python
# Preview to find the header row
df_raw = pd.read_excel("data/sales_report.xlsx", header=None, nrows=10)
print(df_raw)

# Skip the 3 title/metadata rows, use row 4 (index 3) as header
df = pd.read_excel("data/sales_report.xlsx", skiprows=3)
# pandas uses row 4 as the header automatically after skipping 3 rows

Pattern 2: Multiple Tables on One Sheet

Sometimes analysts put multiple separate tables on a single sheet:

Plaintext
Row 1-6:   Summary table
Row 7:     (blank)
Row 8-20:  Detail table
Python
# Read each table separately by specifying row ranges
summary_df = pd.read_excel("data/report.xlsx", nrows=5)           # Rows 2-6 (5 rows)
detail_df   = pd.read_excel("data/report.xlsx", skiprows=7)       # Rows 8 onward

# Or read the full sheet and split by blank rows
df_full = pd.read_excel("data/report.xlsx", header=None)
blank_rows = df_full[df_full.isnull().all(axis=1)].index.tolist()
print(f"Blank rows (separators): {blank_rows}")  # [6]

table1 = df_full.iloc[:blank_rows[0]].dropna(how="all")
table2 = df_full.iloc[blank_rows[0]+1:].dropna(how="all")
# Set first row as header for each table
table1.columns = table1.iloc[0]
table1 = table1.iloc[1:].reset_index(drop=True)

Pattern 3: Merged Cells

Merged cells are a common pain point — pandas unmerges them, filling continuation rows with NaN:

Plaintext
Excel display:          What pandas reads:
Region  | Sales         Region  | Sales
East    | 100           East    | 100
        | 200           NaN     | 200   ← NaN where merged cell continues
        | 150           NaN     | 150
West    | 80            West    | 80
        | 90            NaN     | 90
Python
df = pd.read_excel("data/merged_regions.xlsx")
print(df)
# Region  Sales
# East    100
# NaN     200
# NaN     150
# West    80
# NaN     90

# Fix: forward-fill the NaN values created by merged cells
df["Region"] = df["Region"].fillna(method="ffill")
# Or equivalently:
df["Region"] = df["Region"].ffill()

print(df)
# Region  Sales
# East    100
# East    200
# East    150
# West    80
# West    90

Pattern 4: Reading All Sheets and Combining

When a workbook has one sheet per time period (monthly, quarterly, yearly):

Python
import pandas as pd

def read_and_combine_sheets(filepath: str,
                            add_sheet_column: bool = True,
                            sheet_column_name: str = "sheet") -> pd.DataFrame:
    """
    Read all sheets from an Excel file and combine into one DataFrame.

    Parameters
    ----------
    filepath : str
        Path to the Excel workbook.
    add_sheet_column : bool
        If True, add a column with the originating sheet name.
    sheet_column_name : str
        Name for the sheet name column.

    Returns
    -------
    pd.DataFrame
        All sheets combined vertically.
    """
    all_sheets = pd.read_excel(filepath, sheet_name=None)
    dfs = []

    for sheet_name, df in all_sheets.items():
        if df.empty:
            continue
        if add_sheet_column:
            df[sheet_column_name] = sheet_name
        dfs.append(df)

    if not dfs:
        raise ValueError("No non-empty sheets found")

    combined = pd.concat(dfs, ignore_index=True)
    print(f"Combined {len(dfs)} sheets: {combined.shape}")
    return combined

# Combine all monthly sales sheets into one DataFrame
df_all = read_and_combine_sheets("data/monthly_sales_2024.xlsx",
                                  sheet_column_name="month")
print(df_all["month"].value_counts())

Writing Excel Files: to_excel() and ExcelWriter

Basic Writing

Python
import pandas as pd

df = pd.DataFrame({
    "customer_id":  ["CUST_001", "CUST_002", "CUST_003"],
    "name":         ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "total_spend":  [384.96, 139.98, 269.97],
    "signup_date":  pd.to_datetime(["2022-03-14", "2021-11-28", "2023-01-05"]),
    "is_premium":   [True, False, True]
})

# Simple write — always set index=False to avoid writing the row numbers as a column
df.to_excel("output/customers.xlsx", index=False)

# Write to a specific sheet with a custom name
df.to_excel(
    "output/customers.xlsx",
    sheet_name="Customer Data",
    index=False
)

# Freeze the header row (stays visible when scrolling)
# This requires ExcelWriter — see next section

Writing Multiple Sheets with ExcelWriter

pd.ExcelWriter is the tool for writing multiple DataFrames to separate sheets in a single Excel file:

Python
import pandas as pd

# Sample DataFrames
summary_df = pd.DataFrame({
    "metric": ["Total Customers", "Total Revenue", "Avg Order Value", "Premium %"],
    "value":  [8, 1029.84, 102.98, 50.0]
})

customers_df = pd.DataFrame({
    "customer_id": ["CUST_001", "CUST_002", "CUST_003"],
    "name":        ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "city":        ["Austin", "Seattle", "Chicago"],
    "total_spend": [384.96, 139.98, 269.97]
})

orders_df = pd.DataFrame({
    "order_id":   ["ORD_1001", "ORD_1002", "ORD_1003"],
    "customer_id":["CUST_001", "CUST_002", "CUST_001"],
    "amount":     [199.98, 49.99, 149.99],
    "status":     ["delivered", "delivered", "delivered"]
})

# Write all three DataFrames to separate sheets
with pd.ExcelWriter("output/full_report.xlsx", engine="openpyxl") as writer:
    summary_df.to_excel(writer, sheet_name="Summary",   index=False)
    customers_df.to_excel(writer, sheet_name="Customers", index=False)
    orders_df.to_excel(writer, sheet_name="Orders",     index=False)

print("Multi-sheet workbook written successfully.")

The with block ensures the file is properly saved and closed even if an error occurs during writing.

Appending to an Existing Excel File

Python
# Open an existing file and add a new sheet without touching existing sheets
with pd.ExcelWriter(
    "output/full_report.xlsx",
    engine="openpyxl",
    mode="a"                    # "a" = append mode (don't overwrite)
) as writer:
    new_predictions_df.to_excel(writer, sheet_name="Churn Predictions", index=False)

print("New sheet added to existing workbook.")

Important: If you open an existing Excel file with mode="a" and write to a sheet name that already exists, it raises an error. Use if_sheet_exists="replace" to overwrite:

Python
with pd.ExcelWriter(
    "output/report.xlsx",
    engine="openpyxl",
    mode="a",
    if_sheet_exists="replace"  # Overwrite if sheet already exists
) as writer:
    updated_summary.to_excel(writer, sheet_name="Summary", index=False)

Writing to a Specific Position (startrow, startcol)

Write a DataFrame at a specific row and column offset — useful when adding a title row or multiple tables on one sheet:

Python
with pd.ExcelWriter("output/positioned_report.xlsx", engine="openpyxl") as writer:
    # Write a title manually (we'll add it via openpyxl formatting)
    # Leave row 0 for the title, start data at row 1 (Excel row 2)
    df.to_excel(
        writer,
        sheet_name="Report",
        startrow=1,      # Start data at row index 1 (Excel row 2)
        startcol=0,
        index=False
    )

    # Write a second table to the right of the first
    summary_df.to_excel(
        writer,
        sheet_name="Report",
        startrow=1,
        startcol=df.shape[1] + 2,   # Two columns gap to the right
        index=False
    )

Formatting Excel Output with openpyxl

pd.to_excel() writes clean data but produces plain, unformatted spreadsheets. For professional-looking output with headers styled, columns auto-sized, and number formats applied, use openpyxl directly after writing with ExcelWriter.

Accessing the Workbook After Writing

Python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

# Step 1: Write data with pandas
df.to_excel("output/formatted_report.xlsx", sheet_name="Data", index=False)

# Step 2: Open with openpyxl for formatting
wb = load_workbook("output/formatted_report.xlsx")
ws = wb["Data"]

Styling Headers

Python
# Style the header row
header_font = Font(bold=True, color="FFFFFF", size=11)
header_fill = PatternFill(fill_type="solid", start_color="2F75B6")  # Blue background
header_alignment = Alignment(horizontal="center", vertical="center")

for cell in ws[1]:   # Row 1 is the header row (1-indexed in openpyxl)
    cell.font = header_font
    cell.fill = header_fill
    cell.alignment = header_alignment

# Set header row height
ws.row_dimensions[1].height = 20

Auto-Sizing Column Widths

openpyxl doesn’t auto-fit columns natively, but this function calculates appropriate widths:

Python
def auto_size_columns(worksheet, min_width: int = 10, max_width: int = 40):
    """
    Auto-size all columns in a worksheet based on cell content length.
    """
    for column_cells in worksheet.columns:
        max_length = 0
        col_letter = get_column_letter(column_cells[0].column)

        for cell in column_cells:
            if cell.value is not None:
                try:
                    cell_length = len(str(cell.value))
                    max_length = max(max_length, cell_length)
                except Exception:
                    pass

        adjusted_width = max(min_width, min(max_length + 2, max_width))
        worksheet.column_dimensions[col_letter].width = adjusted_width

auto_size_columns(ws)

Number Formatting

Python
# Apply number formats to specific columns
# openpyxl uses Excel's built-in number format codes

for row in ws.iter_rows(min_row=2, min_col=3, max_col=3):  # Column C, all data rows
    for cell in row:
        cell.number_format = "$#,##0.00"    # Currency with 2 decimal places

for row in ws.iter_rows(min_row=2, min_col=4, max_col=4):  # Column D — percentages
    for cell in row:
        cell.number_format = "0.0%"

for row in ws.iter_rows(min_row=2, min_col=5, max_col=5):  # Column E — dates
    for cell in row:
        cell.number_format = "YYYY-MM-DD"

Adding Freeze Panes

Freeze the header row (and optionally first column) so they stay visible when scrolling:

Python
# Freeze row 1 (header stays visible when scrolling down)
ws.freeze_panes = "A2"

# Freeze both header row AND first column
ws.freeze_panes = "B2"

# Freeze only the first column
ws.freeze_panes = "B1"

Adding Alternating Row Colors (Zebra Stripes)

Python
from openpyxl.styles import PatternFill

light_fill = PatternFill(fill_type="solid", start_color="EBF5FB")  # Light blue
white_fill = PatternFill(fill_type="solid", start_color="FFFFFF")  # White

for row_idx in range(2, ws.max_row + 1):   # Start from row 2 (skip header)
    fill = light_fill if row_idx % 2 == 0 else white_fill
    for cell in ws[row_idx]:
        cell.fill = fill

Complete Formatting Example

Python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

def write_formatted_excel(df: pd.DataFrame, filepath: str,
                           sheet_name: str = "Data",
                           title: str = None) -> None:
    """
    Write a DataFrame to a professionally formatted Excel file.

    Parameters
    ----------
    df : pd.DataFrame
        Data to write.
    filepath : str
        Output file path.
    sheet_name : str
        Name for the sheet.
    title : str, optional
        Title to display above the table.
    """
    # Write data with pandas
    start_row = 2 if title else 0   # Leave row 1 for title if provided
    df.to_excel(filepath, sheet_name=sheet_name,
                startrow=start_row, index=False)

    # Open for formatting
    wb = load_workbook(filepath)
    ws = wb[sheet_name]
    header_row = start_row + 1   # Excel rows are 1-indexed

    # Add title if provided
    if title:
        ws["A1"] = title
        ws["A1"].font = Font(bold=True, size=14)
        ws["A1"].alignment = Alignment(horizontal="left")
        ws.row_dimensions[1].height = 22

    # Style header row
    header_font  = Font(bold=True, color="FFFFFF", size=11)
    header_fill  = PatternFill(fill_type="solid", start_color="2F75B6")
    header_align = Alignment(horizontal="center", vertical="center")

    for cell in ws[header_row]:
        cell.font      = header_font
        cell.fill      = header_fill
        cell.alignment = header_align
    ws.row_dimensions[header_row].height = 18

    # Alternating row colors for data rows
    light_fill = PatternFill(fill_type="solid", start_color="DEEAF1")
    for row_idx in range(header_row + 1, ws.max_row + 1):
        fill = light_fill if (row_idx - header_row) % 2 == 0 else None
        if fill:
            for cell in ws[row_idx]:
                cell.fill = fill

    # Freeze the header row
    ws.freeze_panes = f"A{header_row + 1}"

    # Auto-size columns
    for column_cells in ws.columns:
        max_len = 0
        col_letter = get_column_letter(column_cells[0].column)
        for cell in column_cells:
            if cell.value:
                max_len = max(max_len, len(str(cell.value)))
        ws.column_dimensions[col_letter].width = min(max(max_len + 2, 10), 40)

    wb.save(filepath)
    print(f"Formatted Excel written: {filepath}")


# Usage
df_customers = pd.DataFrame({
    "Customer ID":  ["CUST_001", "CUST_002", "CUST_003"],
    "Name":         ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "City":         ["Austin", "Seattle", "Chicago"],
    "Total Spend":  [384.96, 139.98, 269.97],
    "Premium":      [True, False, True],
    "Last Order":   pd.to_datetime(["2024-09-10", "2024-09-01", "2024-09-15"])
})

write_formatted_excel(
    df_customers,
    "output/customer_report_formatted.xlsx",
    title="Customer Summary — September 2024"
)

Common Problems and Solutions

Problem 1: Mixed Data Types in a Column

Excel allows mixed types within a column (some cells text, others numbers). pandas infers a type per column, which can cause unexpected results:

Python
# Column "zip_code" inferred as int64 — drops leading zeros
# CUST_001: 02134 → 2134 (Boston ZIP incorrectly truncated)

# Fix: Force string type for columns that look numeric but aren't
df = pd.read_excel("data/customers.xlsx", dtype={"zip_code": str})

# Result: "02134" preserved as string

Problem 2: Dates Stored as Numbers

Excel sometimes exports dates as serial numbers (days since 1900-01-01) rather than date strings:

Python
# Column reads as 45183 instead of 2023-09-15

# Fix 1: Parse dates automatically
df = pd.read_excel("data/orders.xlsx", parse_dates=["order_date"])

# Fix 2: If auto-parsing fails, convert the Excel date serial manually
from datetime import datetime, timedelta

def excel_date_to_datetime(excel_serial: int) -> datetime:
    """Convert Excel date serial number to Python datetime."""
    # Excel serial 1 = Jan 1, 1900; but Excel incorrectly treats 1900 as a leap year
    return datetime(1899, 12, 30) + timedelta(days=int(excel_serial))

df["order_date"] = df["order_date"].apply(excel_date_to_datetime)

Problem 3: Formula Cells Return NaN

When an Excel file contains formulas, openpyxl reads the cached formula string, not the calculated value, when opening normally:

Python
from openpyxl import load_workbook

# Reading a file with formulas
wb = load_workbook("data/formula_report.xlsx")
ws = wb.active
print(ws["B5"].value)  # Returns "=SUM(B2:B4)" — the formula string!

# Fix: Use data_only=True to read the cached calculated values
wb = load_workbook("data/formula_report.xlsx", data_only=True)
ws = wb.active
print(ws["B5"].value)  # Returns 157.98 — the cached calculated value
# Note: cached values are from the last time the file was opened in Excel

For data science purposes, use data_only=True when you want the calculated values rather than the formulas.

Problem 4: File Already Open in Excel (Windows)

On Windows, if the Excel file is open in the Excel application, Python will fail to write to it:

Python
try:
    df.to_excel("output/report.xlsx", index=False)
except PermissionError:
    print("ERROR: The file is open in Excel. Close it and try again.")
    # Or write to a differently named file:
    df.to_excel("output/report_new.xlsx", index=False)

Problem 5: Unnamed Columns from Merged Headers

When a workbook has merged header cells, the merged region typically shows the value in the first cell only, leaving the subsequent cells empty — which pandas reads as Unnamed: 1, Unnamed: 2, etc.:

Python
# Read the file first to see the actual column names
df_check = pd.read_excel("data/merged_header_report.xlsx", nrows=2)
print(df_check.columns.tolist())
# ['Region', 'Unnamed: 1', 'Q1 Revenue', 'Q2 Revenue', ...]

# Rename columns manually after reading
df = pd.read_excel("data/merged_header_report.xlsx", skiprows=1)
df.columns = ["Region", "Product", "Q1_Revenue", "Q2_Revenue", "Q3_Revenue"]

# Or use a forward-fill approach on the header row
df_raw = pd.read_excel("data/merged_header_report.xlsx", header=None, nrows=2)
headers = df_raw.iloc[0].fillna(method="ffill")  # Forward-fill the merged cells
print(headers.tolist())

Problem 6: Large Excel Files Running Slowly

Excel files with hundreds of thousands of rows load slowly because Excel format is not designed for performance. Solutions:

Python
# Option 1: Read only the columns you need
df = pd.read_excel("data/large_report.xlsx", usecols=["date", "revenue", "region"])

# Option 2: Convert to CSV or Parquet first (much faster for large data)
# If you have to work with a huge Excel file repeatedly:
import pandas as pd

# Do this once
df = pd.read_excel("data/massive_file.xlsx")
df.to_parquet("data/massive_file.parquet")  # Save as Parquet

# All subsequent reads are 10-100x faster
df = pd.read_parquet("data/massive_file.parquet")

# Option 3: Use read_only mode in openpyxl for very large files
from openpyxl import load_workbook

wb = load_workbook("data/massive_file.xlsx", read_only=True)
ws = wb.active

# Iterate rows without loading whole file into memory
for row in ws.iter_rows(values_only=True):
    process_row(row)

wb.close()

Working with Excel Tables (ListObjects)

Excel has a native “Table” feature (Insert → Table) that creates structured table objects with named ranges and automatic filtering. openpyxl can work with these:

Python
from openpyxl import load_workbook
from openpyxl.worksheet.table import Table, TableStyleInfo

# Read data from a named Excel table
wb = load_workbook("data/workbook_with_tables.xlsx")
ws = wb.active

# List all tables in the sheet
for table_name, table in ws.tables.items():
    print(f"Table: {table_name}, Range: {table.ref}")

# Create an Excel Table when writing
with pd.ExcelWriter("output/table_report.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sales", index=False)

# Open the file to add Table formatting
wb = load_workbook("output/table_report.xlsx")
ws = wb["Sales"]

# Add an Excel Table over the data range
tab = Table(
    displayName="SalesTable",
    ref=f"A1:{get_column_letter(ws.max_column)}{ws.max_row}"
)

# Apply a built-in table style
style = TableStyleInfo(
    name="TableStyleMedium9",  # Blue medium style
    showFirstColumn=False,
    showLastColumn=False,
    showRowStripes=True,
    showColumnStripes=False
)
tab.tableStyleInfo = style
ws.add_table(tab)
ws.freeze_panes = "A2"

wb.save("output/table_report.xlsx")

A Complete Pipeline Example

A realistic end-to-end example: read messy Excel input, clean it, analyze it, and write formatted output:

Python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.utils import get_column_letter
from pathlib import Path

def process_sales_excel(input_path: str, output_path: str) -> dict:
    """
    Complete pipeline: read → clean → analyze → write formatted output.

    Reads a messy monthly sales Excel file (title row, merged regions,
    multiple sheets), produces a clean summary workbook.
    """

    # ── Step 1: Read all sheets ────────────────────────────────────
    print(f"Reading {input_path}...")
    all_sheets = pd.read_excel(input_path, sheet_name=None)
    print(f"Sheets found: {list(all_sheets.keys())}")

    # ── Step 2: Combine monthly sheets ────────────────────────────
    monthly_dfs = []
    for sheet_name, df in all_sheets.items():
        if df.empty or sheet_name.lower() == "summary":
            continue
        # Each sheet has 2 title rows — skip them
        df = pd.read_excel(input_path, sheet_name=sheet_name,
                           skiprows=2, dtype={"product_id": str})
        df["month"] = sheet_name
        monthly_dfs.append(df)

    combined = pd.concat(monthly_dfs, ignore_index=True)
    print(f"Combined: {combined.shape}")

    # ── Step 3: Clean ──────────────────────────────────────────────
    combined.columns = combined.columns.str.strip().str.lower().str.replace(" ", "_")
    combined = combined.dropna(subset=["product_id", "revenue"])
    combined["revenue"] = pd.to_numeric(combined["revenue"], errors="coerce")
    combined["units"]   = pd.to_numeric(combined["units"],   errors="coerce")
    combined["date"]    = pd.to_datetime(combined["date"],   errors="coerce")

    # ── Step 4: Analyze ────────────────────────────────────────────
    by_region = (
        combined
        .groupby("region")
        .agg(
            total_revenue=("revenue", "sum"),
            total_units=("units", "sum"),
            avg_order=("revenue", "mean"),
            num_transactions=("product_id", "count")
        )
        .round(2)
        .sort_values("total_revenue", ascending=False)
        .reset_index()
    )

    by_product = (
        combined
        .groupby(["product_id", "product_name"])
        .agg(
            total_revenue=("revenue", "sum"),
            total_units=("units", "sum")
        )
        .round(2)
        .sort_values("total_revenue", ascending=False)
        .reset_index()
    )

    # ── Step 5: Write formatted output ────────────────────────────
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        combined.to_excel(writer,    sheet_name="Raw Data",        index=False)
        by_region.to_excel(writer,   sheet_name="By Region",       index=False)
        by_product.to_excel(writer,  sheet_name="By Product",      index=False)

    # Apply formatting
    wb = load_workbook(output_path)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        # Style header row
        for cell in ws[1]:
            cell.font  = Font(bold=True, color="FFFFFF")
            cell.fill  = PatternFill("solid", start_color="2F75B6")
            cell.alignment = Alignment(horizontal="center")
        # Freeze header
        ws.freeze_panes = "A2"
        # Auto-size columns
        for col in ws.columns:
            col_letter = get_column_letter(col[0].column)
            max_len = max((len(str(c.value)) for c in col if c.value), default=8)
            ws.column_dimensions[col_letter].width = min(max_len + 2, 40)

    wb.save(output_path)
    print(f"Output written: {output_path}")

    return {
        "total_revenue":    combined["revenue"].sum(),
        "total_units":      combined["units"].sum(),
        "num_transactions": len(combined),
        "num_regions":      combined["region"].nunique()
    }

# Run the pipeline
# metrics = process_sales_excel(
#     "data/raw/sales_2024.xlsx",
#     "output/sales_analysis_2024.xlsx"
# )
# print(f"Total revenue: ${metrics['total_revenue']:,.2f}")

Summary

Excel file handling in pandas combines the powerful data manipulation capabilities of pandas with openpyxl’s Excel-specific features. pd.read_excel() handles the full range of real-world Excel files — from clean single-table files to messy multi-header, merged-cell, multi-table workbooks — through its extensive parameter set. pd.ExcelWriter writes clean, multi-sheet workbooks, and openpyxl adds the professional formatting layer (styled headers, auto-sized columns, freeze panes, number formats) that makes output presentable to non-technical stakeholders.

The practical pattern for most data science Excel work is: read with pd.read_excel() using the appropriate parameters for the file’s structure, clean and transform in pandas, write with ExcelWriter, and optionally apply formatting with openpyxl. The common problems — mixed types, Excel serial dates, formula cells, merged headers — each have straightforward fixes that become automatic with experience.

Key Takeaways

  • pd.read_excel("file.xlsx") reads the first sheet; use sheet_name="Name" for a specific sheet, or sheet_name=None to read all sheets as a dict of DataFrames
  • Always specify dtype={"id": str} for columns like customer IDs, ZIP codes, and phone numbers — without it, pandas may infer them as integers, truncating leading zeros and breaking joins
  • skiprows, header, usecols, nrows, and skipfooter handle virtually any non-standard Excel layout — inspect the file with nrows=10, header=None first to understand the structure
  • Forward-fill (df["col"].ffill()) fixes NaN values created by merged cells that pandas unmerges on read
  • pd.ExcelWriter with engine="openpyxl" writes multiple DataFrames to separate sheets; use mode="a" to append new sheets to an existing file without disturbing existing ones
  • openpyxl post-processing adds professional formatting: Font, PatternFill, Alignment for header styling; freeze_panes to keep headers visible while scrolling; auto-sized column widths; and number formats for currency, percentages, and dates
  • load_workbook("file.xlsx", data_only=True) reads cached formula results rather than formula strings — use this when you need the calculated values from a formula-heavy workbook
  • For large Excel files, convert to Parquet once (df.to_parquet()) and read Parquet for all subsequent operations — the speed difference is 10–100×
Share:
Subscribe
Notify of
0 Comments

Discover More

What Can You Actually Build as a Beginner Roboticist?

Discover realistic robot projects for beginners. From obstacle-avoiding rovers to line followers, learn what you…

SFINAE: Substitution Failure Is Not An Error

SFINAE: Substitution Failure Is Not An Error

Master SFINAE in C++ — learn what substitution failure means, how enable_if works, how to…

PyCharm for Data Science: Configuration and Best Practices

PyCharm for Data Science: Configuration and Best Practices

Learn how to configure PyCharm for data science. Explore Professional vs Community editions, Jupyter support,…

Move Semantics in C++11: Optimizing Performance

Move Semantics in C++11: Optimizing Performance

Master C++11 move semantics — learn move constructors, move assignment operators, std::move, and how to…

Connecting to Databases from Python

Connecting to Databases from Python

Learn how to connect to databases from Python. Master SQLite, PostgreSQL, SQLAlchemy, pandas read_sql, connection…

The Data Science Workflow: From Problem to Solution

Master the data science workflow with this comprehensive guide covering problem definition, data collection, cleaning,…

Click For More
0
Would love your thoughts, please comment.x
()
x