Working with Excel Files in Pandas

Learn to read and write Excel files in pandas. Master read_excel, multiple sheets, ExcelWriter, formatting with openpyxl, common problems, and best practices for data science.

By Techietory on May 24, 2026

pandas reads Excel files with pd.read_excel("file.xlsx") and writes them with df.to_excel("output.xlsx", index=False). For multiple sheets, sheet_name=None returns all sheets as a dictionary of DataFrames, and pd.ExcelWriter writes multiple DataFrames to separate sheets in a single file. The openpyxl engine (for .xlsx) and xlrd engine (for older .xls files) power Excel I/O in pandas, with openpyxl also enabling cell-level formatting, formulas, and styling beyond what pandas alone provides.

Introduction

Excel is the world’s most widely used data tool. Virtually every organization stores and shares data in Excel files — financial reports, survey results, sales records, experiment logs, client lists, configuration tables. As a data scientist, you’ll encounter Excel files constantly: receiving them from colleagues, clients, and data providers; needing to clean and analyze their contents; and often expected to deliver your results back in Excel format.

pandas has robust Excel support built in. Whether you’re reading a single clean table, extracting data from one of a dozen sheets, dealing with headers buried three rows down, or writing a polished multi-sheet report with formatted summaries, pandas — and its underlying library openpyxl — handles it all.

This article covers the complete spectrum of Excel work in pandas: reading single and multiple sheets with all the important options, handling messy real-world Excel files with non-standard layouts, writing DataFrames back to Excel with precise control, creating multi-sheet workbooks, adding basic formatting, and navigating the common problems that arise when working with Excel files programmatically.

Setting Up: Required Libraries

Python

pip install pandas openpyxl xlrd

pip install pandas openpyxl xlrd

pandas: The primary interface for reading and writing Excel files
openpyxl: The engine pandas uses for .xlsx files (Excel 2007 and newer) — also used directly for advanced formatting
xlrd: Required for reading older .xls format files (Excel 97-2003)

Check your versions:

Python

import pandas as pd
import openpyxl

print(f"pandas: {pd.__version__}")
print(f"openpyxl: {openpyxl.__version__}")

import pandas as pd
import openpyxl

print(f"pandas: {pd.__version__}")
print(f"openpyxl: {openpyxl.__version__}")

Reading Excel Files: pd.read_excel()

The Basic Pattern

Python

import pandas as pd

# Read the first (default) sheet of an Excel file
df = pd.read_excel("data/sales_report.xlsx")

print(df.shape)
print(df.dtypes)
print(df.head())

import pandas as pd

# Read the first (default) sheet of an Excel file
df = pd.read_excel("data/sales_report.xlsx")

print(df.shape)
print(df.dtypes)
print(df.head())

pd.read_excel() infers data types automatically — numbers become int64 or float64, text becomes object, and dates become datetime64[ns]. Always inspect df.dtypes after reading to verify types were inferred correctly.

Specifying the Sheet

Python

# By sheet name (string)
df = pd.read_excel("data/workbook.xlsx", sheet_name="Q3 Sales")

# By index (0-based integer — 0 is the first sheet)
df = pd.read_excel("data/workbook.xlsx", sheet_name=0)   # First sheet
df = pd.read_excel("data/workbook.xlsx", sheet_name=2)   # Third sheet

# Read ALL sheets — returns dict of {sheet_name: <a href="https://techietory.com/data-science/what-is-a-dataframe-the-foundation-of-data-science/">DataFrame</a>}
all_sheets = pd.read_excel("data/workbook.xlsx", sheet_name=None)
print(list(all_sheets.keys()))  # ['Q1 Sales', 'Q2 Sales', 'Q3 Sales', 'Summary']

# Access individual sheets
q3_df = all_sheets["Q3 Sales"]

# Read specific named sheets as a dict
selected = pd.read_excel(
    "data/workbook.xlsx",
    sheet_name=["Q1 Sales", "Q3 Sales"]
)

# By sheet name (string)
df = pd.read_excel("data/workbook.xlsx", sheet_name="Q3 Sales")

# By index (0-based integer — 0 is the first sheet)
df = pd.read_excel("data/workbook.xlsx", sheet_name=0)   # First sheet
df = pd.read_excel("data/workbook.xlsx", sheet_name=2)   # Third sheet

# Read ALL sheets — returns dict of {sheet_name: DataFrame}
all_sheets = pd.read_excel("data/workbook.xlsx", sheet_name=None)
print(list(all_sheets.keys()))  # ['Q1 Sales', 'Q2 Sales', 'Q3 Sales', 'Summary']

# Access individual sheets
q3_df = all_sheets["Q3 Sales"]

# Read specific named sheets as a dict
selected = pd.read_excel(
    "data/workbook.xlsx",
    sheet_name=["Q1 Sales", "Q3 Sales"]
)

Controlling Which Rows and Columns to Read

Real Excel files are rarely perfectly structured. Headers might start on row 3, footnotes at the bottom need skipping, and only a subset of columns are relevant:

Python

# Skip the first 2 rows (e.g., a title row and blank row before the header)
df = pd.read_excel("data/report.xlsx", skiprows=2)

# Skip specific rows by index (0-based)
df = pd.read_excel("data/report.xlsx", skiprows=[0, 1, 3])  # Skip rows 0, 1, 3

# Header is on row 3 (0-indexed row 2)
df = pd.read_excel("data/report.xlsx", header=2)

# Skip N rows from the bottom (footnotes, totals, etc.)
df = pd.read_excel("data/report.xlsx", skipfooter=3)

# Read only the first N rows of data (after header)
df = pd.read_excel("data/report.xlsx", nrows=100)

# Read specific columns by letter range or column index
df = pd.read_excel("data/report.xlsx", usecols="A:E")       # Columns A through E
df = pd.read_excel("data/report.xlsx", usecols="A,C,E:G")   # A, C, and E through G
df = pd.read_excel("data/report.xlsx", usecols=[0, 2, 4])   # Columns by index
df = pd.read_excel("data/report.xlsx", usecols=["Name", "Revenue", "Region"])  # By name

# Skip the first 2 rows (e.g., a title row and blank row before the header)
df = pd.read_excel("data/report.xlsx", skiprows=2)

# Skip specific rows by index (0-based)
df = pd.read_excel("data/report.xlsx", skiprows=[0, 1, 3])  # Skip rows 0, 1, 3

# Header is on row 3 (0-indexed row 2)
df = pd.read_excel("data/report.xlsx", header=2)

# Skip N rows from the bottom (footnotes, totals, etc.)
df = pd.read_excel("data/report.xlsx", skipfooter=3)

# Read only the first N rows of data (after header)
df = pd.read_excel("data/report.xlsx", nrows=100)

# Read specific columns by letter range or column index
df = pd.read_excel("data/report.xlsx", usecols="A:E")       # Columns A through E
df = pd.read_excel("data/report.xlsx", usecols="A,C,E:G")   # A, C, and E through G
df = pd.read_excel("data/report.xlsx", usecols=[0, 2, 4])   # Columns by index
df = pd.read_excel("data/report.xlsx", usecols=["Name", "Revenue", "Region"])  # By name

Controlling Data Types and Parsing

Python

# Explicitly specify column data types to avoid misinterpretation
df = pd.read_excel(
    "data/customers.xlsx",
    dtype={
        "customer_id":  str,     # Keep as string, not integer (avoids: '00123' → 123)
        "phone":        str,     # Phone numbers as strings
        "zip_code":     str,     # ZIP codes as strings
        "amount":       float,
        "is_active":    bool
    }
)

# Parse date columns automatically
df = pd.read_excel(
    "data/orders.xlsx",
    parse_dates=["order_date", "delivery_date"]
)

# Handle Excel's numeric date storage
# Excel stores dates as integers (days since Jan 1, 1900)
# pandas usually handles this automatically, but sometimes needs help:
df = pd.read_excel(
    "data/legacy_export.xlsx",
    parse_dates=["date_column"],
    date_format="%Y-%m-%d"     # Specify format if needed
)

# Explicitly specify column data types to avoid misinterpretation
df = pd.read_excel(
    "data/customers.xlsx",
    dtype={
        "customer_id":  str,     # Keep as string, not integer (avoids: '00123' → 123)
        "phone":        str,     # Phone numbers as strings
        "zip_code":     str,     # ZIP codes as strings
        "amount":       float,
        "is_active":    bool
    }
)

# Parse date columns automatically
df = pd.read_excel(
    "data/orders.xlsx",
    parse_dates=["order_date", "delivery_date"]
)

# Handle Excel's numeric date storage
# Excel stores dates as integers (days since Jan 1, 1900)
# pandas usually handles this automatically, but sometimes needs help:
df = pd.read_excel(
    "data/legacy_export.xlsx",
    parse_dates=["date_column"],
    date_format="%Y-%m-%d"     # Specify format if needed
)

Handling Multiple Header Rows (MultiIndex)

Some Excel files use two or more header rows to create hierarchical column labels:

Plaintext

              Jan 2024          Feb 2024
Region   Revenue   Units    Revenue   Units
East      45000    142       48000    158
West      38000    119       41000    126

              Jan 2024          Feb 2024
Region   Revenue   Units    Revenue   Units
East      45000    142       48000    158
West      38000    119       41000    126

Python

# header=[0, 1] tells pandas to use the first two rows as a multi-level header
df = pd.read_excel("data/multi_header_report.xlsx", header=[0, 1])

print(df.columns)
# MultiIndex([('Jan 2024', 'Revenue'), ('Jan 2024', 'Units'),
#             ('Feb 2024', 'Revenue'), ('Feb 2024', 'Units')],)

# Access columns with tuple notation
print(df[("Jan 2024", "Revenue")])

# Flatten multi-level headers to single-level
df.columns = ["_".join(col).strip() for col in df.columns]
print(df.columns.tolist())
# ['Jan 2024_Revenue', 'Jan 2024_Units', 'Feb 2024_Revenue', 'Feb 2024_Units']

# header=[0, 1] tells pandas to use the first two rows as a multi-level header
df = pd.read_excel("data/multi_header_report.xlsx", header=[0, 1])

print(df.columns)
# MultiIndex([('Jan 2024', 'Revenue'), ('Jan 2024', 'Units'),
#             ('Feb 2024', 'Revenue'), ('Feb 2024', 'Units')],)

# Access columns with tuple notation
print(df[("Jan 2024", "Revenue")])

# Flatten multi-level headers to single-level
df.columns = ["_".join(col).strip() for col in df.columns]
print(df.columns.tolist())
# ['Jan 2024_Revenue', 'Jan 2024_Units', 'Feb 2024_Revenue', 'Feb 2024_Units']

Handling Missing Values

Excel files use various representations for missing data — blank cells, the string “N/A”, “n/a”, “NULL”, “-“, etc. Tell pandas to treat them all as NaN:

Python

df = pd.read_excel(
    "data/survey_results.xlsx",
    na_values=["N/A", "n/a", "NULL", "null", "-", "--", "none", "None", "NA", ""]
)

# Check result
print(df.isnull().sum())

df = pd.read_excel(
    "data/survey_results.xlsx",
    na_values=["N/A", "n/a", "NULL", "null", "-", "--", "none", "None", "NA", ""]
)

# Check result
print(df.isnull().sum())

A Complete read_excel() Parameter Reference

The most useful pd.read_excel() parameters in a single view:

Parameter	Type	Description	Example
`io`	str/Path	File path or URL	`"data/sales.xlsx"`
`sheet_name`	str/int/list/None	Sheet to read	`"Sales"`, `0`, `None` (all)
`header`	int/list	Row(s) for column names	`0` (default), `2`, `[0,1]`
`skiprows`	int/list	Rows to skip before header	`2`, `[0, 1, 3]`
`skipfooter`	int	Rows to skip at bottom	`3`
`nrows`	int	Max rows to read	`1000`
`usecols`	str/list	Columns to read	`"A:E"`, `["Name","Rev"]`
`dtype`	dict	Column data types	`{"id": str, "amt": float}`
`parse_dates`	list	Columns to parse as dates	`["order_date"]`
`na_values`	list/str	Additional NA strings	`["N/A", "-", "NULL"]`
`index_col`	int/str	Column to use as row index	`0`, `"customer_id"`
`engine`	str	Parsing engine to use	`"openpyxl"` (xlsx), `"xlrd"` (xls)

Reading Real-World Messy Excel Files

Clean Excel files with one table starting at row 1 are the exception. Real files are messy. Here are patterns for common situations.

Pattern 1: Title and Metadata Before the Data

Many reports have a company logo, report title, generation date, and other metadata in the first several rows:

Plaintext

Row 1: Acme Corp — Sales Report
Row 2: Generated: 2024-09-15
Row 3: (blank)
Row 4: Region | Product | Q1 Revenue | Q2 Revenue | ...
Row 5: East   | Widget A | 45000      | 48000      | ...

Row 1: Acme Corp — Sales Report
Row 2: Generated: 2024-09-15
Row 3: (blank)
Row 4: Region | Product | Q1 Revenue | Q2 Revenue | ...
Row 5: East   | Widget A | 45000      | 48000      | ...

Python

# Preview to find the header row
df_raw = pd.read_excel("data/sales_report.xlsx", header=None, nrows=10)
print(df_raw)

# Skip the 3 title/metadata rows, use row 4 (index 3) as header
df = pd.read_excel("data/sales_report.xlsx", skiprows=3)
# pandas uses row 4 as the header automatically after skipping 3 rows

# Preview to find the header row
df_raw = pd.read_excel("data/sales_report.xlsx", header=None, nrows=10)
print(df_raw)

# Skip the 3 title/metadata rows, use row 4 (index 3) as header
df = pd.read_excel("data/sales_report.xlsx", skiprows=3)
# pandas uses row 4 as the header automatically after skipping 3 rows

Pattern 2: Multiple Tables on One Sheet

Sometimes analysts put multiple separate tables on a single sheet:

Plaintext

Row 1-6:   Summary table
Row 7:     (blank)
Row 8-20:  Detail table

Row 1-6:   Summary table
Row 7:     (blank)
Row 8-20:  Detail table

Python

# Read each table separately by specifying row ranges
summary_df = pd.read_excel("data/report.xlsx", nrows=5)           # Rows 2-6 (5 rows)
detail_df   = pd.read_excel("data/report.xlsx", skiprows=7)       # Rows 8 onward

# Or read the full sheet and split by blank rows
df_full = pd.read_excel("data/report.xlsx", header=None)
blank_rows = df_full[df_full.isnull().all(axis=1)].index.tolist()
print(f"Blank rows (separators): {blank_rows}")  # [6]

table1 = df_full.iloc[:blank_rows[0]].dropna(how="all")
table2 = df_full.iloc[blank_rows[0]+1:].dropna(how="all")
# Set first row as header for each table
table1.columns = table1.iloc[0]
table1 = table1.iloc[1:].reset_index(drop=True)

# Read each table separately by specifying row ranges
summary_df = pd.read_excel("data/report.xlsx", nrows=5)           # Rows 2-6 (5 rows)
detail_df   = pd.read_excel("data/report.xlsx", skiprows=7)       # Rows 8 onward

# Or read the full sheet and split by blank rows
df_full = pd.read_excel("data/report.xlsx", header=None)
blank_rows = df_full[df_full.isnull().all(axis=1)].index.tolist()
print(f"Blank rows (separators): {blank_rows}")  # [6]

table1 = df_full.iloc[:blank_rows[0]].dropna(how="all")
table2 = df_full.iloc[blank_rows[0]+1:].dropna(how="all")
# Set first row as header for each table
table1.columns = table1.iloc[0]
table1 = table1.iloc[1:].reset_index(drop=True)

Pattern 3: Merged Cells

Merged cells are a common pain point — pandas unmerges them, filling continuation rows with NaN:

Plaintext

Excel display:          What pandas reads:
Region  | Sales         Region  | Sales
East    | 100           East    | 100
        | 200           NaN     | 200   ← NaN where merged cell continues
        | 150           NaN     | 150
West    | 80            West    | 80
        | 90            NaN     | 90

Excel display:          What pandas reads:
Region  | Sales         Region  | Sales
East    | 100           East    | 100
        | 200           NaN     | 200   ← NaN where merged cell continues
        | 150           NaN     | 150
West    | 80            West    | 80
        | 90            NaN     | 90

Python

df = pd.read_excel("data/merged_regions.xlsx")
print(df)
# Region  Sales
# East    100
# NaN     200
# NaN     150
# West    80
# NaN     90

# Fix: forward-fill the NaN values created by merged cells
df["Region"] = df["Region"].fillna(method="ffill")
# Or equivalently:
df["Region"] = df["Region"].ffill()

print(df)
# Region  Sales
# East    100
# East    200
# East    150
# West    80
# West    90

df = pd.read_excel("data/merged_regions.xlsx")
print(df)
# Region  Sales
# East    100
# NaN     200
# NaN     150
# West    80
# NaN     90

# Fix: forward-fill the NaN values created by merged cells
df["Region"] = df["Region"].fillna(method="ffill")
# Or equivalently:
df["Region"] = df["Region"].ffill()

print(df)
# Region  Sales
# East    100
# East    200
# East    150
# West    80
# West    90

Pattern 4: Reading All Sheets and Combining

When a workbook has one sheet per time period (monthly, quarterly, yearly):

Python

import pandas as pd

def read_and_combine_sheets(filepath: str,
                            add_sheet_column: bool = True,
                            sheet_column_name: str = "sheet") -> pd.DataFrame:
    """
    Read all sheets from an Excel file and combine into one DataFrame.

    Parameters
    ----------
    filepath : str
        Path to the Excel workbook.
    add_sheet_column : bool
        If True, add a column with the originating sheet name.
    sheet_column_name : str
        Name for the sheet name column.

    Returns
    -------
    pd.DataFrame
        All sheets combined vertically.
    """
    all_sheets = pd.read_excel(filepath, sheet_name=None)
    dfs = []

    for sheet_name, df in all_sheets.items():
        if df.empty:
            continue
        if add_sheet_column:
            df[sheet_column_name] = sheet_name
        dfs.append(df)

    if not dfs:
        raise ValueError("No non-empty sheets found")

    combined = pd.concat(dfs, ignore_index=True)
    print(f"Combined {len(dfs)} sheets: {combined.shape}")
    return combined

# Combine all monthly sales sheets into one DataFrame
df_all = read_and_combine_sheets("data/monthly_sales_2024.xlsx",
                                  sheet_column_name="month")
print(df_all["month"].value_counts())

import pandas as pd

def read_and_combine_sheets(filepath: str,
                            add_sheet_column: bool = True,
                            sheet_column_name: str = "sheet") -> pd.DataFrame:
    """
    Read all sheets from an Excel file and combine into one DataFrame.

    Parameters
    ----------
    filepath : str
        Path to the Excel workbook.
    add_sheet_column : bool
        If True, add a column with the originating sheet name.
    sheet_column_name : str
        Name for the sheet name column.

    Returns
    -------
    pd.DataFrame
        All sheets combined vertically.
    """
    all_sheets = pd.read_excel(filepath, sheet_name=None)
    dfs = []

    for sheet_name, df in all_sheets.items():
        if df.empty:
            continue
        if add_sheet_column:
            df[sheet_column_name] = sheet_name
        dfs.append(df)

    if not dfs:
        raise ValueError("No non-empty sheets found")

    combined = pd.concat(dfs, ignore_index=True)
    print(f"Combined {len(dfs)} sheets: {combined.shape}")
    return combined

# Combine all monthly sales sheets into one DataFrame
df_all = read_and_combine_sheets("data/monthly_sales_2024.xlsx",
                                  sheet_column_name="month")
print(df_all["month"].value_counts())

Writing Excel Files: to_excel() and ExcelWriter

Basic Writing

Python

import pandas as pd

df = pd.DataFrame({
    "customer_id":  ["CUST_001", "CUST_002", "CUST_003"],
    "name":         ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "total_spend":  [384.96, 139.98, 269.97],
    "signup_date":  pd.to_datetime(["2022-03-14", "2021-11-28", "2023-01-05"]),
    "is_premium":   [True, False, True]
})

# Simple write — always set index=False to avoid writing the row numbers as a column
df.to_excel("output/customers.xlsx", index=False)

# Write to a specific sheet with a custom name
df.to_excel(
    "output/customers.xlsx",
    sheet_name="Customer Data",
    index=False
)

# Freeze the header row (stays visible when scrolling)
# This requires ExcelWriter — see next section

import pandas as pd

df = pd.DataFrame({
    "customer_id":  ["CUST_001", "CUST_002", "CUST_003"],
    "name":         ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "total_spend":  [384.96, 139.98, 269.97],
    "signup_date":  pd.to_datetime(["2022-03-14", "2021-11-28", "2023-01-05"]),
    "is_premium":   [True, False, True]
})

# Simple write — always set index=False to avoid writing the row numbers as a column
df.to_excel("output/customers.xlsx", index=False)

# Write to a specific sheet with a custom name
df.to_excel(
    "output/customers.xlsx",
    sheet_name="Customer Data",
    index=False
)

# Freeze the header row (stays visible when scrolling)
# This requires ExcelWriter — see next section

Writing Multiple Sheets with ExcelWriter

pd.ExcelWriter is the tool for writing multiple DataFrames to separate sheets in a single Excel file:

Python

import pandas as pd

# Sample DataFrames
summary_df = pd.DataFrame({
    "metric": ["Total Customers", "Total Revenue", "Avg Order Value", "Premium %"],
    "value":  [8, 1029.84, 102.98, 50.0]
})

customers_df = pd.DataFrame({
    "customer_id": ["CUST_001", "CUST_002", "CUST_003"],
    "name":        ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "city":        ["Austin", "Seattle", "Chicago"],
    "total_spend": [384.96, 139.98, 269.97]
})

orders_df = pd.DataFrame({
    "order_id":   ["ORD_1001", "ORD_1002", "ORD_1003"],
    "customer_id":["CUST_001", "CUST_002", "CUST_001"],
    "amount":     [199.98, 49.99, 149.99],
    "status":     ["delivered", "delivered", "delivered"]
})

# Write all three DataFrames to separate sheets
with pd.ExcelWriter("output/full_report.xlsx", engine="openpyxl") as writer:
    summary_df.to_excel(writer, sheet_name="Summary",   index=False)
    customers_df.to_excel(writer, sheet_name="Customers", index=False)
    orders_df.to_excel(writer, sheet_name="Orders",     index=False)

print("Multi-sheet workbook written successfully.")

import pandas as pd

# Sample DataFrames
summary_df = pd.DataFrame({
    "metric": ["Total Customers", "Total Revenue", "Avg Order Value", "Premium %"],
    "value":  [8, 1029.84, 102.98, 50.0]
})

customers_df = pd.DataFrame({
    "customer_id": ["CUST_001", "CUST_002", "CUST_003"],
    "name":        ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "city":        ["Austin", "Seattle", "Chicago"],
    "total_spend": [384.96, 139.98, 269.97]
})

orders_df = pd.DataFrame({
    "order_id":   ["ORD_1001", "ORD_1002", "ORD_1003"],
    "customer_id":["CUST_001", "CUST_002", "CUST_001"],
    "amount":     [199.98, 49.99, 149.99],
    "status":     ["delivered", "delivered", "delivered"]
})

# Write all three DataFrames to separate sheets
with pd.ExcelWriter("output/full_report.xlsx", engine="openpyxl") as writer:
    summary_df.to_excel(writer, sheet_name="Summary",   index=False)
    customers_df.to_excel(writer, sheet_name="Customers", index=False)
    orders_df.to_excel(writer, sheet_name="Orders",     index=False)

print("Multi-sheet workbook written successfully.")

The with block ensures the file is properly saved and closed even if an error occurs during writing.

Appending to an Existing Excel File

Python

# Open an existing file and add a new sheet without touching existing sheets
with pd.ExcelWriter(
    "output/full_report.xlsx",
    engine="openpyxl",
    mode="a"                    # "a" = append mode (don't overwrite)
) as writer:
    new_predictions_df.to_excel(writer, sheet_name="Churn Predictions", index=False)

print("New sheet added to existing workbook.")

# Open an existing file and add a new sheet without touching existing sheets
with pd.ExcelWriter(
    "output/full_report.xlsx",
    engine="openpyxl",
    mode="a"                    # "a" = append mode (don't overwrite)
) as writer:
    new_predictions_df.to_excel(writer, sheet_name="Churn Predictions", index=False)

print("New sheet added to existing workbook.")

Important: If you open an existing Excel file with mode="a" and write to a sheet name that already exists, it raises an error. Use if_sheet_exists="replace" to overwrite:

Python

with pd.ExcelWriter(
    "output/report.xlsx",
    engine="openpyxl",
    mode="a",
    if_sheet_exists="replace"  # Overwrite if sheet already exists
) as writer:
    updated_summary.to_excel(writer, sheet_name="Summary", index=False)

with pd.ExcelWriter(
    "output/report.xlsx",
    engine="openpyxl",
    mode="a",
    if_sheet_exists="replace"  # Overwrite if sheet already exists
) as writer:
    updated_summary.to_excel(writer, sheet_name="Summary", index=False)

Writing to a Specific Position (startrow, startcol)

Write a DataFrame at a specific row and column offset — useful when adding a title row or multiple tables on one sheet:

Python

with pd.ExcelWriter("output/positioned_report.xlsx", engine="openpyxl") as writer:
    # Write a title manually (we'll add it via openpyxl formatting)
    # Leave row 0 for the title, start data at row 1 (Excel row 2)
    df.to_excel(
        writer,
        sheet_name="Report",
        startrow=1,      # Start data at row index 1 (Excel row 2)
        startcol=0,
        index=False
    )

    # Write a second table to the right of the first
    summary_df.to_excel(
        writer,
        sheet_name="Report",
        startrow=1,
        startcol=df.shape[1] + 2,   # Two columns gap to the right
        index=False
    )

with pd.ExcelWriter("output/positioned_report.xlsx", engine="openpyxl") as writer:
    # Write a title manually (we'll add it via openpyxl formatting)
    # Leave row 0 for the title, start data at row 1 (Excel row 2)
    df.to_excel(
        writer,
        sheet_name="Report",
        startrow=1,      # Start data at row index 1 (Excel row 2)
        startcol=0,
        index=False
    )

    # Write a second table to the right of the first
    summary_df.to_excel(
        writer,
        sheet_name="Report",
        startrow=1,
        startcol=df.shape[1] + 2,   # Two columns gap to the right
        index=False
    )

Formatting Excel Output with openpyxl

pd.to_excel() writes clean data but produces plain, unformatted spreadsheets. For professional-looking output with headers styled, columns auto-sized, and number formats applied, use openpyxl directly after writing with ExcelWriter.

Accessing the Workbook After Writing

Python

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

# Step 1: Write data with pandas
df.to_excel("output/formatted_report.xlsx", sheet_name="Data", index=False)

# Step 2: Open with openpyxl for formatting
wb = load_workbook("output/formatted_report.xlsx")
ws = wb["Data"]

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

# Step 1: Write data with pandas
df.to_excel("output/formatted_report.xlsx", sheet_name="Data", index=False)

# Step 2: Open with openpyxl for formatting
wb = load_workbook("output/formatted_report.xlsx")
ws = wb["Data"]

Styling Headers

Python

# Style the header row
header_font = Font(bold=True, color="FFFFFF", size=11)
header_fill = PatternFill(fill_type="solid", start_color="2F75B6")  # Blue background
header_alignment = Alignment(horizontal="center", vertical="center")

for cell in ws[1]:   # Row 1 is the header row (1-indexed in openpyxl)
    cell.font = header_font
    cell.fill = header_fill
    cell.alignment = header_alignment

# Set header row height
ws.row_dimensions[1].height = 20

# Style the header row
header_font = Font(bold=True, color="FFFFFF", size=11)
header_fill = PatternFill(fill_type="solid", start_color="2F75B6")  # Blue background
header_alignment = Alignment(horizontal="center", vertical="center")

for cell in ws[1]:   # Row 1 is the header row (1-indexed in openpyxl)
    cell.font = header_font
    cell.fill = header_fill
    cell.alignment = header_alignment

# Set header row height
ws.row_dimensions[1].height = 20

Auto-Sizing Column Widths

openpyxl doesn’t auto-fit columns natively, but this function calculates appropriate widths:

Python

def auto_size_columns(worksheet, min_width: int = 10, max_width: int = 40):
    """
    Auto-size all columns in a worksheet based on cell content length.
    """
    for column_cells in worksheet.columns:
        max_length = 0
        col_letter = get_column_letter(column_cells[0].column)

        for cell in column_cells:
            if cell.value is not None:
                try:
                    cell_length = len(str(cell.value))
                    max_length = max(max_length, cell_length)
                except Exception:
                    pass

        adjusted_width = max(min_width, min(max_length + 2, max_width))
        worksheet.column_dimensions[col_letter].width = adjusted_width

auto_size_columns(ws)

def auto_size_columns(worksheet, min_width: int = 10, max_width: int = 40):
    """
    Auto-size all columns in a worksheet based on cell content length.
    """
    for column_cells in worksheet.columns:
        max_length = 0
        col_letter = get_column_letter(column_cells[0].column)

        for cell in column_cells:
            if cell.value is not None:
                try:
                    cell_length = len(str(cell.value))
                    max_length = max(max_length, cell_length)
                except Exception:
                    pass

        adjusted_width = max(min_width, min(max_length + 2, max_width))
        worksheet.column_dimensions[col_letter].width = adjusted_width

auto_size_columns(ws)

Number Formatting

Python

# Apply number formats to specific columns
# openpyxl uses Excel's built-in number format codes

for row in ws.iter_rows(min_row=2, min_col=3, max_col=3):  # Column C, all data rows
    for cell in row:
        cell.number_format = "$#,##0.00"    # Currency with 2 decimal places

for row in ws.iter_rows(min_row=2, min_col=4, max_col=4):  # Column D — percentages
    for cell in row:
        cell.number_format = "0.0%"

for row in ws.iter_rows(min_row=2, min_col=5, max_col=5):  # Column E — dates
    for cell in row:
        cell.number_format = "YYYY-MM-DD"

# Apply number formats to specific columns
# openpyxl uses Excel's built-in number format codes

for row in ws.iter_rows(min_row=2, min_col=3, max_col=3):  # Column C, all data rows
    for cell in row:
        cell.number_format = "$#,##0.00"    # Currency with 2 decimal places

for row in ws.iter_rows(min_row=2, min_col=4, max_col=4):  # Column D — percentages
    for cell in row:
        cell.number_format = "0.0%"

for row in ws.iter_rows(min_row=2, min_col=5, max_col=5):  # Column E — dates
    for cell in row:
        cell.number_format = "YYYY-MM-DD"

Adding Freeze Panes

Freeze the header row (and optionally first column) so they stay visible when scrolling:

Python

# Freeze row 1 (header stays visible when scrolling down)
ws.freeze_panes = "A2"

# Freeze both header row AND first column
ws.freeze_panes = "B2"

# Freeze only the first column
ws.freeze_panes = "B1"

# Freeze row 1 (header stays visible when scrolling down)
ws.freeze_panes = "A2"

# Freeze both header row AND first column
ws.freeze_panes = "B2"

# Freeze only the first column
ws.freeze_panes = "B1"

Adding Alternating Row Colors (Zebra Stripes)

Python

from openpyxl.styles import PatternFill

light_fill = PatternFill(fill_type="solid", start_color="EBF5FB")  # Light blue
white_fill = PatternFill(fill_type="solid", start_color="FFFFFF")  # White

for row_idx in range(2, ws.max_row + 1):   # Start from row 2 (skip header)
    fill = light_fill if row_idx % 2 == 0 else white_fill
    for cell in ws[row_idx]:
        cell.fill = fill

from openpyxl.styles import PatternFill

light_fill = PatternFill(fill_type="solid", start_color="EBF5FB")  # Light blue
white_fill = PatternFill(fill_type="solid", start_color="FFFFFF")  # White

for row_idx in range(2, ws.max_row + 1):   # Start from row 2 (skip header)
    fill = light_fill if row_idx % 2 == 0 else white_fill
    for cell in ws[row_idx]:
        cell.fill = fill

Complete Formatting Example

Python

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

def write_formatted_excel(df: pd.DataFrame, filepath: str,
                           sheet_name: str = "Data",
                           title: str = None) -> None:
    """
    Write a DataFrame to a professionally formatted Excel file.

    Parameters
    ----------
    df : pd.DataFrame
        Data to write.
    filepath : str
        Output file path.
    sheet_name : str
        Name for the sheet.
    title : str, optional
        Title to display above the table.
    """
    # Write data with pandas
    start_row = 2 if title else 0   # Leave row 1 for title if provided
    df.to_excel(filepath, sheet_name=sheet_name,
                startrow=start_row, index=False)

    # Open for formatting
    wb = load_workbook(filepath)
    ws = wb[sheet_name]
    header_row = start_row + 1   # Excel rows are 1-indexed

    # Add title if provided
    if title:
        ws["A1"] = title
        ws["A1"].font = Font(bold=True, size=14)
        ws["A1"].alignment = Alignment(horizontal="left")
        ws.row_dimensions[1].height = 22

    # Style header row
    header_font  = Font(bold=True, color="FFFFFF", size=11)
    header_fill  = PatternFill(fill_type="solid", start_color="2F75B6")
    header_align = Alignment(horizontal="center", vertical="center")

    for cell in ws[header_row]:
        cell.font      = header_font
        cell.fill      = header_fill
        cell.alignment = header_align
    ws.row_dimensions[header_row].height = 18

    # Alternating row colors for data rows
    light_fill = PatternFill(fill_type="solid", start_color="DEEAF1")
    for row_idx in range(header_row + 1, ws.max_row + 1):
        fill = light_fill if (row_idx - header_row) % 2 == 0 else None
        if fill:
            for cell in ws[row_idx]:
                cell.fill = fill

    # Freeze the header row
    ws.freeze_panes = f"A{header_row + 1}"

    # Auto-size columns
    for column_cells in ws.columns:
        max_len = 0
        col_letter = get_column_letter(column_cells[0].column)
        for cell in column_cells:
            if cell.value:
                max_len = max(max_len, len(str(cell.value)))
        ws.column_dimensions[col_letter].width = min(max(max_len + 2, 10), 40)

    wb.save(filepath)
    print(f"Formatted Excel written: {filepath}")


# Usage
df_customers = pd.DataFrame({
    "Customer ID":  ["CUST_001", "CUST_002", "CUST_003"],
    "Name":         ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "City":         ["Austin", "Seattle", "Chicago"],
    "Total Spend":  [384.96, 139.98, 269.97],
    "Premium":      [True, False, True],
    "Last Order":   pd.to_datetime(["2024-09-10", "2024-09-01", "2024-09-15"])
})

write_formatted_excel(
    df_customers,
    "output/customer_report_formatted.xlsx",
    title="Customer Summary — September 2024"
)

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

def write_formatted_excel(df: pd.DataFrame, filepath: str,
                           sheet_name: str = "Data",
                           title: str = None) -> None:
    """
    Write a DataFrame to a professionally formatted Excel file.

    Parameters
    ----------
    df : pd.DataFrame
        Data to write.
    filepath : str
        Output file path.
    sheet_name : str
        Name for the sheet.
    title : str, optional
        Title to display above the table.
    """
    # Write data with pandas
    start_row = 2 if title else 0   # Leave row 1 for title if provided
    df.to_excel(filepath, sheet_name=sheet_name,
                startrow=start_row, index=False)

    # Open for formatting
    wb = load_workbook(filepath)
    ws = wb[sheet_name]
    header_row = start_row + 1   # Excel rows are 1-indexed

    # Add title if provided
    if title:
        ws["A1"] = title
        ws["A1"].font = Font(bold=True, size=14)
        ws["A1"].alignment = Alignment(horizontal="left")
        ws.row_dimensions[1].height = 22

    # Style header row
    header_font  = Font(bold=True, color="FFFFFF", size=11)
    header_fill  = PatternFill(fill_type="solid", start_color="2F75B6")
    header_align = Alignment(horizontal="center", vertical="center")

    for cell in ws[header_row]:
        cell.font      = header_font
        cell.fill      = header_fill
        cell.alignment = header_align
    ws.row_dimensions[header_row].height = 18

    # Alternating row colors for data rows
    light_fill = PatternFill(fill_type="solid", start_color="DEEAF1")
    for row_idx in range(header_row + 1, ws.max_row + 1):
        fill = light_fill if (row_idx - header_row) % 2 == 0 else None
        if fill:
            for cell in ws[row_idx]:
                cell.fill = fill

    # Freeze the header row
    ws.freeze_panes = f"A{header_row + 1}"

    # Auto-size columns
    for column_cells in ws.columns:
        max_len = 0
        col_letter = get_column_letter(column_cells[0].column)
        for cell in column_cells:
            if cell.value:
                max_len = max(max_len, len(str(cell.value)))
        ws.column_dimensions[col_letter].width = min(max(max_len + 2, 10), 40)

    wb.save(filepath)
    print(f"Formatted Excel written: {filepath}")


# Usage
df_customers = pd.DataFrame({
    "Customer ID":  ["CUST_001", "CUST_002", "CUST_003"],
    "Name":         ["Jane Smith", "Bob Johnson", "Alice Williams"],
    "City":         ["Austin", "Seattle", "Chicago"],
    "Total Spend":  [384.96, 139.98, 269.97],
    "Premium":      [True, False, True],
    "Last Order":   pd.to_datetime(["2024-09-10", "2024-09-01", "2024-09-15"])
})

write_formatted_excel(
    df_customers,
    "output/customer_report_formatted.xlsx",
    title="Customer Summary — September 2024"
)

Common Problems and Solutions

Problem 1: Mixed Data Types in a Column

Excel allows mixed types within a column (some cells text, others numbers). pandas infers a type per column, which can cause unexpected results:

Python

# Column "zip_code" inferred as int64 — drops leading zeros
# CUST_001: 02134 → 2134 (Boston ZIP incorrectly truncated)

# Fix: Force string type for columns that look numeric but aren't
df = pd.read_excel("data/customers.xlsx", dtype={"zip_code": str})

# Result: "02134" preserved as string

# Column "zip_code" inferred as int64 — drops leading zeros
# CUST_001: 02134 → 2134 (Boston ZIP incorrectly truncated)

# Fix: Force string type for columns that look numeric but aren't
df = pd.read_excel("data/customers.xlsx", dtype={"zip_code": str})

# Result: "02134" preserved as string

Problem 2: Dates Stored as Numbers

Excel sometimes exports dates as serial numbers (days since 1900-01-01) rather than date strings:

Python

# Column reads as 45183 instead of 2023-09-15

# Fix 1: Parse dates automatically
df = pd.read_excel("data/orders.xlsx", parse_dates=["order_date"])

# Fix 2: If auto-parsing fails, convert the Excel date serial manually
from datetime import datetime, timedelta

def excel_date_to_datetime(excel_serial: int) -> datetime:
    """Convert Excel date serial number to <a href="https://techietory.com/data-science/working-with-date-and-time-in-python/">Python datetime</a>."""
    # Excel serial 1 = Jan 1, 1900; but Excel incorrectly treats 1900 as a leap year
    return datetime(1899, 12, 30) + timedelta(days=int(excel_serial))

df["order_date"] = df["order_date"].apply(excel_date_to_datetime)

# Column reads as 45183 instead of 2023-09-15

# Fix 1: Parse dates automatically
df = pd.read_excel("data/orders.xlsx", parse_dates=["order_date"])

# Fix 2: If auto-parsing fails, convert the Excel date serial manually
from datetime import datetime, timedelta

def excel_date_to_datetime(excel_serial: int) -> datetime:
    """Convert Excel date serial number to Python datetime."""
    # Excel serial 1 = Jan 1, 1900; but Excel incorrectly treats 1900 as a leap year
    return datetime(1899, 12, 30) + timedelta(days=int(excel_serial))

df["order_date"] = df["order_date"].apply(excel_date_to_datetime)

Problem 3: Formula Cells Return NaN

When an Excel file contains formulas, openpyxl reads the cached formula string, not the calculated value, when opening normally:

Python

from openpyxl import load_workbook

# Reading a file with formulas
wb = load_workbook("data/formula_report.xlsx")
ws = wb.active
print(ws["B5"].value)  # Returns "=SUM(B2:B4)" — the formula string!

# Fix: Use data_only=True to read the cached calculated values
wb = load_workbook("data/formula_report.xlsx", data_only=True)
ws = wb.active
print(ws["B5"].value)  # Returns 157.98 — the cached calculated value
# Note: cached values are from the last time the file was opened in Excel

from openpyxl import load_workbook

# Reading a file with formulas
wb = load_workbook("data/formula_report.xlsx")
ws = wb.active
print(ws["B5"].value)  # Returns "=SUM(B2:B4)" — the formula string!

# Fix: Use data_only=True to read the cached calculated values
wb = load_workbook("data/formula_report.xlsx", data_only=True)
ws = wb.active
print(ws["B5"].value)  # Returns 157.98 — the cached calculated value
# Note: cached values are from the last time the file was opened in Excel

For data science purposes, use data_only=True when you want the calculated values rather than the formulas.

Problem 4: File Already Open in Excel (Windows)

On Windows, if the Excel file is open in the Excel application, Python will fail to write to it:

Python

try:
    df.to_excel("output/report.xlsx", index=False)
except PermissionError:
    print("ERROR: The file is open in Excel. Close it and try again.")
    # Or write to a differently named file:
    df.to_excel("output/report_new.xlsx", index=False)

try:
    df.to_excel("output/report.xlsx", index=False)
except PermissionError:
    print("ERROR: The file is open in Excel. Close it and try again.")
    # Or write to a differently named file:
    df.to_excel("output/report_new.xlsx", index=False)

Problem 5: Unnamed Columns from Merged Headers

When a workbook has merged header cells, the merged region typically shows the value in the first cell only, leaving the subsequent cells empty — which pandas reads as Unnamed: 1, Unnamed: 2, etc.:

Python

# Read the file first to see the actual column names
df_check = pd.read_excel("data/merged_header_report.xlsx", nrows=2)
print(df_check.columns.tolist())
# ['Region', 'Unnamed: 1', 'Q1 Revenue', 'Q2 Revenue', ...]

# Rename columns manually after reading
df = pd.read_excel("data/merged_header_report.xlsx", skiprows=1)
df.columns = ["Region", "Product", "Q1_Revenue", "Q2_Revenue", "Q3_Revenue"]

# Or use a forward-fill approach on the header row
df_raw = pd.read_excel("data/merged_header_report.xlsx", header=None, nrows=2)
headers = df_raw.iloc[0].fillna(method="ffill")  # Forward-fill the merged cells
print(headers.tolist())

# Read the file first to see the actual column names
df_check = pd.read_excel("data/merged_header_report.xlsx", nrows=2)
print(df_check.columns.tolist())
# ['Region', 'Unnamed: 1', 'Q1 Revenue', 'Q2 Revenue', ...]

# Rename columns manually after reading
df = pd.read_excel("data/merged_header_report.xlsx", skiprows=1)
df.columns = ["Region", "Product", "Q1_Revenue", "Q2_Revenue", "Q3_Revenue"]

# Or use a forward-fill approach on the header row
df_raw = pd.read_excel("data/merged_header_report.xlsx", header=None, nrows=2)
headers = df_raw.iloc[0].fillna(method="ffill")  # Forward-fill the merged cells
print(headers.tolist())

Problem 6: Large Excel Files Running Slowly

Excel files with hundreds of thousands of rows load slowly because Excel format is not designed for performance. Solutions:

Python

# Option 1: Read only the columns you need
df = pd.read_excel("data/large_report.xlsx", usecols=["date", "revenue", "region"])

# Option 2: Convert to CSV or Parquet first (much faster for large data)
# If you have to work with a huge Excel file repeatedly:
import pandas as pd

# Do this once
df = pd.read_excel("data/massive_file.xlsx")
df.to_parquet("data/massive_file.parquet")  # Save as Parquet

# All subsequent reads are 10-100x faster
df = pd.read_parquet("data/massive_file.parquet")

# Option 3: Use read_only mode in openpyxl for very large files
from openpyxl import load_workbook

wb = load_workbook("data/massive_file.xlsx", read_only=True)
ws = wb.active

# Iterate rows without loading whole file into memory
for row in ws.iter_rows(values_only=True):
    process_row(row)

wb.close()

# Option 1: Read only the columns you need
df = pd.read_excel("data/large_report.xlsx", usecols=["date", "revenue", "region"])

# Option 2: Convert to CSV or Parquet first (much faster for large data)
# If you have to work with a huge Excel file repeatedly:
import pandas as pd

# Do this once
df = pd.read_excel("data/massive_file.xlsx")
df.to_parquet("data/massive_file.parquet")  # Save as Parquet

# All subsequent reads are 10-100x faster
df = pd.read_parquet("data/massive_file.parquet")

# Option 3: Use read_only mode in openpyxl for very large files
from openpyxl import load_workbook

wb = load_workbook("data/massive_file.xlsx", read_only=True)
ws = wb.active

# Iterate rows without loading whole file into memory
for row in ws.iter_rows(values_only=True):
    process_row(row)

wb.close()

Working with Excel Tables (ListObjects)

Excel has a native “Table” feature (Insert → Table) that creates structured table objects with named ranges and automatic filtering. openpyxl can work with these:

Python

from openpyxl import load_workbook
from openpyxl.worksheet.table import Table, TableStyleInfo

# Read data from a named Excel table
wb = load_workbook("data/workbook_with_tables.xlsx")
ws = wb.active

# List all tables in the sheet
for table_name, table in ws.tables.items():
    print(f"Table: {table_name}, Range: {table.ref}")

# Create an Excel Table when writing
with pd.ExcelWriter("output/table_report.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sales", index=False)

# Open the file to add Table formatting
wb = load_workbook("output/table_report.xlsx")
ws = wb["Sales"]

# Add an Excel Table over the data range
tab = Table(
    displayName="SalesTable",
    ref=f"A1:{get_column_letter(ws.max_column)}{ws.max_row}"
)

# Apply a built-in table style
style = TableStyleInfo(
    name="TableStyleMedium9",  # Blue medium style
    showFirstColumn=False,
    showLastColumn=False,
    showRowStripes=True,
    showColumnStripes=False
)
tab.tableStyleInfo = style
ws.add_table(tab)
ws.freeze_panes = "A2"

wb.save("output/table_report.xlsx")

from openpyxl import load_workbook
from openpyxl.worksheet.table import Table, TableStyleInfo

# Read data from a named Excel table
wb = load_workbook("data/workbook_with_tables.xlsx")
ws = wb.active

# List all tables in the sheet
for table_name, table in ws.tables.items():
    print(f"Table: {table_name}, Range: {table.ref}")

# Create an Excel Table when writing
with pd.ExcelWriter("output/table_report.xlsx", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sales", index=False)

# Open the file to add Table formatting
wb = load_workbook("output/table_report.xlsx")
ws = wb["Sales"]

# Add an Excel Table over the data range
tab = Table(
    displayName="SalesTable",
    ref=f"A1:{get_column_letter(ws.max_column)}{ws.max_row}"
)

# Apply a built-in table style
style = TableStyleInfo(
    name="TableStyleMedium9",  # Blue medium style
    showFirstColumn=False,
    showLastColumn=False,
    showRowStripes=True,
    showColumnStripes=False
)
tab.tableStyleInfo = style
ws.add_table(tab)
ws.freeze_panes = "A2"

wb.save("output/table_report.xlsx")

A Complete Pipeline Example

A realistic end-to-end example: read messy Excel input, clean it, analyze it, and write formatted output:

Python

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.utils import get_column_letter
from pathlib import Path

def process_sales_excel(input_path: str, output_path: str) -> dict:
    """
    Complete pipeline: read → clean → analyze → write formatted output.

    Reads a messy monthly sales Excel file (title row, merged regions,
    multiple sheets), produces a clean summary workbook.
    """

    # ── Step 1: Read all sheets ────────────────────────────────────
    print(f"Reading {input_path}...")
    all_sheets = pd.read_excel(input_path, sheet_name=None)
    print(f"Sheets found: {list(all_sheets.keys())}")

    # ── Step 2: Combine monthly sheets ────────────────────────────
    monthly_dfs = []
    for sheet_name, df in all_sheets.items():
        if df.empty or sheet_name.lower() == "summary":
            continue
        # Each sheet has 2 title rows — skip them
        df = pd.read_excel(input_path, sheet_name=sheet_name,
                           skiprows=2, dtype={"product_id": str})
        df["month"] = sheet_name
        monthly_dfs.append(df)

    combined = pd.concat(monthly_dfs, ignore_index=True)
    print(f"Combined: {combined.shape}")

    # ── Step 3: Clean ──────────────────────────────────────────────
    combined.columns = combined.columns.str.strip().str.lower().str.replace(" ", "_")
    combined = combined.dropna(subset=["product_id", "revenue"])
    combined["revenue"] = pd.to_numeric(combined["revenue"], errors="coerce")
    combined["units"]   = pd.to_numeric(combined["units"],   errors="coerce")
    combined["date"]    = pd.to_datetime(combined["date"],   errors="coerce")

    # ── Step 4: Analyze ────────────────────────────────────────────
    by_region = (
        combined
        .groupby("region")
        .agg(
            total_revenue=("revenue", "sum"),
            total_units=("units", "sum"),
            avg_order=("revenue", "mean"),
            num_transactions=("product_id", "count")
        )
        .round(2)
        .sort_values("total_revenue", ascending=False)
        .reset_index()
    )

    by_product = (
        combined
        .groupby(["product_id", "product_name"])
        .agg(
            total_revenue=("revenue", "sum"),
            total_units=("units", "sum")
        )
        .round(2)
        .sort_values("total_revenue", ascending=False)
        .reset_index()
    )

    # ── Step 5: Write formatted output ────────────────────────────
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        combined.to_excel(writer,    sheet_name="Raw Data",        index=False)
        by_region.to_excel(writer,   sheet_name="By Region",       index=False)
        by_product.to_excel(writer,  sheet_name="By Product",      index=False)

    # Apply formatting
    wb = load_workbook(output_path)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        # Style header row
        for cell in ws[1]:
            cell.font  = Font(bold=True, color="FFFFFF")
            cell.fill  = PatternFill("solid", start_color="2F75B6")
            cell.alignment = Alignment(horizontal="center")
        # Freeze header
        ws.freeze_panes = "A2"
        # Auto-size columns
        for col in ws.columns:
            col_letter = get_column_letter(col[0].column)
            max_len = max((len(str(c.value)) for c in col if c.value), default=8)
            ws.column_dimensions[col_letter].width = min(max_len + 2, 40)

    wb.save(output_path)
    print(f"Output written: {output_path}")

    return {
        "total_revenue":    combined["revenue"].sum(),
        "total_units":      combined["units"].sum(),
        "num_transactions": len(combined),
        "num_regions":      combined["region"].nunique()
    }

# Run the pipeline
# metrics = process_sales_excel(
#     "data/raw/sales_2024.xlsx",
#     "output/sales_analysis_2024.xlsx"
# )
# print(f"Total revenue: ${metrics['total_revenue']:,.2f}")

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment
from openpyxl.utils import get_column_letter
from pathlib import Path

def process_sales_excel(input_path: str, output_path: str) -> dict:
    """
    Complete pipeline: read → clean → analyze → write formatted output.

    Reads a messy monthly sales Excel file (title row, merged regions,
    multiple sheets), produces a clean summary workbook.
    """

    # ── Step 1: Read all sheets ────────────────────────────────────
    print(f"Reading {input_path}...")
    all_sheets = pd.read_excel(input_path, sheet_name=None)
    print(f"Sheets found: {list(all_sheets.keys())}")

    # ── Step 2: Combine monthly sheets ────────────────────────────
    monthly_dfs = []
    for sheet_name, df in all_sheets.items():
        if df.empty or sheet_name.lower() == "summary":
            continue
        # Each sheet has 2 title rows — skip them
        df = pd.read_excel(input_path, sheet_name=sheet_name,
                           skiprows=2, dtype={"product_id": str})
        df["month"] = sheet_name
        monthly_dfs.append(df)

    combined = pd.concat(monthly_dfs, ignore_index=True)
    print(f"Combined: {combined.shape}")

    # ── Step 3: Clean ──────────────────────────────────────────────
    combined.columns = combined.columns.str.strip().str.lower().str.replace(" ", "_")
    combined = combined.dropna(subset=["product_id", "revenue"])
    combined["revenue"] = pd.to_numeric(combined["revenue"], errors="coerce")
    combined["units"]   = pd.to_numeric(combined["units"],   errors="coerce")
    combined["date"]    = pd.to_datetime(combined["date"],   errors="coerce")

    # ── Step 4: Analyze ────────────────────────────────────────────
    by_region = (
        combined
        .groupby("region")
        .agg(
            total_revenue=("revenue", "sum"),
            total_units=("units", "sum"),
            avg_order=("revenue", "mean"),
            num_transactions=("product_id", "count")
        )
        .round(2)
        .sort_values("total_revenue", ascending=False)
        .reset_index()
    )

    by_product = (
        combined
        .groupby(["product_id", "product_name"])
        .agg(
            total_revenue=("revenue", "sum"),
            total_units=("units", "sum")
        )
        .round(2)
        .sort_values("total_revenue", ascending=False)
        .reset_index()
    )

    # ── Step 5: Write formatted output ────────────────────────────
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
        combined.to_excel(writer,    sheet_name="Raw Data",        index=False)
        by_region.to_excel(writer,   sheet_name="By Region",       index=False)
        by_product.to_excel(writer,  sheet_name="By Product",      index=False)

    # Apply formatting
    wb = load_workbook(output_path)
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        # Style header row
        for cell in ws[1]:
            cell.font  = Font(bold=True, color="FFFFFF")
            cell.fill  = PatternFill("solid", start_color="2F75B6")
            cell.alignment = Alignment(horizontal="center")
        # Freeze header
        ws.freeze_panes = "A2"
        # Auto-size columns
        for col in ws.columns:
            col_letter = get_column_letter(col[0].column)
            max_len = max((len(str(c.value)) for c in col if c.value), default=8)
            ws.column_dimensions[col_letter].width = min(max_len + 2, 40)

    wb.save(output_path)
    print(f"Output written: {output_path}")

    return {
        "total_revenue":    combined["revenue"].sum(),
        "total_units":      combined["units"].sum(),
        "num_transactions": len(combined),
        "num_regions":      combined["region"].nunique()
    }

# Run the pipeline
# metrics = process_sales_excel(
#     "data/raw/sales_2024.xlsx",
#     "output/sales_analysis_2024.xlsx"
# )
# print(f"Total revenue: ${metrics['total_revenue']:,.2f}")

Summary

Excel file handling in pandas combines the powerful data manipulation capabilities of pandas with openpyxl’s Excel-specific features. pd.read_excel() handles the full range of real-world Excel files — from clean single-table files to messy multi-header, merged-cell, multi-table workbooks — through its extensive parameter set. pd.ExcelWriter writes clean, multi-sheet workbooks, and openpyxl adds the professional formatting layer (styled headers, auto-sized columns, freeze panes, number formats) that makes output presentable to non-technical stakeholders.

The practical pattern for most data science Excel work is: read with pd.read_excel() using the appropriate parameters for the file’s structure, clean and transform in pandas, write with ExcelWriter, and optionally apply formatting with openpyxl. The common problems — mixed types, Excel serial dates, formula cells, merged headers — each have straightforward fixes that become automatic with experience.

Key Takeaways

pd.read_excel("file.xlsx") reads the first sheet; use sheet_name="Name" for a specific sheet, or sheet_name=None to read all sheets as a dict of DataFrames
Always specify dtype={"id": str} for columns like customer IDs, ZIP codes, and phone numbers — without it, pandas may infer them as integers, truncating leading zeros and breaking joins
skiprows, header, usecols, nrows, and skipfooter handle virtually any non-standard Excel layout — inspect the file with nrows=10, header=None first to understand the structure
Forward-fill (df["col"].ffill()) fixes NaN values created by merged cells that pandas unmerges on read
pd.ExcelWriter with engine="openpyxl" writes multiple DataFrames to separate sheets; use mode="a" to append new sheets to an existing file without disturbing existing ones
openpyxl post-processing adds professional formatting: Font, PatternFill, Alignment for header styling; freeze_panes to keep headers visible while scrolling; auto-sized column widths; and number formats for currency, percentages, and dates
load_workbook("file.xlsx", data_only=True) reads cached formula results rather than formula strings — use this when you need the calculated values from a formula-heavy workbook
For large Excel files, convert to Parquet once (df.to_parquet()) and read Parquet for all subsequent operations — the speed difference is 10–100×

0 Comments

Discover More

Click For More

Working with Excel Files in Pandas

Introduction

Setting Up: Required Libraries

Reading Excel Files: pd.read_excel()

The Basic Pattern

Specifying the Sheet

Controlling Which Rows and Columns to Read

Controlling Data Types and Parsing

Handling Multiple Header Rows (MultiIndex)

Handling Missing Values

A Complete read_excel() Parameter Reference

Reading Real-World Messy Excel Files

Pattern 1: Title and Metadata Before the Data

Pattern 2: Multiple Tables on One Sheet

Pattern 3: Merged Cells

Pattern 4: Reading All Sheets and Combining

Writing Excel Files: to_excel() and ExcelWriter

Basic Writing

Writing Multiple Sheets with ExcelWriter

Appending to an Existing Excel File

Writing to a Specific Position (startrow, startcol)

Formatting Excel Output with openpyxl

Accessing the Workbook After Writing

Styling Headers

Auto-Sizing Column Widths

Number Formatting

Adding Freeze Panes

Adding Alternating Row Colors (Zebra Stripes)

Complete Formatting Example

Common Problems and Solutions

Problem 1: Mixed Data Types in a Column

Problem 2: Dates Stored as Numbers

Problem 3: Formula Cells Return NaN

Problem 4: File Already Open in Excel (Windows)

Problem 5: Unnamed Columns from Merged Headers

Problem 6: Large Excel Files Running Slowly

Working with Excel Tables (ListObjects)

A Complete Pipeline Example

Summary

Key Takeaways

Discover More

Why Machine Learning?

What is Data Science? A Complete Beginner’s Guide

ETL vs ELT: Understanding Data Pipelines

Exploring GPIO Pins on Raspberry Pi: Basics of General-Purpose Input/Output

Working with Date and Time in Python

10 Essential Jupyter Notebook Tips for Beginners