Parsing XML Data in Python

Learn to parse XML data in Python using ElementTree, lxml, and XPath. Master navigating XML trees, extracting elements and attributes, handling namespaces, and converting XML to DataFrames.

XML (eXtensible Markup Language) is parsed in Python primarily using the built-in xml.etree.ElementTree module, which loads an XML document into a navigable tree of Element objects. Each element has a tag name, a dictionary of attributes, text content, and child elements. You navigate with element.find("tag") for the first matching child, element.findall("tag") for all matching children, and XPath expressions for more complex queries — then extract text with element.text and attributes with element.get("attribute_name"). For complex, namespace-heavy, or very large XML, the lxml library provides faster parsing, full XPath 1.0 support, and streaming parsers.

Introduction

While JSON has become the dominant format for modern web APIs, XML remains deeply embedded in many industries and systems that data scientists regularly encounter. Healthcare systems exchange HL7 CDA documents in XML, and HL7 FHIR offers an XML representation alongside JSON. Financial reporting uses XBRL, and FIX messaging has an XML encoding (FIXML). Government open data portals commonly distribute data in XML. Microsoft Office documents (Word, Excel, PowerPoint) are ZIP archives of XML files. RSS and Atom feeds are XML. Enterprise software like SAP, Salesforce, and Oracle can export data in XML. If you work in any of these domains, XML parsing is a required skill.

XML and JSON share the same fundamental concept: both are text-based formats for representing hierarchical, self-describing data. They look and behave differently, though. Where JSON uses curly braces and brackets to define structure, XML uses opening and closing tags. Where JSON has no validation mechanism built into the format itself, XML has schema languages (DTD and XSD) that rigorously define what’s valid. And where JSON maps directly onto Python’s json module, XML has spawned multiple parsing and querying approaches (DOM, SAX, ElementTree, and XPath), each suited to different use cases.

This article teaches you everything you need to parse XML data in Python for data science purposes. We’ll cover the XML format itself, Python’s built-in ElementTree module for standard parsing, XPath for powerful querying, the lxml library for advanced use cases, handling namespaces (XML’s most confusing feature), converting XML to pandas DataFrames, and processing large XML files without loading everything into memory.

XML Fundamentals: Structure and Syntax

Before writing parsing code, you need to understand XML’s structure. Every XML concept maps to something concrete in the Python objects you’ll work with.

Anatomy of an XML Document

XML
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<catalog version="2.0" generated="2024-09-15">

    <product id="PROD_001" featured="true">
        <name>Wireless Headphones</name>
        <category>Electronics</category>
        <price currency="USD">149.99</price>
        <description>Premium noise-cancelling headphones with 40-hour battery.</description>
        <specs>
            <spec name="battery_life">40 hours</spec>
            <spec name="connectivity">Bluetooth 5.0</spec>
            <spec name="weight">250g</spec>
        </specs>
        <tags>
            <tag>audio</tag>
            <tag>wireless</tag>
            <tag>premium</tag>
        </tags>
    </product>

    <product id="PROD_002" featured="false">
        <name>USB-C Hub</name>
        <category>Electronics</category>
        <price currency="USD">49.99</price>
        <description>7-in-1 USB-C hub with HDMI, USB 3.0, and SD card reader.</description>
        <specs>
            <spec name="ports">7</spec>
            <spec name="hdmi_output">4K@30Hz</spec>
        </specs>
        <tags>
            <tag>usb</tag>
            <tag>hub</tag>
        </tags>
    </product>

</catalog>

Key XML concepts:

  • Declaration: <?xml version="1.0" encoding="UTF-8"?> — optional header specifying version and encoding
  • Root element: <catalog> — every XML document has exactly one root element that contains all others
  • Elements: Defined by opening (<name>) and closing (</name>) tags, or self-closing (<br/>)
  • Attributes: Key-value pairs inside the opening tag: id="PROD_001", currency="USD"
  • Text content: The text between opening and closing tags: Wireless Headphones
  • Children: Elements nested inside another element (<specs> contains multiple <spec> elements)
  • Comments: <!-- This is a comment --> (dropped by ElementTree’s default parser; lxml keeps them as comment nodes)
  • Tail text: Text that appears after a closing tag but before the next sibling tag (an ElementTree-specific concept we’ll cover)

XML vs. JSON: The Structural Difference

The same data in both formats illustrates the key structural differences:

JSON
{
  "product": {
    "id": "PROD_001",
    "name": "Wireless Headphones",
    "price": 149.99
  }
}
XML
<product id="PROD_001">
    <name>Wireless Headphones</name>
    <price currency="USD">149.99</price>
</product>

In XML, metadata can be stored as attributes (like id or currency) while the main content goes in element text. In JSON, everything is a key-value pair at the same level. This dual storage mechanism (attributes vs. text) is one reason XML requires more thought to parse — you need to know whether a value you want is an attribute or text content.
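To make the attribute-versus-text distinction concrete, here is a minimal sketch that rebuilds a JSON-style dictionary from the XML fragment above. The helper name product_to_dict is illustrative only and not part of any library.

```python
import xml.etree.ElementTree as ET

xml_fragment = """
<product id="PROD_001">
    <name>Wireless Headphones</name>
    <price currency="USD">149.99</price>
</product>
"""

def product_to_dict(product: ET.Element) -> dict:
    """Flatten one <product> element into a JSON-style dict (hypothetical helper)."""
    price = product.find("price")
    return {
        "id": product.get("id"),             # stored as an attribute
        "name": product.findtext("name"),    # stored as element text
        "price": float(price.text),          # text is always a string until cast
        "currency": price.get("currency"),   # attribute on a child element
    }

record = product_to_dict(ET.fromstring(xml_fragment))
print(record)
```

Note that the XML version carries one value (currency) that the JSON example omitted, which is typical: attributes often hold metadata that a flat key-value view has to make room for.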

Python’s Built-in xml.etree.ElementTree

xml.etree.ElementTree is Python’s standard library XML parser. It implements the ElementTree model — loading the entire XML document into memory as a tree of Element objects, which you then navigate and query.

Parsing an XML String or File

Python
import xml.etree.ElementTree as ET

# Parse from a string
xml_string = """
<catalog version="2.0">
    <product id="PROD_001">
        <name>Wireless Headphones</name>
        <price currency="USD">149.99</price>
    </product>
    <product id="PROD_002">
        <name>USB-C Hub</name>
        <price currency="USD">49.99</price>
    </product>
</catalog>
"""

root = ET.fromstring(xml_string)    # Parse string → root Element

# Parse from a file
tree = ET.parse("data/catalog.xml") # Parse file → ElementTree object
root = tree.getroot()               # Get the root Element

print(root.tag)         # 'catalog'
print(root.attrib)      # {'version': '2.0'}
print(len(root))        # 2 (number of direct children)
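Real-world files are not always well-formed, so it can help to wrap parsing in an error handler. The sketch below assumes a hypothetical helper named safe_parse; ET.ParseError and its position attribute are part of the standard library.

```python
import xml.etree.ElementTree as ET

def safe_parse(xml_text: str):
    """Return the root Element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as exc:
        line, column = exc.position   # (line, column) where the parser gave up
        print(f"Malformed XML at line {line}, column {column}: {exc}")
        return None

good = safe_parse("<catalog><product id='1'/></catalog>")
bad = safe_parse("<catalog><product></catalog>")   # <product> is never closed
```

Returning None is only one possible design; in a data pipeline you may prefer to log the error and re-raise so that a broken file cannot pass silently.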

The Element Object

Every node in the parsed XML tree is an Element object with these key properties:

Python
# Given: <product id="PROD_001" featured="true">
product = root[0]   # First child of root

# Tag name
print(product.tag)           # 'product'

# Attributes dictionary
print(product.attrib)        # {'id': 'PROD_001', 'featured': 'true'}
print(product.get("id"))     # 'PROD_001'
print(product.get("color"))  # None (safe — returns None for missing attributes)
print(product.get("color", "unknown"))  # 'unknown' (with default)

# Text content (direct text between opening and closing tags)
name_elem = product.find("name")
print(name_elem.text)        # 'Wireless Headphones'

# Tail text (text after the element's closing tag, before the next sibling)
# This is rarely needed but important to know about
# In <a>text</a> tail text, "tail text" is the .tail of element <a>

# Children
for child in product:
    print(f"{child.tag}: {child.text}")
# name: Wireless Headphones
# category: Electronics
# price: 149.99
# description: Premium noise-cancelling headphones...
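# specs: (whitespace)  ← specs has child elements; its .text is just the indentation before the first child
# tags: (whitespace)   ← same for tags (.text is None only when nothing precedes the first child)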
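Tail text is easiest to understand on mixed content, where text and child elements are interleaved. The following small sketch uses a made-up <p> fragment to show where each piece of text ends up.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring("<p>Ships in <b>2 days</b> worldwide.</p>")
bold = para.find("b")

print(repr(para.text))   # 'Ships in '     (text before the first child)
print(repr(bold.text))   # '2 days'        (text inside <b>)
print(repr(bold.tail))   # ' worldwide.'   (text after </b>, stored on <b>, not on <p>)

# itertext() walks .text and .tail values in document order
full_text = "".join(para.itertext())
print(full_text)         # Ships in 2 days worldwide.
```

For data-oriented XML, where elements hold either text or children but not both, .tail usually contains only indentation whitespace and can be ignored.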

Navigating the Tree: find() and findall()

Python
# find(): returns the FIRST matching child element (or None)
# findall(): returns a LIST of ALL matching child elements

# Find immediate children by tag name
name_elem  = product.find("name")    # First <name> child
price_elem = product.find("price")   # First <price> child

print(name_elem.text)                # 'Wireless Headphones'
print(price_elem.text)               # '149.99'
print(price_elem.get("currency"))    # 'USD'

# Find all products at the root level
all_products = root.findall("product")
print(f"Total products: {len(all_products)}")  # 2

# Find nested elements (one level at a time)
specs = product.find("specs")
all_specs = specs.findall("spec")
for spec in all_specs:
    print(f"{spec.get('name')}: {spec.text}")
# battery_life: 40 hours
# connectivity: Bluetooth 5.0
# weight: 250g

# Find using a simple path (slash-separated tags)
# This navigates: product → specs → spec
all_spec_elems = product.findall("specs/spec")
print(len(all_spec_elems))           # 3

Iterating Over Children

Python
# Iterate over direct children
for child in product:
    print(f"Tag: {child.tag}, Text: {child.text}")

# iter(): recursive iteration over ALL descendants (all levels deep)
for elem in product.iter():
    if elem.text and elem.text.strip():
        print(f"{elem.tag}: {elem.text.strip()}")

# iter() with a tag filter — find ALL elements with a given tag, anywhere in the tree
for spec in root.iter("spec"):   # Find all <spec> elements in the entire document
    print(f"{spec.get('name')}: {spec.text}")

for tag_elem in root.iter("tag"):
    print(tag_elem.text)
# audio, wireless, premium, usb, hub

find() vs. findall() vs. iter(): When to Use Each

Method | Scope | Returns | Use When
elem.find("tag") | Direct children only | First match or None | Getting a single known child element
elem.findall("tag") | Direct children only | List of matches | Getting all children with a specific tag
elem.findall("a/b") | Two-level path | List of matches | Getting specific nested elements
elem.iter("tag") | All descendants (recursive) | Iterator | Finding all occurrences anywhere in the tree
elem.iter() | All descendants (recursive) | Iterator | Traversing the entire subtree
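The differences in scope are easy to verify on a small document. This sketch uses a made-up two-product catalog and counts what each method finds.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<catalog>
    <product><name>A</name><specs><spec>x</spec><spec>y</spec></specs></product>
    <product><name>B</name><specs><spec>z</spec></specs></product>
</catalog>
""")

direct_specs = doc.findall("spec")                # direct children only
path_specs = doc.findall("product/specs/spec")    # explicit three-level path
all_specs = list(doc.iter("spec"))                # any depth
first_name = doc.find("product/name").text        # first match along the path

print(len(direct_specs), len(path_specs), len(all_specs), first_name)   # 0 3 3 A
```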

XPath: Powerful XML Querying

XPath is a query language for XML — like SQL for databases but for navigating and selecting nodes in an XML tree. Python’s ElementTree supports a useful subset of XPath, and lxml supports the full XPath 1.0 standard.

XPath Expressions in ElementTree

Python
import xml.etree.ElementTree as ET

# Parse the full catalog XML
tree = ET.parse("data/catalog.xml")
root = tree.getroot()

# Basic XPath syntax in ElementTree:

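# // means "at any depth below"
# ElementTree supports it in relative paths: root.findall(".//spec") finds every
# <spec> under root. A path that starts with "//" (absolute) is rejected on an Element.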

# . means "current element"
# * matches any tag name

# Find all <product> children directly under root
products = root.findall("product")

# Find all <spec> elements that are grandchildren (product/specs/spec)
# Using path notation:
all_specs = root.findall("product/specs/spec")

# Find by attribute value — XPath predicate syntax
featured_products = root.findall("product[@featured='true']")
print(f"Featured products: {len(featured_products)}")

# Find specific product by ID attribute
prod_001 = root.find("product[@id='PROD_001']")
print(prod_001.find("name").text)   # 'Wireless Headphones'

# Find products where a specific child exists
# (ElementTree supports this limited form)
products_with_specs = root.findall("product[specs]")

# Find the first child with a specific tag
first_product = root.find("product")

# Wildcard: * matches any tag, so "*/name" finds <name> under any direct child
any_name = root.findall("*/name")    # Searches children's children for <name>

# Get the text of all <name> elements under any direct child
names = [elem.text for elem in root.findall("*/name")]
print(names)  # ['Wireless Headphones', 'USB-C Hub']

XPath Predicates for Filtering

Python
# Filter by attribute presence
products_with_featured_attr = root.findall("product[@featured]")

# Filter by attribute value
featured = root.findall("product[@featured='true']")
not_featured = root.findall("product[@featured='false']")

# Filter by child element presence
has_description = root.findall("product[description]")

# Position predicates (ElementTree supports limited position)
first = root.find("product[1]")    # First <product> (1-indexed in XPath!)
second = root.find("product[2]")   # Second <product>
last   = root.find("product[last()]")  # Last <product>
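ElementTree’s XPath subset also covers descendant searches and a predicate on child text. The sketch below runs on a made-up three-product catalog (the third product is invented for illustration).

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<catalog>
    <product id="PROD_001"><name>Wireless Headphones</name><category>Electronics</category>
        <specs><spec name="weight">250g</spec></specs></product>
    <product id="PROD_002"><name>USB-C Hub</name><category>Electronics</category>
        <specs><spec name="ports">7</spec></specs></product>
    <product id="PROD_003"><name>Desk Lamp</name><category>Home</category></product>
</catalog>
""")

# .// searches all descendants of the current element
spec_names = [s.get("name") for s in root.findall(".//spec")]

# [tag='text'] keeps elements whose child <tag> has exactly that text
electronics = [p.get("id") for p in root.findall("product[category='Electronics']")]

# Attribute predicates work at any depth when combined with .//
weight = root.findtext(".//spec[@name='weight']")

# last()-1 addresses the element just before the last one
second_to_last = root.find("product[last()-1]").get("id")

print(spec_names, electronics, weight, second_to_last)
```

Numeric comparisons such as price > 100 and functions such as contains() are outside this subset; those need lxml or a Python-side filter.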

Extracting XML to a pandas DataFrame

The goal for most data science work is transforming XML into a flat DataFrame. Here are the patterns for different XML structures.

Pattern 1: Flat Repeated Elements

When the XML has a list of elements with the same structure:

Python
import xml.etree.ElementTree as ET
import pandas as pd

xml_data = """
<customers>
    <customer id="CUST_001" premium="true">
        <name>Jane Smith</name>
        <email>jane.smith@email.com</email>
        <city>Austin</city>
        <lifetime_value>3847.50</lifetime_value>
    </customer>
    <customer id="CUST_002" premium="false">
        <name>Bob Johnson</name>
        <email>bob.j@email.com</email>
        <city>Seattle</city>
        <lifetime_value>892.00</lifetime_value>
    </customer>
    <customer id="CUST_003" premium="true">
        <name>Alice Williams</name>
        <email>alice.w@email.com</email>
        <city>Chicago</city>
        <lifetime_value>2341.75</lifetime_value>
    </customer>
</customers>
"""

root = ET.fromstring(xml_data)

records = []
for customer in root.findall("customer"):
    record = {
        # Extract attributes
        "customer_id":    customer.get("id"),
        "is_premium":     customer.get("premium") == "true",

        # Extract child element text (findtext() returns None if the child is missing)
        "name":           customer.findtext("name"),
        "email":          customer.findtext("email"),
        "city":           customer.findtext("city"),
        "lifetime_value": customer.findtext("lifetime_value")
    }
    records.append(record)

df = pd.DataFrame(records)

# Cast types
df["lifetime_value"] = pd.to_numeric(df["lifetime_value"])

print(df)
print(df.dtypes)

Note: element.findtext("tag") is a convenient shorthand for element.find("tag").text that returns None instead of raising AttributeError when the element doesn’t exist. If the element exists but is empty, it returns an empty string.
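The possible outcomes of findtext() are worth seeing side by side, since missing and empty elements behave differently. A minimal sketch on a made-up customer fragment:

```python
import xml.etree.ElementTree as ET

customer = ET.fromstring("<customer><name>Jane Smith</name><phone/></customer>")

present = customer.findtext("name")                # 'Jane Smith'
missing = customer.findtext("email")               # None: no such element
fallback = customer.findtext("email", "unknown")   # 'unknown': explicit default
empty = customer.findtext("phone")                 # '': element exists, no text

print(repr(present), repr(missing), repr(fallback), repr(empty))
```

When building DataFrames, you may want to normalize both None and the empty string to a single missing-value marker before casting types.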

Pattern 2: Nested Elements Flattened into Columns

When child elements contain nested sub-elements you need as separate columns:

Python
xml_products = """
<catalog>
    <product id="PROD_001">
        <name>Wireless Headphones</name>
        <price currency="USD">149.99</price>
        <specs>
            <spec name="battery_life">40 hours</spec>
            <spec name="connectivity">Bluetooth 5.0</spec>
            <spec name="weight">250g</spec>
        </specs>
        <tags>
            <tag>audio</tag>
            <tag>wireless</tag>
        </tags>
    </product>
    <product id="PROD_002">
        <name>USB-C Hub</name>
        <price currency="USD">49.99</price>
        <specs>
            <spec name="ports">7</spec>
            <spec name="hdmi_output">4K@30Hz</spec>
        </specs>
        <tags>
            <tag>usb</tag>
            <tag>hub</tag>
        </tags>
    </product>
</catalog>
"""

root = ET.fromstring(xml_products)

records = []
for product in root.findall("product"):
    # Build base record from attributes and simple children
    record = {
        "product_id": product.get("id"),
        "name":       product.findtext("name"),
        "price":      float(product.findtext("price", "0")),
        "currency":   product.find("price").get("currency") if product.find("price") is not None else None,
    }

    # Flatten key-value spec elements into separate columns
    specs_elem = product.find("specs")
    if specs_elem is not None:
        for spec in specs_elem.findall("spec"):
            col_name = f"spec_{spec.get('name')}"
            record[col_name] = spec.text

    # Collect array-like elements into a joined string
    tags = [tag.text for tag in product.findall("tags/tag") if tag.text]  # skip empty <tag/> elements
    record["tags"] = "|".join(tags)   # Join as pipe-separated string

    records.append(record)

df_products = pd.DataFrame(records)
print(df_products)
print(df_products.columns.tolist())
# ['product_id', 'name', 'price', 'currency',
#  'spec_battery_life', 'spec_connectivity', 'spec_weight',
#  'spec_ports', 'spec_hdmi_output', 'tags']

Pattern 3: Hierarchical Data — One Row Per Leaf

When the XML has parent-child relationships and you need one row per child (like orders → line items):

Python
xml_orders = """
<orders>
    <order id="ORD_1001" customer_id="CUST_001" date="2024-09-01" status="delivered">
        <item product_id="PROD_001" quantity="1" unit_price="149.99"/>
        <item product_id="PROD_007" quantity="2" unit_price="39.99"/>
    </order>
    <order id="ORD_1002" customer_id="CUST_002" date="2024-09-03" status="shipped">
        <item product_id="PROD_002" quantity="1" unit_price="49.99"/>
        <item product_id="PROD_003" quantity="1" unit_price="89.99"/>
        <item product_id="PROD_006" quantity="3" unit_price="29.99"/>
    </order>
</orders>
"""

root = ET.fromstring(xml_orders)

# One row per order item, with parent order fields repeated
line_items = []
for order in root.findall("order"):
    order_id      = order.get("id")
    customer_id   = order.get("customer_id")
    order_date    = order.get("date")
    status        = order.get("status")

    for item in order.findall("item"):
        quantity   = int(item.get("quantity", 0))
        unit_price = float(item.get("unit_price", 0))

        line_items.append({
            "order_id":    order_id,
            "customer_id": customer_id,
            "order_date":  order_date,
            "status":      status,
            "product_id":  item.get("product_id"),
            "quantity":    quantity,
            "unit_price":  unit_price,
            "line_total":  quantity * unit_price
        })

df_items = pd.DataFrame(line_items)
df_items["order_date"] = pd.to_datetime(df_items["order_date"])

print(df_items)
print(f"\nTotal order value: ${df_items['line_total'].sum():.2f}")
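
Output (first three of the five rows):

order_id | customer_id | order_date | status | product_id | quantity | unit_price | line_total
ORD_1001 | CUST_001 | 2024-09-01 | delivered | PROD_001 | 1 | 149.99 | 149.99
ORD_1001 | CUST_001 | 2024-09-01 | delivered | PROD_007 | 2 | 39.99 | 79.98
ORD_1002 | CUST_002 | 2024-09-03 | shipped | PROD_002 | 1 | 49.99 | 49.99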

Handling XML Namespaces

Namespaces are the XML feature that most often trips people up. They exist so that vocabularies from multiple sources (e.g., SOAP responses, XBRL financial data, HL7 healthcare records) can be combined in one document without tag name collisions.

What Namespaces Look Like

XML
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://example.com/catalog/v2"
         xmlns:meta="http://example.com/metadata"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    <meta:created>2024-09-15</meta:created>
    <meta:author>Jane Smith</meta:author>

    <product id="PROD_001">
        <name>Wireless Headphones</name>
        <price xsi:type="decimal">149.99</price>
    </product>

</catalog>

The xmlns attributes declare namespaces. In the parsed tree, ElementTree prepends the full namespace URI (in curly braces) to every tag name:

Python
import xml.etree.ElementTree as ET

tree = ET.parse("data/catalog_ns.xml")
root = tree.getroot()

# With namespace, the root tag is NOT 'catalog'
print(root.tag)
# '{http://example.com/catalog/v2}catalog'

# Finding elements fails without the namespace:
products = root.findall("product")
print(products)    # [] — empty! Because the tag is actually '{...}product'

# Must include the namespace URI in curly braces:
NS = "http://example.com/catalog/v2"
products = root.findall(f"{{{NS}}}product")
print(len(products))   # 1 (the sample document above contains one <product>)

The Clean Solution: Namespace Dictionaries

Define a namespace dictionary and use prefixes in your queries:

Python
import xml.etree.ElementTree as ET

tree = ET.parse("data/catalog_ns.xml")
root = tree.getroot()

# Define namespace prefixes for use in XPath
namespaces = {
    "cat":  "http://example.com/catalog/v2",
    "meta": "http://example.com/metadata",
    "xsi":  "http://www.w3.org/2001/XMLSchema-instance"
}

# Use prefix:tag syntax — much cleaner than {uri}tag
products = root.findall("cat:product", namespaces)
print(len(products))    # 1 (the sample document above contains one <product>)

for product in products:
    name  = product.findtext("cat:name", namespaces=namespaces)
    price = product.findtext("cat:price", namespaces=namespaces)
    print(f"{name}: {price}")

# Get the metadata elements (different namespace)
created = root.findtext("meta:created", namespaces=namespaces)
print(f"Created: {created}")
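The same lookups can be tried without a file on disk. The sketch below inlines a trimmed copy of the sample document and also reads a namespaced attribute, which the prefix dictionary does not cover: attribute keys are stored in expanded {uri}name form.

```python
import xml.etree.ElementTree as ET

xml_ns = """
<catalog xmlns="http://example.com/catalog/v2"
         xmlns:meta="http://example.com/metadata"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <meta:created>2024-09-15</meta:created>
    <product id="PROD_001">
        <name>Wireless Headphones</name>
        <price xsi:type="decimal">149.99</price>
    </product>
</catalog>
"""
root = ET.fromstring(xml_ns)

ns = {
    "cat": "http://example.com/catalog/v2",
    "meta": "http://example.com/metadata",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}

price = root.find("cat:product/cat:price", ns)
created = root.findtext("meta:created", namespaces=ns)

# Unprefixed attributes such as id are NOT placed in the default namespace
product_id = root.find("cat:product", ns).get("id")

# Prefixed attributes are stored under their expanded {uri}name key
price_type = price.get(f"{{{ns['xsi']}}}type")

print(price.text, created, product_id, price_type)
```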

Extracting Namespaces Automatically

Python
import xml.etree.ElementTree as ET
import re

def extract_namespaces(xml_file: str) -> dict:
    """
    Extract all namespace declarations from an XML file
    and return as a prefix → URI dictionary.

    Useful when you don't know the namespaces in advance.
    """
    namespaces = {}
    # Parse namespace declarations from the raw XML text
    with open(xml_file, "r", encoding="utf-8") as f:
        content = f.read()

    # Extract xmlns declarations (prefixes may contain hyphens or dots, e.g. us-gaap)
    ns_pattern = r'xmlns(?::([\w.-]+))?\s*=\s*["\']([^"\']+)["\']'
    for match in re.finditer(ns_pattern, content):
        prefix = match.group(1) or "default"
        uri    = match.group(2)
        namespaces[prefix] = uri

    return namespaces

# Usage
ns = extract_namespaces("data/catalog_ns.xml")
print(ns)
# {'default': 'http://example.com/catalog/v2',
#  'meta':    'http://example.com/metadata',
#  'xsi':     'http://www.w3.org/2001/XMLSchema-instance'}

# Rename 'default' to something meaningful
ns["cat"] = ns.pop("default")
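A regex over the raw text can be fooled by comments or CDATA sections. As an alternative sketch, the parser itself can report namespace declarations through iterparse’s "start-ns" events. The function name collect_namespaces is hypothetical, and the sample is an in-memory stand-in for a real file.

```python
import io
import xml.etree.ElementTree as ET

def collect_namespaces(source) -> dict:
    """Map prefix -> URI for every namespace declared in the document.

    The default namespace is reported under the empty-string prefix.
    """
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(source, events=["start-ns"]):
        namespaces.setdefault(prefix, uri)   # keep the first URI seen per prefix
    return namespaces

sample = io.BytesIO(
    b'<catalog xmlns="http://example.com/catalog/v2"\n'
    b'         xmlns:meta="http://example.com/metadata">\n'
    b'    <meta:created>2024-09-15</meta:created>\n'
    b'</catalog>'
)

ns = collect_namespaces(sample)
ns["cat"] = ns.pop("")   # give the default namespace a usable prefix
print(ns)
```

The source argument can also be a file path. Note that this reads the whole document once, so for very large files it adds a full parsing pass.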

Stripping Namespaces for Simple Cases

When you just want to extract data and don’t care about namespace semantics, strip them:

Python
import xml.etree.ElementTree as ET
import re

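def strip_namespaces(xml_string: str) -> str:
    """Remove namespace declarations and prefixes from an XML string.

    A regex-based shortcut: it does not distinguish markup from text content.
    """
    # Remove xmlns declarations (single- or double-quoted)
    xml_string = re.sub(r'\s+xmlns(?::[\w.-]+)?\s*=\s*("[^"]*"|\'[^\']*\')', "", xml_string)
    # Remove namespace prefixes from element tags: <meta:created> -> <created>
    xml_string = re.sub(r'<(/?)[\w.-]+:', r'<\1', xml_string)
    # Remove namespace prefixes from attribute names: xsi:type="..." -> type="..."
    # (without this, the parser fails with an "unbound prefix" error)
    xml_string = re.sub(r'(\s)[\w.-]+:([\w.-]+\s*=)', r'\1\2', xml_string)
    return xml_string

# Use on small XML strings for quick extraction
# (raw_xml_with_namespaces stands for an XML string you have already loaded)
clean_xml = strip_namespaces(raw_xml_with_namespaces)
root = ET.fromstring(clean_xml)
# Now tags are namespace-free: root.findall("product") works normally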

Note: Only strip namespaces when you control the XML and know it’s safe to do so. In production systems working with standard XML vocabularies (HL7, XBRL), preserve namespaces for correctness.
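On Python 3.8 and later there is a way to ignore namespaces without rewriting the XML text: ElementTree accepts {*} as a namespace wildcard in find(), findall(), and findtext(). A minimal sketch on a trimmed copy of the sample document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("""
<catalog xmlns="http://example.com/catalog/v2"
         xmlns:meta="http://example.com/metadata">
    <meta:created>2024-09-15</meta:created>
    <product id="PROD_001">
        <name>Wireless Headphones</name>
    </product>
</catalog>
""")

# {*}tag matches <tag> in any namespace (or in none)
products = root.findall("{*}product")
names = [p.findtext("{*}name") for p in products]
created = root.findtext("{*}created")

# The local name of a tag can be recovered by dropping the {uri} part
local_root_tag = root.tag.rpartition("}")[2]

print(names, created, local_root_tag)
```

This keeps the parsed tree faithful to the source document, so the namespace information is still available if you need it later.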

lxml: The Power Parser

lxml is a fast, full-featured XML/HTML library that wraps the libxml2 C library. It’s API-compatible with ElementTree but adds full XPath 1.0 support, XSLT transformations, schema validation, and significantly better performance on large files.

Bash
pip install lxml

lxml vs. ElementTree: When to Use Each

Feature | ElementTree | lxml
Installation | Built-in, no install | Requires pip install lxml
Speed | Good | Typically several times faster on large files
XPath support | Subset only | Full XPath 1.0
XSLT | No | Yes
XML schema validation | No | Yes (XSD)
HTML parsing | Limited | Full (via lxml.html)
API compatibility | Standard | Largely ElementTree-compatible

Use ElementTree for: simple files, standard formats, production code requiring no extra dependencies. Use lxml for: large files, complex XPath queries, namespace-heavy data, XBRL/HL7/SOAP parsing.
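Because the two APIs overlap, code that sticks to the shared subset can prefer lxml when it is installed and fall back to the standard library otherwise. A small sketch of that import pattern (the USING_LXML flag is an illustrative name):

```python
try:
    from lxml import etree as ET          # faster, full XPath 1.0
    USING_LXML = True
except ImportError:
    import xml.etree.ElementTree as ET    # always available
    USING_LXML = False

# Only use calls that both libraries share: fromstring, find, findall, get, text
root = ET.fromstring(b"<catalog><product id='PROD_001'/></catalog>")
ids = [p.get("id") for p in root.findall("product")]

print(ids, "lxml" if USING_LXML else "ElementTree")
```

Anything lxml-specific, such as root.xpath(), getparent(), or XSLT, should be guarded by the flag or avoided in code that must run with either library.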

lxml Basics

Python
from lxml import etree

# Parse from file
tree = etree.parse("data/catalog.xml")
root = tree.getroot()

# Parse from string
root = etree.fromstring(xml_string.encode("utf-8"))

# API is mostly identical to ElementTree
print(root.tag)
products = root.findall("product")

# lxml-specific: XPath method on elements
prices = root.xpath("//price")           # All <price> elements anywhere
price_texts = root.xpath("//price/text()") # Text of all <price> elements
featured = root.xpath("//product[@featured='true']")

# XPath with predicates and functions
expensive = root.xpath("//product[number(price) > 100]")
names = root.xpath("//product/name/text()")
print(names)    # ['Wireless Headphones', 'USB-C Hub']

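# XPath with namespaces (applies to a namespaced document such as catalog_ns.xml;
# on the un-namespaced catalog.xml these queries return empty lists)
ns = {"cat": "http://example.com/catalog/v2"}
products = root.xpath("//cat:product", namespaces=ns)
names = root.xpath("//cat:product/cat:name/text()", namespaces=ns)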

Full XPath 1.0 Expressions in lxml

Python
from lxml import etree

root = etree.parse("data/catalog.xml").getroot()

# All nodes anywhere in the document (// = anywhere)
all_specs = root.xpath("//spec")

# Attribute selection
all_ids = root.xpath("//product/@id")           # Get attribute values
print(all_ids)   # ['PROD_001', 'PROD_002']

# Text extraction
all_names = root.xpath("//product/name/text()")
all_prices = root.xpath("//price/text()")

# Conditional selection
expensive = root.xpath("//product[price > 100]")
specific   = root.xpath("//product[@id='PROD_001']")

# Count
product_count = root.xpath("count(//product)")
print(int(product_count))    # 2

# String functions
# Products whose name contains 'USB'
usb_products = root.xpath("//product[contains(name, 'USB')]")

# Normalize-space to handle whitespace
names_clean = root.xpath("//name[normalize-space(.) != '']")

# Position functions
first = root.xpath("//product[position()=1]")
last  = root.xpath("//product[last()]")

# Combining multiple conditions
# Featured products with price over 100
premium_expensive = root.xpath("//product[@featured='true' and price > 100]")

Streaming Large XML Files: iterparse

For large XML files (hundreds of megabytes or gigabytes), loading the full document into memory with parse() is impractical. Both ElementTree and lxml provide iterparse() — an event-driven, streaming parser that processes the document element by element.

iterparse Basics

Python
import xml.etree.ElementTree as ET
from typing import Optional

import pandas as pd   # used below to build the DataFrame

def stream_xml_records(filepath: str,
                        record_tag: str,
                        fields: dict,
                        max_records: Optional[int] = None) -> list:
    """
    Stream-parse a large XML file and extract records efficiently.

    Processes the file without loading it entirely into memory by
    using iterparse to parse event-by-event.

    Parameters
    ----------
    filepath : str
        Path to the XML file.
    record_tag : str
        The XML tag that represents one record (e.g., 'product', 'order').
    fields : dict
        Mapping of {output_column_name: xpath_expression}.
        Simple path: 'name' (child text), '@id' (attribute).
    max_records : int, optional
        Stop after this many records (for testing).

    Returns
    -------
    list of dict
        Extracted records.
    """
    records = []
    count = 0

    # iterparse yields (event, element) tuples
    # 'end' events fire after an element and all its children have been parsed
    for event, elem in ET.iterparse(filepath, events=["end"]):

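        if elem.tag != record_tag:
            # Do NOT clear here: child elements (name, price, ...) fire their
            # 'end' event before the parent record does, so clearing them now
            # would erase the text we extract below.
            continue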

        # Extract this record
        record = {}
        for col_name, field_spec in fields.items():
            if field_spec.startswith("@"):
                # Attribute
                record[col_name] = elem.get(field_spec[1:])
            else:
                # Child element text
                child = elem.find(field_spec)
                record[col_name] = child.text if child is not None else None

        records.append(record)
        count += 1

        # CRUCIAL: clear the finished record (children, text, attributes) to free memory.
        # The emptied element itself stays attached to its parent.
        elem.clear()

        if max_records and count >= max_records:
            break

        if count % 100_000 == 0:
            print(f"Processed {count:,} records...")

    print(f"Total records: {len(records):,}")
    return records


# Parse a 2GB product catalog with millions of entries
fields = {
    "product_id":   "@id",
    "name":         "name",
    "price":        "price",
    "category":     "category"
}

records = stream_xml_records(
    "data/large_catalog.xml",
    record_tag="product",
    fields=fields
)
df = pd.DataFrame(records)
df["price"] = pd.to_numeric(df["price"], errors="coerce")

The Memory Management Pattern

The most important thing about iterparse is clearing elements after processing to prevent memory from accumulating:

Python
import xml.etree.ElementTree as ET

# WRONG: memory grows with every element
for event, elem in ET.iterparse("large_file.xml", events=["end"]):
    if elem.tag == "record":
        process(elem)   # process() stands for your own handler
        # Without clear(), elem stays in memory even though we're done with it!

# CORRECT: memory stays bounded
for event, elem in ET.iterparse("large_file.xml", events=["end"]):
    if elem.tag == "record":
        process(elem)
        elem.clear()   # Free memory immediately after processing

# lxml only: also delete already-processed siblings from the parent, because
# clear() leaves an empty element attached. getprevious() and getparent()
# do not exist in xml.etree.ElementTree.
from lxml import etree

for event, elem in etree.iterparse("large_file.xml", events=("end",), tag="record"):
    process(elem)
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
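With the standard library alone, a common way to get the same effect is to keep a reference to the root element and clear it after each record. The sketch below assumes records are direct children of the root; iter_records is a hypothetical helper, and the in-memory sample stands in for a large file.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, record_tag: str):
    """Yield one dict per record element while keeping the tree small."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)      # first event: start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield {"id": elem.get("id"), "name": elem.findtext("name")}
            elem.clear()              # drop the record's children and text
            root.clear()              # detach processed records from the root

sample = io.BytesIO(
    b"<catalog>"
    b"<product id='PROD_001'><name>Wireless Headphones</name></product>"
    b"<product id='PROD_002'><name>USB-C Hub</name></product>"
    b"</catalog>"
)

records = list(iter_records(sample, "product"))
print(records)
```

Writing the parser as a generator means the caller decides how many records to hold at once, for example by building a DataFrame in chunks rather than from one large list.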

Real-World Example: Parsing an RSS Feed

RSS feeds are a common XML format — here’s a complete parser:

Python
import xml.etree.ElementTree as ET
import pandas as pd
import requests

def parse_rss_feed(url: str, max_items: int = None) -> pd.DataFrame:
    """
    Parse an RSS feed and return articles as a DataFrame.

    Parameters
    ----------
    url : str
        URL of the RSS feed (must return valid RSS/Atom XML).
    max_items : int, optional
        Maximum number of items to return.

    Returns
    -------
    pd.DataFrame
        Articles with title, link, description, pub_date, and author.
    """
    response = requests.get(url, timeout=15)
    response.raise_for_status()

    root = ET.fromstring(response.content)

    # RSS structure: <rss> → <channel> → <item>
    channel = root.find("channel")
    if channel is None:
        # Try Atom format: <feed> → <entry>
        ns = {"atom": "http://www.w3.org/2005/Atom"}
        entries = root.findall("atom:entry", ns)
        items = entries
        is_atom = True
    else:
        items = channel.findall("item")
        is_atom = False

    records = []
    for item in items[:max_items]:
        if is_atom:
            ns = {"atom": "http://www.w3.org/2005/Atom"}
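            record = {
                "title":       item.findtext("atom:title", namespaces=ns),
                # Atom stores the URL in <link href="...">; <id> is only an identifier
                "link":        (item.find("atom:link", ns).get("href")
                                if item.find("atom:link", ns) is not None else None),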
                "description": item.findtext("atom:summary", namespaces=ns),
                "pub_date":    item.findtext("atom:published", namespaces=ns),
                "author":      item.findtext("atom:author/atom:name", namespaces=ns)
            }
        else:
            # Standard RSS 2.0
            dc_ns = "http://purl.org/dc/elements/1.1/"
            record = {
                "title":       item.findtext("title"),
                "link":        item.findtext("link"),
                "description": item.findtext("description"),
                "pub_date":    item.findtext("pubDate"),
                "author":      (item.findtext("author") or
                                item.findtext(f"{{{dc_ns}}}creator"))
            }

        records.append(record)

    df = pd.DataFrame(records)

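    # Clean description text (often contains HTML)
    import html

    def strip_markup(text):
        """Drop tags when the fragment is well-formed; otherwise keep the raw text."""
        if not text or "<" not in text:
            return text
        try:
            fragment = ET.fromstring(f"<d>{text}</d>")
        except ET.ParseError:
            return text   # Real-world HTML (<br>, &nbsp;) is often not valid XML
        return html.unescape("".join(fragment.itertext()))

    if "description" in df.columns:
        df["description"] = df["description"].apply(strip_markup)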

    # Parse dates (skipped when the feed had no items and the column is absent)
    if "pub_date" in df.columns:
        for date_format in ["%a, %d %b %Y %H:%M:%S %z",
                             "%a, %d %b %Y %H:%M:%S %Z",
                             "%Y-%m-%dT%H:%M:%SZ"]:
            try:
                df["pub_date"] = pd.to_datetime(df["pub_date"], format=date_format)
                break
            except (ValueError, TypeError):
                continue

    print(f"Parsed {len(df)} articles from feed")
    return df


# Usage — works with any standard RSS feed
# bbc_news = parse_rss_feed("https://feeds.bbci.co.uk/news/rss.xml", max_items=20)
# print(bbc_news[["title", "pub_date"]].head(10))
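The RSS extraction logic can be tried without network access. The sketch below parses a made-up feed string with the standard library only and, as an alternative to format strings, converts the RFC 822 style pubDate with email.utils.parsedate_to_datetime. The feed content and the DC prefix mapping name are illustrative.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

rss_text = """<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 02 Sep 2024 10:30:00 GMT</pubDate>
      <dc:creator>Jane Smith</dc:creator>
    </item>
  </channel>
</rss>
"""

DC = {"dc": "http://purl.org/dc/elements/1.1/"}
root = ET.fromstring(rss_text)

articles = []
for item in root.findall("channel/item"):
    articles.append({
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        # <author> is optional in RSS; many feeds use Dublin Core <dc:creator> instead
        "author": item.findtext("author") or item.findtext("dc:creator", namespaces=DC),
        # RSS dates follow the email date format, so the email parser handles them
        "pub_date": parsedate_to_datetime(item.findtext("pubDate")),
    })

print(articles[0]["title"], articles[0]["pub_date"].isoformat())
```

A list of dictionaries like this can be passed straight to pd.DataFrame, as in the patterns earlier in the article.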

Real-World Example: XBRL Financial Data

XBRL (eXtensible Business Reporting Language) is XML-based financial reporting used by the SEC and other regulatory bodies. It’s namespace-heavy, which makes lxml the practical choice for parsing it:

Python
from lxml import etree
import pandas as pd

def parse_xbrl_facts(filepath: str) -> pd.DataFrame:
    """
    Extract financial facts from an XBRL instance document.
    Returns a DataFrame with element name, value, context, and unit.
    """
    tree = etree.parse(filepath)
    root = tree.getroot()

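    # Common XBRL namespaces, listed for reference only (not used below).
    # Taxonomy URIs change with each release year, so inspect root.nsmap
    # for the namespaces a given filing actually declares.
    ns = {
        "xbrl": "http://www.xbrl.org/2003/instance",
        "us-gaap": "http://fasb.org/us-gaap/2023",
        "dei":  "http://xbrl.sec.gov/dei/2023"
    }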

    facts = []

    # Walk every element, skipping XBRL structural elements (contexts,
    # units, and their children). iter("*") yields elements only; a bare
    # iter() also yields comments and processing instructions, which
    # etree.QName() cannot handle.
    for elem in root.iter("*"):
        tag = etree.QName(elem)
        if tag.namespace in (
            "http://www.xbrl.org/2003/instance",
            "http://www.w3.org/2001/XMLSchema-instance"
        ):
            continue

        if elem.text and elem.text.strip():
            facts.append({
                "namespace":   tag.namespace,
                "element":     tag.localname,
                "value":       elem.text.strip(),
                "context_ref": elem.get("contextRef"),
                "unit_ref":    elem.get("unitRef"),
                "decimals":    elem.get("decimals")
            })

    df = pd.DataFrame(facts)
    df["value_numeric"] = pd.to_numeric(df["value"], errors="coerce")
    return df
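
The context_ref column only becomes useful once it is resolved to a reporting period. The sketch below shows one way to do that lookup, using the standard library's ElementTree so it runs without lxml. The inline document, the http://example.com/taxonomy namespace, and the context_periods helper are made up for illustration; real filings declare many more contexts and also use segment dimensions, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

XBRLI = "http://www.xbrl.org/2003/instance"
NS = {"xbrli": XBRLI}

# Hypothetical, heavily trimmed instance document
SAMPLE = f"""<xbrli:xbrl xmlns:xbrli="{XBRLI}" xmlns:ex="http://example.com/taxonomy">
  <xbrli:context id="FY2023">
    <xbrli:entity><xbrli:identifier scheme="http://example.com">0001</xbrli:identifier></xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2023-01-01</xbrli:startDate>
      <xbrli:endDate>2023-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <xbrli:context id="AsOf2023">
    <xbrli:entity><xbrli:identifier scheme="http://example.com">0001</xbrli:identifier></xbrli:entity>
    <xbrli:period><xbrli:instant>2023-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <ex:Revenues contextRef="FY2023" unitRef="USD" decimals="0">1000</ex:Revenues>
  <ex:Assets contextRef="AsOf2023" unitRef="USD" decimals="0">5000</ex:Assets>
</xbrli:xbrl>"""


def context_periods(root):
    """Map each context id to a (start, end) tuple; instants repeat the date."""
    periods = {}
    for ctx in root.findall("xbrli:context", NS):
        instant = ctx.findtext("xbrli:period/xbrli:instant", namespaces=NS)
        if instant:
            periods[ctx.get("id")] = (instant, instant)
        else:
            periods[ctx.get("id")] = (
                ctx.findtext("xbrli:period/xbrli:startDate", namespaces=NS),
                ctx.findtext("xbrli:period/xbrli:endDate", namespaces=NS),
            )
    return periods


root = ET.fromstring(SAMPLE)
periods = context_periods(root)

facts = []
for elem in root:
    ref = elem.get("contextRef")
    if ref is None:          # contexts and units carry no contextRef
        continue
    start, end = periods.get(ref, (None, None))
    facts.append({
        "element": elem.tag.split("}", 1)[-1],   # drop the {uri} prefix
        "value": float(elem.text),
        "period_start": start,
        "period_end": end,
    })

print(facts)
```

Filtering on the presence of contextRef is a design choice in this sketch: it selects reported items directly rather than excluding structural namespaces.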

Writing XML in Python

For completeness, here is how to build XML documents programmatically:

Python
import xml.etree.ElementTree as ET

def build_product_catalog(products: list) -> str:
    """
    Build an XML catalog document from a list of product dicts.
    Returns the XML as a formatted string.
    """
    root = ET.Element("catalog", version="2.0")

    for p in products:
        product_elem = ET.SubElement(root, "product",
                                      id=p["product_id"],
                                      featured=str(p.get("featured", False)).lower())

        ET.SubElement(product_elem, "name").text       = p["name"]
        ET.SubElement(product_elem, "category").text   = p.get("category", "")

        price_elem = ET.SubElement(product_elem, "price", currency="USD")
        price_elem.text = str(p["price"])

        if p.get("tags"):
            tags_elem = ET.SubElement(product_elem, "tags")
            for tag in p["tags"]:
                ET.SubElement(tags_elem, "tag").text = tag

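    # Format output (indent requires Python 3.9+)
    ET.indent(root, space="    ")

    # With encoding="unicode", tostring(xml_declaration=True) declares the
    # locale's preferred encoding, which may not be UTF-8. Write the
    # declaration by hand so it matches how the file is saved below.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body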


# Build and save
products = [
    {"product_id": "PROD_001", "name": "Wireless Headphones",
     "category": "Electronics", "price": 149.99,
     "featured": True, "tags": ["audio", "wireless"]},
    {"product_id": "PROD_002", "name": "USB-C Hub",
     "category": "Electronics", "price": 49.99,
     "featured": False, "tags": ["usb", "hub"]},
]

xml_output = build_product_catalog(products)

import os

os.makedirs("output", exist_ok=True)   # make sure the target folder exists
with open("output/catalog.xml", "w", encoding="utf-8") as f:
    f.write(xml_output)

print(xml_output[:500])
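
One reason to build XML through ElementTree rather than string formatting is that special characters are escaped for you. The following sketch, using a made-up product name, serializes text containing & and < and then parses it back to confirm the round trip.

```python
import xml.etree.ElementTree as ET

# Hypothetical value containing characters that are special in XML
raw_name = "Cables & Adapters <2 m>"

product = ET.Element("product", id="PROD_003")
ET.SubElement(product, "name").text = raw_name

# The serializer replaces &, < and > in text with entity references
serialized = ET.tostring(product, encoding="unicode")
print(serialized)

# Parsing the output restores the original text exactly
round_tripped = ET.fromstring(serialized).findtext("name")
print(round_tripped == raw_name)
```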

pandas read_xml(): Direct XML to DataFrame

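Since version 1.3, pandas has included pd.read_xml(), which reads flat or mildly nested XML directly into a DataFrame. Its default parser is lxml, so that library must be installed unless you pass parser="etree":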

Python
import pandas as pd
from io import StringIO

# Read XML with a flat structure
df = pd.read_xml("data/customers.xml")

# Specify the element to use as rows with xpath
df = pd.read_xml(
    "data/catalog.xml",
    xpath="//product"             # XPath to the repeating element
)

# With namespaces: map a prefix to the URI and use it in the XPath
df = pd.read_xml(
    "data/ns_catalog.xml",
    xpath="//cat:product",
    namespaces={"cat": "http://example.com/catalog/v2"}
)

# From a string: wrap it in StringIO (passing a literal XML string
# directly is deprecated since pandas 2.1)
xml_string = "<customers><customer><id>1</id><name>Ada</name></customer></customers>"
df = pd.read_xml(StringIO(xml_string), xpath="//customer")

# Specify parser: "lxml" is the default; "etree" avoids the lxml
# dependency but supports only limited XPath
df = pd.read_xml("data/catalog.xml", parser="etree", xpath=".//product")

print(df.head())

pd.read_xml() works well for documents with a clean, flat repeated element structure. For documents with heavy nesting, mixed content, or complex namespace handling, the manual ElementTree approach gives more control.
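
When a document nests repeated children or carries values in attributes of child elements, the manual route can look like the following sketch. The sample catalog and the column names are assumptions for illustration. The resulting list of dicts can be passed straight to pd.DataFrame(records).

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog with nested <tags> and an attribute on <price>
SAMPLE_CATALOG = """<catalog>
  <product id="PROD_001">
    <name>Wireless Headphones</name>
    <price currency="USD">149.99</price>
    <tags><tag>audio</tag><tag>wireless</tag></tags>
  </product>
  <product id="PROD_002">
    <name>USB-C Hub</name>
    <price currency="EUR">49.99</price>
  </product>
</catalog>"""

root = ET.fromstring(SAMPLE_CATALOG)

records = []
for product in root.findall("product"):
    price_elem = product.find("price")
    has_price = price_elem is not None
    records.append({
        "product_id": product.get("id"),
        "name": product.findtext("name"),
        "price": float(price_elem.text) if has_price else None,
        "currency": price_elem.get("currency") if has_price else None,
        # Collapse the repeated <tag> children into one delimited column
        "tags": "|".join(t.text for t in product.findall("tags/tag") if t.text),
    })

for row in records:
    print(row)
# With pandas installed: df = pd.DataFrame(records)
```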

Best Practices for XML Parsing in Data Science

Always Use findtext() Instead of find().text

Python
# FRAGILE: AttributeError if element doesn't exist
name = product.find("name").text

# SAFE: Returns None (or default) if element not found
name = product.findtext("name")
name = product.findtext("name", default="Unknown")
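
One subtlety worth knowing: the default passed to findtext() only applies when the element is missing. An element that exists but is empty yields an empty string instead. The sketch below illustrates the difference; text_or_default is a hypothetical helper, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

product = ET.fromstring("<product><name/><sku>  A-100 </sku></product>")

missing = product.findtext("brand", default="Unknown")   # element absent
empty = product.findtext("name", default="Unknown")      # element present, no text


def text_or_default(parent, path, default=None):
    """Return stripped text, treating missing and blank elements alike."""
    value = parent.findtext(path)
    if value is None or not value.strip():
        return default
    return value.strip()


print(repr(missing))
print(repr(empty))
print(repr(text_or_default(product, "name", "Unknown")))
print(repr(text_or_default(product, "sku")))
```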

Use get() for Attributes

Python
# FRAGILE: KeyError if attribute doesn't exist
product_id = product.attrib["id"]

# SAFE: Returns None or default if attribute not found
product_id = product.get("id")
product_id = product.get("id", "UNKNOWN")
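
Attribute values always come back as strings, so numeric or boolean attributes need an explicit conversion. A minimal sketch, with hypothetical attr_bool and attr_int helpers and made-up attribute names:

```python
import xml.etree.ElementTree as ET

product = ET.fromstring('<product id="PROD_001" featured="true" stock="12"/>')


def attr_bool(elem, name, default=False):
    """Interpret 'true' or '1' (case-insensitive) as True; missing -> default."""
    value = elem.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("true", "1")


def attr_int(elem, name, default=None):
    """Convert an attribute to int; missing or non-numeric -> default."""
    try:
        return int(elem.get(name, ""))
    except ValueError:
        return default


featured = attr_bool(product, "featured")
stock = attr_int(product, "stock")
weight = attr_int(product, "weight", default=0)   # attribute not present
print(featured, stock, weight)
```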

Validate Structure Before Processing

Python
import xml.etree.ElementTree as ET

def safe_extract_products(root: ET.Element) -> list:
    """Extract products with validation before processing."""
    if root.tag not in ("catalog", "products"):
        raise ValueError(f"Unexpected root tag: {root.tag}")

    products = root.findall("product")
    if not products:
        raise ValueError("No <product> elements found in document")

    required_children = {"name", "price"}

    records = []
    for i, product in enumerate(products):
        children_present = {child.tag for child in product}
        missing = required_children - children_present
        if missing:
            print(f"Warning: product {i} missing required fields: {missing}")
            continue

        records.append({
            "id":    product.get("id"),
            "name":  product.findtext("name"),
            "price": product.findtext("price")
        })

    return records
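
Structural checks like these assume the document parsed in the first place. Malformed XML raises ET.ParseError, which carries the line and column of the failure in its position attribute. The sketch below wraps parsing in a hypothetical parse_or_report helper:

```python
import xml.etree.ElementTree as ET


def parse_or_report(xml_text):
    """Return (root, None) on success or (None, message) on malformed XML."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as exc:
        line, column = exc.position
        return None, f"Malformed XML at line {line}, column {column}: {exc}"


good_root, good_err = parse_or_report("<catalog><product id='1'/></catalog>")
bad_root, bad_err = parse_or_report("<catalog><product></catalog>")  # unclosed tag

print(good_root.tag, good_err)
print(bad_root, bad_err)
```

Returning the error as a value rather than re-raising is just one option; in a batch pipeline it lets you log the bad file and move on to the next one.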

Handle Encoding Explicitly

Python
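import xml.etree.ElementTree as ET

# Preferred: pass a path (or bytes) so the parser honors the encoding
# named in the XML declaration
root = ET.parse("data/catalog.xml").getroot()

# If you read the file as text yourself, state the encoding explicitly;
# it must match the encoding the file was actually saved in
with open("data/catalog.xml", "r", encoding="utf-8") as f:
    root = ET.fromstring(f.read())

# lxml likewise detects the encoding from the XML declaration
from lxml import etree
tree = etree.parse("data/catalog.xml")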
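
To see why the declared encoding matters, the sketch below builds a small document saved as ISO-8859-1 (an assumed example) and handles it two ways. Handing the raw bytes to the parser works because it reads the declaration; decoding the same bytes as UTF-8 first fails.

```python
import xml.etree.ElementTree as ET

# Bytes as they would sit on disk for a file saved in ISO-8859-1
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<product><name>Café Crème</name></product>"
).encode("iso-8859-1")

# Passing bytes lets the parser honor the declared encoding
name = ET.fromstring(xml_bytes).findtext("name")
print(name)

# Decoding with the wrong codec fails before parsing even starts
try:
    xml_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```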

Summary

XML parsing in Python has two primary tools: the built-in xml.etree.ElementTree module for standard use cases and the lxml library for large files, full XPath support, and namespace-heavy documents. Both work with the same fundamental concept — a tree of Element objects with .tag, .attrib, .text, and child relationships — making skills transfer directly between the two.

The most important practical skills are: finding elements with find() and findall(), extracting text safely with findtext(), extracting attributes safely with .get(), using namespace dictionaries to query namespace-qualified documents, flattening nested XML structures into pandas DataFrames using record-by-record iteration, and streaming large files with iterparse to avoid memory exhaustion.

XML will never be as convenient as JSON for data science work — its verbosity, namespace complexity, and lack of native array handling add friction. But when you encounter it in healthcare, finance, enterprise systems, or government data, the techniques in this article give you everything you need to extract clean, analyzable data from even the most complex XML documents.

Key Takeaways

  • ET.parse("file.xml").getroot() parses a file into an Element tree; ET.fromstring(xml_string) parses a string — both return a root Element with .tag, .attrib, .text, and child elements
  • element.find("tag") returns the first matching direct child or None; element.findall("tag") returns all matching direct children as a list; element.iter("tag") recursively finds all descendants with that tag
  • Always use element.findtext("tag") instead of element.find("tag").text; findtext() safely returns None when the element doesn’t exist instead of raising AttributeError
  • Always use element.get("attr") instead of element.attrib["attr"] for attributes — .get() safely returns None for missing attributes
  • XML namespaces appear as {uri}tag in parsed element tags — define a namespace dictionary and use findall("prefix:tag", namespaces) syntax to query namespace-qualified documents cleanly
  • lxml provides full XPath 1.0 with element.xpath("//product[@id='X']/name/text()") and is typically several times faster than ElementTree on large files, though the exact gain depends on the document and workload; prefer it for XBRL, HL7, SOAP, and other complex XML vocabularies
  • iterparse streams large XML files event-by-event without loading everything into memory — always call elem.clear() after processing each record to prevent memory accumulation
  • pd.read_xml() handles simple flat XML structures in one line; for complex nesting, build the DataFrame manually by iterating over elements and extracting fields with findtext() and .get()