XML (eXtensible Markup Language) is parsed in Python primarily using the built-in xml.etree.ElementTree module, which loads an XML document into a navigable tree of Element objects. Each element has a tag name, a dictionary of attributes, text content, and child elements. You navigate with element.find("tag") for the first matching child, element.findall("tag") for all matching children, and XPath expressions for more complex queries — then extract text with element.text and attributes with element.get("attribute_name"). For complex, namespace-heavy, or very large XML, the lxml library provides faster parsing, full XPath 1.0 support, and streaming parsers.
Introduction
While JSON has become the dominant format for modern web APIs, XML remains deeply embedded in many industries and systems that data scientists regularly encounter. Healthcare systems exchange data using HL7 standards such as CDA (XML-based) and FHIR (which supports both XML and JSON). Financial services use XBRL for reporting and FIXML, the XML encoding of the FIX protocol. Government open data portals commonly distribute data in XML. Microsoft Office documents (Word, Excel, PowerPoint) are ZIP archives of XML files. RSS and Atom feeds are XML. Enterprise software like SAP, Salesforce, and Oracle can export data in XML. If you work in any of these domains, XML parsing is a required skill.
XML and JSON share the same fundamental concept: both are text-based formats for representing hierarchical, self-describing data. But they look and behave differently. Where JSON uses curly braces and brackets to define structure, XML uses opening and closing tags. Where JSON has no formal validation mechanism in the format itself, XML has schema languages (DTD and XSD) that rigorously define what’s valid. And where JSON maps directly onto Python dicts and lists through the json module, XML has spawned multiple processing approaches (DOM, SAX, and ElementTree for parsing, plus XPath for querying), each suited to different use cases.
This article teaches you everything you need to parse XML data in Python for data science purposes. We’ll cover the XML format itself, Python’s built-in ElementTree module for standard parsing, XPath for powerful querying, the lxml library for advanced use cases, handling namespaces (XML’s most confusing feature), converting XML to pandas DataFrames, and processing large XML files without loading everything into memory.
XML Fundamentals: Structure and Syntax
Before writing parsing code, you need to understand XML’s structure. Every XML concept maps to something concrete in the Python objects you’ll work with.
Anatomy of an XML Document
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<catalog version="2.0" generated="2024-09-15">
<product id="PROD_001" featured="true">
<name>Wireless Headphones</name>
<category>Electronics</category>
<price currency="USD">149.99</price>
<description>Premium noise-cancelling headphones with 40-hour battery.</description>
<specs>
<spec name="battery_life">40 hours</spec>
<spec name="connectivity">Bluetooth 5.0</spec>
<spec name="weight">250g</spec>
</specs>
<tags>
<tag>audio</tag>
<tag>wireless</tag>
<tag>premium</tag>
</tags>
</product>
<product id="PROD_002" featured="false">
<name>USB-C Hub</name>
<category>Electronics</category>
<price currency="USD">49.99</price>
<description>7-in-1 USB-C hub with HDMI, USB 3.0, and SD card reader.</description>
<specs>
<spec name="ports">7</spec>
<spec name="hdmi_output">4K@30Hz</spec>
</specs>
<tags>
<tag>usb</tag>
<tag>hub</tag>
</tags>
</product>
</catalog>
Key XML concepts:
- Declaration: <?xml version="1.0" encoding="UTF-8"?> is an optional header specifying version and encoding
- Root element: <catalog>; every XML document has exactly one root element that contains all others
- Elements: Defined by opening (<name>) and closing (</name>) tags, or self-closing (<br/>)
- Attributes: Key-value pairs inside the opening tag: id="PROD_001", currency="USD"
- Text content: The text between opening and closing tags: Wireless Headphones
- Children: Elements nested inside another element (<specs> contains multiple <spec> elements)
- Comments: <!-- This is a comment --> (ignored by parsers)
- Tail text: Text that appears after a closing tag but before the next sibling tag (an ElementTree-specific concept we’ll cover)
XML vs. JSON: The Structural Difference
The same data in both formats illustrates the key structural differences:
{
"product": {
"id": "PROD_001",
"name": "Wireless Headphones",
"price": 149.99
}
}
<product id="PROD_001">
<name>Wireless Headphones</name>
<price currency="USD">149.99</price>
</product>
In XML, metadata can be stored as attributes (like id or currency) while the main content goes in element text. In JSON, everything is a key-value pair at the same level. This dual storage mechanism (attributes vs. text) is one reason XML requires more thought to parse: you need to know whether a value you want is an attribute or text content.
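To make the attribute/text distinction concrete, here is a minimal sketch that reads the same <product> fragment and collects both kinds of values into one flat dict. The variable names and the choice of dict keys are illustrative assumptions, not part of any standard mapping.

```python
import xml.etree.ElementTree as ET

# Same <product> fragment as in the comparison above
product = ET.fromstring(
    '<product id="PROD_001">'
    '<name>Wireless Headphones</name>'
    '<price currency="USD">149.99</price>'
    '</product>'
)

price = product.find("price")

# Attribute values are read with get() from the element that carries them
product_id = product.get("id")
currency = price.get("currency")

# Element text is a separate property and is always a string (or None)
amount = float(price.text)

# A flat dict has to decide where each kind of value goes
as_dict = {
    "id": product_id,
    "name": product.findtext("name"),
    "price": amount,
    "currency": currency,
}
print(as_dict)
```

Note that the currency, which JSON dropped in the example above, only survives because the attribute is read explicitly.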
Python’s Built-in xml.etree.ElementTree
xml.etree.ElementTree is Python’s standard library XML parser. It implements the ElementTree model — loading the entire XML document into memory as a tree of Element objects, which you then navigate and query.
Parsing an XML String or File
import xml.etree.ElementTree as ET
# Parse from a string
xml_string = """
<catalog version="2.0">
<product id="PROD_001">
<name>Wireless Headphones</name>
<price currency="USD">149.99</price>
</product>
<product id="PROD_002">
<name>USB-C Hub</name>
<price currency="USD">49.99</price>
</product>
</catalog>
"""
root = ET.fromstring(xml_string) # Parse string → root Element
# Parse from a file
tree = ET.parse("data/catalog.xml") # Parse file → ElementTree object
root = tree.getroot() # Get the root Element
print(root.tag) # 'catalog'
print(root.attrib) # {'version': '2.0'}
print(len(root)) # 2 (number of direct children)
The Element Object
Every node in the parsed XML tree is an Element object with these key properties:
# Given: <product id="PROD_001" featured="true">
product = root[0] # First child of root
# Tag name
print(product.tag) # 'product'
# Attributes dictionary
print(product.attrib) # {'id': 'PROD_001', 'featured': 'true'}
print(product.get("id")) # 'PROD_001'
print(product.get("color")) # None (safe — returns None for missing attributes)
print(product.get("color", "unknown")) # 'unknown' (with default)
# Text content (direct text between opening and closing tags)
name_elem = product.find("name")
print(name_elem.text) # 'Wireless Headphones'
# Tail text (text after the element's closing tag, before the next sibling)
# This is rarely needed but important to know about
# In <a>text</a> tail text, "tail text" is the .tail of element <a>
# Children
for child in product:
    print(f"{child.tag}: {child.text}")
# name: Wireless Headphones
# category: Electronics
# price: 149.99
# description: Premium noise-cancelling headphones...
# specs: None (or only whitespace if the file is indented) ← specs has child elements, not text
# tags: None (or only whitespace if the file is indented) ← tags has child elements, not text
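Tail text is easiest to see in mixed content, where text and elements are interleaved. The following sketch uses a made-up <p> fragment (an assumption for illustration; it is not part of the catalog file) to show where ElementTree stores each piece of text.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring("<p>Battery lasts <b>40 hours</b> per charge.</p>")
bold = para.find("b")

print(repr(para.text))   # text before the first child
print(repr(bold.text))   # text inside <b>
print(repr(bold.tail))   # text after </b>, up to the next tag

# itertext() walks text and tail in document order, so joining it
# recovers the full readable string
full_text = "".join(para.itertext())
print(full_text)
```

For data-style XML such as the catalog, .tail normally holds only the whitespace between tags, which is why it is rarely needed.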
Navigating the Tree: find() and findall()
# find(): returns the FIRST matching child element (or None)
# findall(): returns a LIST of ALL matching child elements
# Find immediate children by tag name
name_elem = product.find("name") # First <name> child
price_elem = product.find("price") # First <price> child
print(name_elem.text) # 'Wireless Headphones'
print(price_elem.text) # '149.99'
print(price_elem.get("currency")) # 'USD'
# Find all products at the root level
all_products = root.findall("product")
print(f"Total products: {len(all_products)}") # 2
# Find nested elements (one level at a time)
specs = product.find("specs")
all_specs = specs.findall("spec")
for spec in all_specs:
    print(f"{spec.get('name')}: {spec.text}")
# battery_life: 40 hours
# connectivity: Bluetooth 5.0
# weight: 250g
# Find using a simple path (slash-separated tags)
# This navigates: product → specs → spec
all_spec_elems = product.findall("specs/spec")
print(len(all_spec_elems)) # 3
Iterating Over Children
# Iterate over direct children
for child in product:
    print(f"Tag: {child.tag}, Text: {child.text}")
# iter(): recursive iteration over the element itself and ALL descendants (all levels deep)
for elem in product.iter():
    if elem.text and elem.text.strip():
        print(f"{elem.tag}: {elem.text.strip()}")
# iter() with a tag filter: find ALL elements with a given tag, anywhere in the tree
for spec in root.iter("spec"): # Find all <spec> elements in the entire document
    print(f"{spec.get('name')}: {spec.text}")
for tag_elem in root.iter("tag"):
    print(tag_elem.text)
# audio, wireless, premium, usb, hub
find() vs. findall() vs. iter(): When to Use Each
| Method | Scope | Returns | Use When |
|---|---|---|---|
| elem.find("tag") | Direct children only | First match or None | Getting a single known child element |
| elem.findall("tag") | Direct children only | List of matches | Getting all children with a specific tag |
| elem.findall("a/b") | Two-level path | List of matches | Getting specific nested elements |
| elem.iter("tag") | The element itself and all descendants (recursive) | Iterator | Finding all occurrences anywhere in the tree |
| elem.iter() | The element itself and all descendants (recursive) | Iterator | Traversing the entire subtree |
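The scopes in the table can be checked on a small inline document. This is a sketch with a trimmed-down catalog (an assumption made so the example is self-contained); it also shows the ".//" path form, which searches all descendants and returns a list.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<catalog>"
    "<product id='PROD_001'><name>Wireless Headphones</name>"
    "<specs><spec name='weight'>250g</spec></specs></product>"
    "<product id='PROD_002'><name>USB-C Hub</name>"
    "<specs><spec name='ports'>7</spec></specs></product>"
    "</catalog>"
)

# find()/findall() with a bare tag only look at direct children
direct_specs = root.findall("spec")      # empty: <spec> is nested deeper
first_product = root.find("product")     # first <product> child

# A slash-separated path walks down one level per step
path_specs = root.findall("product/specs/spec")

# './/' searches all descendants, like iter() but returning a list
deep_specs = root.findall(".//spec")
iter_specs = list(root.iter("spec"))

# iter() with no argument yields the starting element first
all_tags = [elem.tag for elem in root.iter()]

print(len(direct_specs), len(path_specs), len(deep_specs), len(iter_specs))
print(first_product.get("id"), all_tags[0])
```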
XPath: Powerful XML Querying
XPath is a query language for XML — like SQL for databases but for navigating and selecting nodes in an XML tree. Python’s ElementTree supports a useful subset of XPath, and lxml supports the full XPath 1.0 standard.
XPath Expressions in ElementTree
import xml.etree.ElementTree as ET
# Parse the full catalog XML
tree = ET.parse("data/catalog.xml")
root = tree.getroot()
# Basic XPath syntax in ElementTree:
# // selects descendants at all levels, e.g. root.findall(".//spec")
# A path on an Element may not start with "/" or "//"; prefix it with "." (or use .iter())
# . means "current element"
# * matches any tag name
# Find all <product> children directly under root
products = root.findall("product")
# Find all <spec> elements that are grandchildren (product/specs/spec)
# Using path notation:
all_specs = root.findall("product/specs/spec")
# Find by attribute value — XPath predicate syntax
featured_products = root.findall("product[@featured='true']")
print(f"Featured products: {len(featured_products)}")
# Find specific product by ID attribute
prod_001 = root.find("product[@id='PROD_001']")
print(prod_001.find("name").text) # 'Wireless Headphones'
# Find products where a specific child exists
# (ElementTree supports this limited form)
products_with_specs = root.findall("product[specs]")
# Find the first child with a specific tag
first_product = root.find("product")
# Wildcard: '*' matches any direct child, so '*/name' finds <name> grandchildren of root
any_name = root.findall("*/name") # <name> elements one level below any child
# Get the text of all <name> elements under any direct child
names = [elem.text for elem in root.findall("*/name")]
print(names) # ['Wireless Headphones', 'USB-C Hub']
XPath Predicates for Filtering
# Filter by attribute presence
products_with_featured_attr = root.findall("product[@featured]")
# Filter by attribute value
featured = root.findall("product[@featured='true']")
not_featured = root.findall("product[@featured='false']")
# Filter by child element presence
has_description = root.findall("product[description]")
# Position predicates (ElementTree supports limited position)
first = root.find("product[1]") # First <product> (1-indexed in XPath!)
second = root.find("product[2]") # Second <product>
last = root.find("product[last()]") # Last <product>
Extracting XML to a pandas DataFrame
The goal for most data science work is transforming XML into a flat DataFrame. Here are the patterns for different XML structures.
Pattern 1: Flat Repeated Elements
When the XML has a list of elements with the same structure:
import xml.etree.ElementTree as ET
import pandas as pd
xml_data = """
<customers>
<customer id="CUST_001" premium="true">
<name>Jane Smith</name>
<email>jane.smith@email.com</email>
<city>Austin</city>
<lifetime_value>3847.50</lifetime_value>
</customer>
<customer id="CUST_002" premium="false">
<name>Bob Johnson</name>
<email>bob.j@email.com</email>
<city>Seattle</city>
<lifetime_value>892.00</lifetime_value>
</customer>
<customer id="CUST_003" premium="true">
<name>Alice Williams</name>
<email>alice.w@email.com</email>
<city>Chicago</city>
<lifetime_value>2341.75</lifetime_value>
</customer>
</customers>
"""
root = ET.fromstring(xml_data)
records = []
for customer in root.findall("customer"):
    record = {
        # Extract attributes
        "customer_id": customer.get("id"),
        "is_premium": customer.get("premium") == "true",
        # Extract child element text (findtext() returns None if the child is missing)
        "name": customer.findtext("name"),
        "email": customer.findtext("email"),
        "city": customer.findtext("city"),
        "lifetime_value": customer.findtext("lifetime_value")
    }
    records.append(record)
df = pd.DataFrame(records)
# Cast types
df["lifetime_value"] = pd.to_numeric(df["lifetime_value"])
print(df)
print(df.dtypes)
Note: element.findtext("tag") is a convenient shorthand for element.find("tag").text that safely returns None instead of raising AttributeError when the element doesn’t exist. If the element exists but has no text, it returns an empty string.
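The three possible outcomes of findtext() (text, empty string, None) matter when the result feeds a DataFrame, because empty strings and None are treated differently downstream. A small sketch follows; the helper findtext_or_none is a hypothetical name introduced here for illustration, not an ElementTree function.

```python
import xml.etree.ElementTree as ET

customer = ET.fromstring(
    "<customer><name>Jane Smith</name><city/></customer>"
)

present = customer.findtext("name")       # child exists with text
empty = customer.findtext("city")         # child exists but has no text
missing = customer.findtext("email")      # child does not exist
with_default = customer.findtext("email", default="n/a")


def findtext_or_none(elem, path):
    """Hypothetical helper: treat missing and blank children the same way."""
    value = elem.findtext(path)
    if value is None or not value.strip():
        return None
    return value.strip()


print(repr(present), repr(empty), repr(missing), repr(with_default))
print(findtext_or_none(customer, "city"))
```

Normalizing blanks to None this way means pandas sees a single kind of missing value per column.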
Pattern 2: Nested Elements Flattened into Columns
When child elements contain nested sub-elements you need as separate columns:
xml_products = """
<catalog>
<product id="PROD_001">
<name>Wireless Headphones</name>
<price currency="USD">149.99</price>
<specs>
<spec name="battery_life">40 hours</spec>
<spec name="connectivity">Bluetooth 5.0</spec>
<spec name="weight">250g</spec>
</specs>
<tags>
<tag>audio</tag>
<tag>wireless</tag>
</tags>
</product>
<product id="PROD_002">
<name>USB-C Hub</name>
<price currency="USD">49.99</price>
<specs>
<spec name="ports">7</spec>
<spec name="hdmi_output">4K@30Hz</spec>
</specs>
<tags>
<tag>usb</tag>
<tag>hub</tag>
</tags>
</product>
</catalog>
"""
root = ET.fromstring(xml_products)
records = []
for product in root.findall("product"):
    # Build base record from attributes and simple children
    record = {
        "product_id": product.get("id"),
        "name": product.findtext("name"),
        "price": float(product.findtext("price", "0")),
        "currency": product.find("price").get("currency") if product.find("price") is not None else None,
    }
    # Flatten key-value spec elements into separate columns
    specs_elem = product.find("specs")
    if specs_elem is not None:
        for spec in specs_elem.findall("spec"):
            col_name = f"spec_{spec.get('name')}"
            record[col_name] = spec.text
    # Collect array-like elements into a joined string
    tags = [tag.text for tag in product.findall("tags/tag")]
    record["tags"] = "|".join(tags) # Join as pipe-separated string
    records.append(record)
df_products = pd.DataFrame(records)
print(df_products)
print(df_products.columns.tolist())
# Columns appear in order of first occurrence across the records:
# ['product_id', 'name', 'price', 'currency',
# 'spec_battery_life', 'spec_connectivity', 'spec_weight',
# 'tags', 'spec_ports', 'spec_hdmi_output']
Pattern 3: Hierarchical Data, One Row Per Leaf
When the XML has parent-child relationships and you need one row per child (like orders → line items):
xml_orders = """
<orders>
<order id="ORD_1001" customer_id="CUST_001" date="2024-09-01" status="delivered">
<item product_id="PROD_001" quantity="1" unit_price="149.99"/>
<item product_id="PROD_007" quantity="2" unit_price="39.99"/>
</order>
<order id="ORD_1002" customer_id="CUST_002" date="2024-09-03" status="shipped">
<item product_id="PROD_002" quantity="1" unit_price="49.99"/>
<item product_id="PROD_003" quantity="1" unit_price="89.99"/>
<item product_id="PROD_006" quantity="3" unit_price="29.99"/>
</order>
</orders>
"""
root = ET.fromstring(xml_orders)
# One row per order item, with parent order fields repeated
line_items = []
for order in root.findall("order"):
    order_id = order.get("id")
    customer_id = order.get("customer_id")
    order_date = order.get("date")
    status = order.get("status")
    for item in order.findall("item"):
        quantity = int(item.get("quantity", 0))
        unit_price = float(item.get("unit_price", 0))
        line_items.append({
            "order_id": order_id,
            "customer_id": customer_id,
            "order_date": order_date,
            "status": status,
            "product_id": item.get("product_id"),
            "quantity": quantity,
            "unit_price": unit_price,
            "line_total": quantity * unit_price
        })
df_items = pd.DataFrame(line_items)
df_items["order_date"] = pd.to_datetime(df_items["order_date"])
print(df_items)
print(f"\nTotal order value: ${df_items['line_total'].sum():.2f}")
| order_id | customer_id | order_date | status | product_id | quantity | unit_price | line_total |
|---|---|---|---|---|---|---|---|
| ORD_1001 | CUST_001 | 2024-09-01 | delivered | PROD_001 | 1 | 149.99 | 149.99 |
| ORD_1001 | CUST_001 | 2024-09-01 | delivered | PROD_007 | 2 | 39.99 | 79.98 |
| ORD_1002 | CUST_002 | 2024-09-03 | shipped | PROD_002 | 1 | 49.99 | 49.99 |
| ORD_1002 | CUST_002 | 2024-09-03 | shipped | PROD_003 | 1 | 89.99 | 89.99 |
| ORD_1002 | CUST_002 | 2024-09-03 | shipped | PROD_006 | 3 | 29.99 | 89.97 |
Handling XML Namespaces
Namespaces are XML’s most commonly confusing feature. They’re used when combining XML vocabularies from multiple sources (e.g., SOAP responses, XBRL financial data, HL7 healthcare records) to avoid tag name collisions.
What Namespaces Look Like
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://example.com/catalog/v2"
xmlns:meta="http://example.com/metadata"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<meta:created>2024-09-15</meta:created>
<meta:author>Jane Smith</meta:author>
<product id="PROD_001">
<name>Wireless Headphones</name>
<price xsi:type="decimal">149.99</price>
</product>
</catalog>
The xmlns attributes declare namespaces. In the parsed tree, ElementTree prepends the full namespace URI (in curly braces) to every tag name:
import xml.etree.ElementTree as ET
tree = ET.parse("data/catalog_ns.xml")
root = tree.getroot()
# With namespace, the root tag is NOT 'catalog'
print(root.tag)
# '{http://example.com/catalog/v2}catalog'
# Finding elements fails without the namespace:
products = root.findall("product")
print(products) # [] — empty! Because the tag is actually '{...}product'
# Must include the namespace URI in curly braces:
NS = "http://example.com/catalog/v2"
products = root.findall(f"{{{NS}}}product")
print(len(products)) # 1 (the sample above contains one <product>)
The Clean Solution: Namespace Dictionaries
Define a namespace dictionary and use prefixes in your queries:
import xml.etree.ElementTree as ET
tree = ET.parse("data/catalog_ns.xml")
root = tree.getroot()
# Define namespace prefixes for use in XPath
namespaces = {
"cat": "http://example.com/catalog/v2",
"meta": "http://example.com/metadata",
"xsi": "http://www.w3.org/2001/XMLSchema-instance"
}
# Use prefix:tag syntax — much cleaner than {uri}tag
products = root.findall("cat:product", namespaces)
print(len(products)) # 1 (the sample above contains one <product>)
for product in products:
    name = product.findtext("cat:name", namespaces=namespaces)
    price = product.findtext("cat:price", namespaces=namespaces)
    print(f"{name}: {price}")
# Get the metadata elements (different namespace)
created = root.findtext("meta:created", namespaces=namespaces)
print(f"Created: {created}")
Extracting Namespaces Automatically
import xml.etree.ElementTree as ET
import re
def extract_namespaces(xml_file: str) -> dict:
    """
    Extract all namespace declarations from an XML file
    and return as a prefix → URI dictionary.
    Useful when you don't know the namespaces in advance.
    """
    namespaces = {}
    # Parse namespace declarations from the raw XML text
    with open(xml_file, "r", encoding="utf-8") as f:
        content = f.read()
    # Extract xmlns declarations; prefixes may contain hyphens or dots (e.g. us-gaap)
    ns_pattern = r'xmlns(?::([\w.-]+))?\s*=\s*["\']([^"\']+)["\']'
    for match in re.finditer(ns_pattern, content):
        prefix = match.group(1) or "default"
        uri = match.group(2)
        namespaces[prefix] = uri
    return namespaces
# Usage
ns = extract_namespaces("data/catalog_ns.xml")
print(ns)
# {'default': 'http://example.com/catalog/v2',
# 'meta': 'http://example.com/metadata',
# 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
# Rename 'default' to something meaningful
ns["cat"] = ns.pop("default")
Stripping Namespaces for Simple Cases
When you just want to extract data and don’t care about namespace semantics, strip them:
import xml.etree.ElementTree as ET
def strip_namespaces(root: ET.Element) -> ET.Element:
    """Remove namespace URIs from all tag and attribute names, in place."""
    # Work on the parsed tree rather than the raw text: deleting xmlns
    # declarations with a regex leaves prefixed attributes such as xsi:type
    # unbound, and the document then fails to parse.
    for elem in root.iter():
        if "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1] # '{uri}product' → 'product'
        for key in list(elem.attrib):
            if "}" in key:
                elem.attrib[key.split("}", 1)[1]] = elem.attrib.pop(key)
    return root
# Use on small XML documents for quick extraction
root = strip_namespaces(ET.fromstring(raw_xml_with_namespaces))
# Now tags are namespace-free: root.findall("product") works normally
Note: Only strip namespaces when you control the XML and know it’s safe to do so. In production systems working with standard XML vocabularies (HL7, XBRL), preserve namespaces for correctness.
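The standard library also offers two ways to handle unknown namespaces without touching the raw text. The sketch below assumes Python 3.8 or later (for the "{*}" wildcard) and uses a shortened inline version of the namespaced catalog; the variable names are illustrative.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"""<catalog xmlns="http://example.com/catalog/v2"
         xmlns:meta="http://example.com/metadata">
  <meta:created>2024-09-15</meta:created>
  <product id="PROD_001"><name>Wireless Headphones</name></product>
</catalog>"""

# 'start-ns' events yield (prefix, uri) pairs as declarations are read;
# the default namespace is reported with an empty prefix
declared = {}
for _event, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=["start-ns"]):
    declared[prefix or "default"] = uri

root = ET.fromstring(xml_bytes)

# '{*}tag' matches the tag in any namespace, or in none
products = root.findall("{*}product")
names = [p.findtext("{*}name") for p in products]

print(declared)
print(names)
```

The wildcard keeps the tree unchanged, so it can be a lighter alternative to stripping when you only need a few lookups.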
lxml: The Power Parser
lxml is a fast, full-featured XML/HTML library that wraps the libxml2 C library. It’s API-compatible with ElementTree but adds full XPath 1.0 support, XSLT transformations, schema validation, and significantly better performance on large files.
pip install lxml
lxml vs. ElementTree: When to Use Each
| Feature | ElementTree | lxml |
|---|---|---|
| Installation | Built-in, no install | Requires pip install lxml |
| Speed | Good | Typically faster, especially on large files (benchmark on your own data) |
| XPath support | Subset only | Full XPath 1.0 |
| XSLT | No | Yes |
| XML schema validation | No | Yes (XSD) |
| HTML parsing | Limited | Full (via lxml.html) |
| API compatibility | Standard | ElementTree-compatible |
Use ElementTree for: simple files, standard formats, production code requiring no extra dependencies. Use lxml for: large files, complex XPath queries, namespace-heavy data, XBRL/HL7/SOAP parsing.
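Because the two libraries share most of their API, one common pattern is to prefer lxml when it is installed and fall back to the standard library otherwise. This is a sketch of that idea; the BACKEND name is an assumption added for illustration, and the approach only works if the code sticks to calls both libraries support.

```python
try:
    from lxml import etree                      # third-party, C-accelerated
    BACKEND = "lxml"
except ImportError:
    import xml.etree.ElementTree as etree       # standard library fallback
    BACKEND = "ElementTree"

root = etree.fromstring(b"<catalog><product id='PROD_001'/></catalog>")

# Stick to the shared API subset so both backends behave the same
ids = [p.get("id") for p in root.findall("product")]
print(BACKEND, ids)
```

Anything lxml-specific, such as root.xpath(), would need a guard on BACKEND.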
lxml Basics
from lxml import etree
# Parse from file
tree = etree.parse("data/catalog.xml")
root = tree.getroot()
# Parse from string
root = etree.fromstring(xml_string.encode("utf-8"))
# API is mostly identical to ElementTree
print(root.tag)
products = root.findall("product")
# lxml-specific: XPath method on elements
prices = root.xpath("//price") # All <price> elements anywhere
price_texts = root.xpath("//price/text()") # Text of all <price> elements
featured = root.xpath("//product[@featured='true']")
# XPath with predicates and functions
expensive = root.xpath("//product[number(price) > 100]")
names = root.xpath("//product/name/text()")
print(names) # ['Wireless Headphones', 'USB-C Hub']
# XPath with namespaces
ns = {"cat": "http://example.com/catalog/v2"}
products = root.xpath("//cat:product", namespaces=ns)
names = root.xpath("//cat:product/cat:name/text()", namespaces=ns)
Full XPath 1.0 Expressions in lxml
from lxml import etree
root = etree.parse("data/catalog.xml").getroot()
# All nodes anywhere in the document (// = anywhere)
all_specs = root.xpath("//spec")
# Attribute selection
all_ids = root.xpath("//product/@id") # Get attribute values
print(all_ids) # ['PROD_001', 'PROD_002']
# Text extraction
all_names = root.xpath("//product/name/text()")
all_prices = root.xpath("//price/text()")
# Conditional selection
expensive = root.xpath("//product[price > 100]")
specific = root.xpath("//product[@id='PROD_001']")
# Count
product_count = root.xpath("count(//product)")
print(int(product_count)) # 2
# String functions
# Products whose name contains 'USB'
usb_products = root.xpath("//product[contains(name, 'USB')]")
# Normalize-space to handle whitespace
names_clean = root.xpath("//name[normalize-space(.) != '']")
# Position functions
first = root.xpath("//product[position()=1]")
last = root.xpath("//product[last()]")
# Combining multiple conditions
# Featured products with price over 100
premium_expensive = root.xpath("//product[@featured='true' and price > 100]")
Streaming Large XML Files: iterparse
For large XML files (hundreds of megabytes or gigabytes), loading the full document into memory with parse() is impractical. Both ElementTree and lxml provide iterparse() — an event-driven, streaming parser that processes the document element by element.
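Before the full helper, here is a minimal sketch of the streaming idea on an in-memory document. It assumes the records are direct children of the root, and it uses io.BytesIO in place of a real file so the example is self-contained.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"""<catalog>
  <product id="PROD_001"><name>Wireless Headphones</name><price>149.99</price></product>
  <product id="PROD_002"><name>USB-C Hub</name><price>49.99</price></product>
</catalog>"""

rows = []
context = ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end"))
_, root = next(context)      # the first event is the start of the root element

for event, elem in context:
    # An 'end' event means the element and all of its children are complete
    if event == "end" and elem.tag == "product":
        rows.append((elem.get("id"),
                     elem.findtext("name"),
                     float(elem.findtext("price"))))
        root.clear()         # drop finished records so memory stays bounded

print(rows)
```

Only one record is held in the tree at a time; what grows is the rows list, which you control.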
iterparse Basics
import xml.etree.ElementTree as ET
import pandas as pd
def stream_xml_records(filepath: str,
                       record_tag: str,
                       fields: dict,
                       max_records: int = None) -> list:
    """
    Stream-parse a large XML file and extract records efficiently.
    Processes the file without loading it entirely into memory by
    using iterparse to parse event-by-event.
    Parameters
    ----------
    filepath : str
        Path to the XML file.
    record_tag : str
        The XML tag that represents one record (e.g., 'product', 'order').
    fields : dict
        Mapping of {output_column_name: field_spec}.
        Simple path: 'name' (child text), '@id' (attribute).
    max_records : int, optional
        Stop after this many records (for testing).
    Returns
    -------
    list of dict
        Extracted records.
    """
    records = []
    count = 0
    # iterparse yields (event, element) tuples
    # 'start' fires on an opening tag; it is used here only to grab the root
    # 'end' fires after an element and all its children have been parsed
    context = ET.iterparse(filepath, events=["start", "end"])
    _, root = next(context)
    for event, elem in context:
        # Wait for the end of a complete record. Do NOT clear other elements
        # here: a record's children end before the record itself, so clearing
        # them would erase the text we are about to extract.
        if event != "end" or elem.tag != record_tag:
            continue
        # Extract this record
        record = {}
        for col_name, field_spec in fields.items():
            if field_spec.startswith("@"):
                # Attribute
                record[col_name] = elem.get(field_spec[1:])
            else:
                # Child element text
                child = elem.find(field_spec)
                record[col_name] = child.text if child is not None else None
        records.append(record)
        count += 1
        # CRUCIAL: Clear the finished record to free memory, and clear the
        # root so emptied records do not pile up under it (this assumes the
        # records are direct children of the root)
        elem.clear()
        root.clear()
        if max_records and count >= max_records:
            break
        if count % 100_000 == 0:
            print(f"Processed {count:,} records...")
    print(f"Total records: {len(records):,}")
    return records
# Parse a 2GB product catalog with millions of entries
fields = {
    "product_id": "@id",
    "name": "name",
    "price": "price",
    "category": "category"
}
records = stream_xml_records(
    "data/large_catalog.xml",
    record_tag="product",
    fields=fields
)
df = pd.DataFrame(records)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
The Memory Management Pattern
The most important thing about iterparse is clearing elements after processing to prevent memory from accumulating:
import xml.etree.ElementTree as ET
# process() stands for your own per-record function
# WRONG: memory grows with every element
for event, elem in ET.iterparse("large_file.xml", events=["end"]):
    if elem.tag == "record":
        process(elem)
        # Without clear(), elem stays in memory even though we're done with it!
# CORRECT: memory stays bounded
for event, elem in ET.iterparse("large_file.xml", events=["end"]):
    if elem.tag == "record":
        process(elem)
        elem.clear() # Free memory immediately after processing
# With lxml, also delete already-processed siblings from the parent.
# getprevious() and getparent() exist only in lxml, not in xml.etree.
from lxml import etree
for event, elem in etree.iterparse("large_file.xml", events=("end",), tag="record"):
    process(elem)
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
Real-World Example: Parsing an RSS Feed
RSS feeds are a common XML format — here’s a complete parser:
import html
import xml.etree.ElementTree as ET
import pandas as pd
import requests
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
DC_NS = "http://purl.org/dc/elements/1.1/"
def parse_rss_feed(url: str, max_items: int = None) -> pd.DataFrame:
    """
    Parse an RSS feed and return articles as a DataFrame.
    Parameters
    ----------
    url : str
        URL of the RSS feed (must return valid RSS/Atom XML).
    max_items : int, optional
        Maximum number of items to return.
    Returns
    -------
    pd.DataFrame
        Articles with title, link, description, pub_date, and author.
    """
    response = requests.get(url, timeout=15)
    response.raise_for_status()
    root = ET.fromstring(response.content)
    # RSS structure: <rss> → <channel> → <item>
    channel = root.find("channel")
    if channel is None:
        # Try Atom format: <feed> → <entry>
        items = root.findall("atom:entry", ATOM_NS)
        is_atom = True
    else:
        items = channel.findall("item")
        is_atom = False
    records = []
    for item in items[:max_items]:
        if is_atom:
            # Atom keeps the URL in the href attribute of <link>, not in its text;
            # fall back to <id> when no <link> is present
            link_elem = item.find("atom:link", ATOM_NS)
            record = {
                "title": item.findtext("atom:title", namespaces=ATOM_NS),
                "link": (link_elem.get("href") if link_elem is not None
                         else item.findtext("atom:id", namespaces=ATOM_NS)),
                "description": item.findtext("atom:summary", namespaces=ATOM_NS),
                "pub_date": item.findtext("atom:published", namespaces=ATOM_NS),
                "author": item.findtext("atom:author/atom:name", namespaces=ATOM_NS)
            }
        else:
            # Standard RSS 2.0
            record = {
                "title": item.findtext("title"),
                "link": item.findtext("link"),
                "description": item.findtext("description"),
                "pub_date": item.findtext("pubDate"),
                "author": (item.findtext("author") or
                           item.findtext(f"{{{DC_NS}}}creator"))
            }
        records.append(record)
    df = pd.DataFrame(records)
    if df.empty:
        return df
    # Clean description text (often contains HTML)
    def strip_markup(text):
        if not text or "<" not in text:
            return text
        try:
            plain = ET.tostring(ET.fromstring(f"<d>{text}</d>"),
                                encoding="unicode", method="text")
        except ET.ParseError:
            # Feed HTML is often not well-formed XML; keep the raw text
            return text
        return html.unescape(plain)
    df["description"] = df["description"].apply(strip_markup)
    # Parse dates
    for date_format in ["%a, %d %b %Y %H:%M:%S %z",
                        "%a, %d %b %Y %H:%M:%S %Z",
                        "%Y-%m-%dT%H:%M:%SZ"]:
        try:
            df["pub_date"] = pd.to_datetime(df["pub_date"], format=date_format)
            break
        except (ValueError, TypeError):
            continue
    print(f"Parsed {len(df)} articles from feed")
    return df
# Usage: works with any standard RSS feed
# bbc_news = parse_rss_feed("https://feeds.bbci.co.uk/news/rss.xml", max_items=20)
# print(bbc_news[["title", "pub_date"]].head(10))
Real-World Example: XBRL Financial Data
XBRL (eXtensible Business Reporting Language) is XML-based financial reporting used by the SEC and other regulatory bodies. It’s namespace-heavy and requires lxml for practical parsing:
from lxml import etree
import pandas as pd
def parse_xbrl_facts(filepath: str) -> pd.DataFrame:
    """
    Extract financial facts from an XBRL instance document.
    Returns a DataFrame with element name, value, context, and unit.
    """
    tree = etree.parse(filepath)
    root = tree.getroot()
    # XBRL commonly used namespaces (shown for reference, not used below;
    # the us-gaap and dei URIs change with each taxonomy year)
    ns = {
        "xbrl": "http://www.xbrl.org/2003/instance",
        "us-gaap": "http://fasb.org/us-gaap/2023-01-31",
        "dei": "http://xbrl.sec.gov/dei/2023"
    }
    facts = []
    # Extract all non-tuple, non-context/unit elements as facts.
    # iter(tag=etree.Element) skips comments and processing instructions,
    # whose .tag is not a string and would make QName() fail.
    for elem in root.iter(tag=etree.Element):
        # Skip structural elements
        tag = etree.QName(elem.tag)
        if tag.namespace in (
            "http://www.xbrl.org/2003/instance",
            "http://www.w3.org/2001/XMLSchema-instance"
        ):
            continue
        # Facts carry a contextRef; this filters out dimension members and
        # other text-bearing elements nested inside contexts
        if elem.get("contextRef") is None:
            continue
        if elem.text and elem.text.strip():
            facts.append({
                "namespace": tag.namespace,
                "element": tag.localname,
                "value": elem.text.strip(),
                "context_ref": elem.get("contextRef"),
                "unit_ref": elem.get("unitRef"),
                "decimals": elem.get("decimals")
            })
    df = pd.DataFrame(facts)
    if not df.empty:
        df["value_numeric"] = pd.to_numeric(df["value"], errors="coerce")
    return df
Writing XML in Python
For completeness — building XML documents programmatically:
import xml.etree.ElementTree as ET
def build_product_catalog(products: list) -> str:
    """
    Build an XML catalog document from a list of product dicts.
    Returns the XML as a formatted string.
    """
    root = ET.Element("catalog", version="2.0")
    for p in products:
        product_elem = ET.SubElement(root, "product",
                                     id=p["product_id"],
                                     featured=str(p.get("featured", False)).lower())
        ET.SubElement(product_elem, "name").text = p["name"]
        ET.SubElement(product_elem, "category").text = p.get("category", "")
        price_elem = ET.SubElement(product_elem, "price", currency="USD")
        price_elem.text = str(p["price"])
        if p.get("tags"):
            tags_elem = ET.SubElement(product_elem, "tags")
            for tag in p["tags"]:
                ET.SubElement(tags_elem, "tag").text = tag
    # Format output (indent requires Python 3.9+)
    ET.indent(root, space="  ")
    # Serialize as UTF-8 bytes so the declaration names utf-8, then decode.
    # With encoding="unicode" the declaration names the locale's encoding instead,
    # which may not match the encoding used when the string is written to disk.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")
# Build and save
products = [
{"product_id": "PROD_001", "name": "Wireless Headphones",
"category": "Electronics", "price": 149.99,
"featured": True, "tags": ["audio", "wireless"]},
{"product_id": "PROD_002", "name": "USB-C Hub",
"category": "Electronics", "price": 49.99,
"featured": False, "tags": ["usb", "hub"]},
]
xml_output = build_product_catalog(products)
with open("output/catalog.xml", "w", encoding="utf-8") as f:
f.write(xml_output)
print(xml_output[:500])
pandas read_xml(): Direct XML to DataFrame
pandas 1.3+ added pd.read_xml(), which reads flat or mildly nested XML directly into a DataFrame:
from io import StringIO
import pandas as pd
# Read XML with a flat structure
df = pd.read_xml("data/customers.xml")
# Specify the element to use as rows with xpath
df = pd.read_xml(
    "data/catalog.xml",
    xpath="//product",  # XPath to the repeating element
)
# With namespaces: map a prefix to the namespace URI and use it in the XPath
df = pd.read_xml(
    "data/ns_catalog.xml",
    xpath="//cat:product",
    namespaces={"cat": "http://example.com/catalog/v2"},
)
# From a string (wrap it in StringIO; passing literal XML is deprecated since pandas 2.1)
df = pd.read_xml(StringIO(xml_string), xpath="//customer")
# Specify parser ("lxml" is the default; parser="etree" uses the standard library instead)
df = pd.read_xml("data/catalog.xml", parser="lxml")
print(df.head())
pd.read_xml() works well for documents with a clean, flat repeated element structure. For documents with heavy nesting, mixed content, or complex namespace handling, the manual ElementTree approach gives more control.
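As a rough illustration of that manual approach, the sketch below flattens a small nested document into one dict per product. The sample XML, the flatten_products name, and the choice to join tags with ";" are assumptions made for this example, not part of any library API.

```python
import xml.etree.ElementTree as ET

# Illustrative document: the nested <tags> list and the currency attribute
# are the parts that need explicit handling when flattening.
XML_TEXT = """<catalog>
  <product id="PROD_001">
    <name>Wireless Headphones</name>
    <price currency="USD">149.99</price>
    <tags><tag>audio</tag><tag>wireless</tag></tags>
  </product>
  <product id="PROD_002">
    <name>USB-C Hub</name>
    <price currency="USD">49.99</price>
  </product>
</catalog>"""


def flatten_products(root):
    """Return one flat dict per <product> (hypothetical helper)."""
    rows = []
    for product in root.findall("product"):
        price = product.find("price")
        has_price = price is not None and price.text
        rows.append({
            "id": product.get("id"),
            "name": product.findtext("name"),
            "price": float(price.text) if has_price else None,
            "currency": price.get("currency") if price is not None else None,
            # Collapse the nested tag list into a single delimited column
            "tags": ";".join(t.text or "" for t in product.findall("tags/tag")),
        })
    return rows


rows = flatten_products(ET.fromstring(XML_TEXT))
print(rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) if pandas is available.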
Best Practices for XML Parsing in Data Science
Always Use findtext() Instead of find().text
# FRAGILE: AttributeError if element doesn't exist
name = product.find("name").text
# SAFE: Returns None (or default) if element not found
name = product.findtext("name")
name = product.findtext("name", default="Unknown")
Use get() for Attributes
# FRAGILE: KeyError if attribute doesn't exist
product_id = product.attrib["id"]
# SAFE: Returns None or default if attribute not found
product_id = product.get("id")
product_id = product.get("id", "UNKNOWN")
Validate Structure Before Processing
import xml.etree.ElementTree as ET
def safe_extract_products(root: ET.Element) -> list:
    """Extract products with validation before processing."""
    if root.tag not in ("catalog", "products"):
        raise ValueError(f"Unexpected root tag: {root.tag}")
    products = root.findall("product")
    if not products:
        raise ValueError("No <product> elements found in document")
    required_children = {"name", "price"}
    records = []
    for i, product in enumerate(products):
        children_present = {child.tag for child in product}
        missing = required_children - children_present
        if missing:
            # Skip incomplete records instead of failing the whole run
            print(f"Warning: product {i} missing required fields: {missing}")
            continue
        records.append({
            "id": product.get("id"),
            "name": product.findtext("name"),
            "price": product.findtext("price"),
        })
    return records
Handle Encoding Explicitly
import xml.etree.ElementTree as ET
# When you open an XML file yourself, always specify the encoding
with open("data/catalog.xml", "r", encoding="utf-8") as f:
    xml_content = f.read()
root = ET.fromstring(xml_content)
# Or pass the path to the parser and let it detect the encoding from the XML declaration
from lxml import etree
tree = etree.parse("data/catalog.xml")  # lxml handles encoding automatically
Summary
XML parsing in Python has two primary tools: the built-in xml.etree.ElementTree module for standard use cases and the lxml library for large files, full XPath support, and namespace-heavy documents. Both work with the same fundamental concept — a tree of Element objects with .tag, .attrib, .text, and child relationships — making skills transfer directly between the two.
The most important practical skills are: finding elements with find() and findall(), extracting text safely with findtext(), extracting attributes safely with .get(), using namespace dictionaries to query namespace-qualified documents, flattening nested XML structures into pandas DataFrames using record-by-record iteration, and streaming large files with iterparse to avoid memory exhaustion.
XML will never be as convenient as JSON for data science work — its verbosity, namespace complexity, and lack of native array handling add friction. But when you encounter it in healthcare, finance, enterprise systems, or government data, the techniques in this article give you everything you need to extract clean, analyzable data from even the most complex XML documents.
Key Takeaways
- ET.parse("file.xml").getroot() parses a file into an Element tree; ET.fromstring(xml_string) parses a string. Both return a root Element with .tag, .attrib, .text, and child elements.
- element.find("tag") returns the first matching direct child or None; element.findall("tag") returns all matching direct children as a list; element.iter("tag") recursively finds all descendants with that tag.
- Always use element.findtext("tag") instead of element.find("tag").text: findtext() safely returns None when the element doesn’t exist instead of raising AttributeError.
- Always use element.get("attr") instead of element.attrib["attr"] for attributes: .get() safely returns None for missing attributes.
- XML namespaces appear as {uri}tag in parsed element tags. Define a namespace dictionary and use the findall("prefix:tag", namespaces) syntax to query namespace-qualified documents cleanly.
- lxml provides full XPath 1.0 with element.xpath("//product[@id='X']/name/text()") and is generally faster than ElementTree on large files. Prefer it for XBRL, HL7, SOAP, and other complex XML vocabularies.
- iterparse streams large XML files event-by-event without loading everything into memory. Always call elem.clear() after processing each record to prevent memory accumulation.
- pd.read_xml() handles simple flat XML structures in one line; for complex nesting, build the DataFrame manually by iterating over elements and extracting fields with findtext() and .get().
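The namespace and iterparse takeaways can be combined in one short sketch. It uses only the standard library; the sample document, the namespace URI, and the stream_products name are assumptions chosen for illustration, and an in-memory BytesIO stands in for a large file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical namespaced catalog; the URI and field names are illustrative.
NS = {"cat": "http://example.com/catalog/v2"}
XML_BYTES = b"""<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="http://example.com/catalog/v2">
  <product id="PROD_001"><name>Wireless Headphones</name><price>149.99</price></product>
  <product id="PROD_002"><name>USB-C Hub</name></product>
</catalog>"""


def stream_products(source):
    """Yield one dict per <product>, clearing each element once it is read."""
    # Parsed tags carry the namespace in {uri}tag form
    product_tag = f"{{{NS['cat']}}}product"
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != product_tag:
            continue
        record = {
            "id": elem.get("id"),
            "name": elem.findtext("cat:name", namespaces=NS),
            "price": elem.findtext("cat:price", namespaces=NS),
        }
        elem.clear()  # drop the children and text we no longer need
        yield record


records = list(stream_products(io.BytesIO(XML_BYTES)))
print(records)
```

Building each record before calling elem.clear() matters here, because clear() also removes the element's attributes and text.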








