How to Parse XML Fast in 2026 (Python)

Published: April 8, 2026 at 07:53 PM EDT

6 min read

Source: Dev.to

JSON won the internet. We all know that. But XML never left — it just moved.
Reliability matters more than trendiness.

If you work with Maven configs, Android manifests, Office Open XML (.docx/.xlsx), or Python’s standard library (xml.etree.ElementTree), you already know XML works.
The usual answer for speed is lxml, which wraps the C library libxml2.

What if you need the fastest possible parse with a tiny footprint?
That question led me to build pygixml – a Python wrapper around pugixml, one of the fastest XML parsers.

Benchmarks

Parse time (5 000‑element document)

Library	Parse time	Speed‑up vs. ElementTree
pygixml	0.0009 s	8.6× faster
lxml	0.0041 s	1.9× faster
ElementTree	0.0076 s	1.0× (baseline)

Memory usage (same parse)

Library	Peak memory
pygixml	0.67 MB
lxml	0.67 MB
ElementTree	4.84 MB

ElementTree uses ~7× more memory because it materializes every node as a Python object.

Package size

Package	Size
pygixml	0.43 MB
lxml	5.48 MB

That is a roughly 12.7× difference (5.48 MB vs. 0.43 MB) – a big win for Docker images, AWS Lambda layers, and similar size‑sensitive deployments.

All benchmarks were run on the same machine using time.perf_counter() across 5 runs (see the benchmarks/ directory).
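
The timing method described above (time.perf_counter() over 5 runs) could look roughly like the following sketch. It uses only the standard library and times ElementTree. The helper names make_document and time_parse and the layout of the synthetic document are assumptions for illustration, not code taken from the benchmarks/ directory.

```python
import statistics
import time
import xml.etree.ElementTree as ET


def make_document(n_elements: int) -> str:
    """Build a synthetic XML string with n_elements <item> children (hypothetical layout)."""
    items = "".join(f'<item id="{i}">Item {i}</item>' for i in range(n_elements))
    return f"<root>{items}</root>"


def time_parse(parse, xml_text: str, runs: int = 5) -> dict:
    """Call parse(xml_text) `runs` times and report average and minimum wall time."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        parse(xml_text)
        samples.append(time.perf_counter() - start)
    return {"avg": statistics.mean(samples), "min": min(samples), "runs": runs}


xml_text = make_document(5000)
result = time_parse(ET.fromstring, xml_text)
print(f"ElementTree: avg {result['avg']:.6f} s, min {result['min']:.6f} s")
```

Passing a different parse callable (for example lxml.etree.fromstring, if lxml is installed) reuses the same harness, so every library is timed on the same input.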

Why pygixml is fast

No Python object per node – the whole tree lives in C++ memory.
Lazy Python wrappers – a wrapper is created only when you explicitly access a node.
Zero‑copy Cython bridge – data isn’t copied between C++ and Python; strings are encoded in‑place.
pugixml’s custom allocator – a block‑based memory pool reduces syscalls and improves cache locality.
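
The first two points can be illustrated with a pure‑Python toy. This is a conceptual sketch only, not pygixml's implementation: FlatTree stands in for a tree held in native memory, and NodeView is a hypothetical wrapper that is built only when a node is actually accessed.

```python
class FlatTree:
    """Toy stand-in for a tree kept outside Python objects: parallel lists, no node instances."""

    def __init__(self):
        self.names = []    # element name per node
        self.parents = []  # parent index per node (-1 for the root)

    def add(self, name, parent=-1):
        self.names.append(name)
        self.parents.append(parent)
        return len(self.names) - 1


class NodeView:
    """Lightweight wrapper created on demand for a single node index."""

    __slots__ = ("_tree", "_index")
    created = 0  # class-level counter showing how many wrappers were built

    def __init__(self, tree, index):
        self._tree = tree
        self._index = index
        NodeView.created += 1

    @property
    def name(self):
        return self._tree.names[self._index]

    def child(self, name):
        """Return a view of the first child with the given name, or None."""
        pairs = zip(self._tree.names, self._tree.parents)
        for i, (node_name, parent) in enumerate(pairs):
            if parent == self._index and node_name == name:
                return NodeView(self._tree, i)
        return None


tree = FlatTree()
root_index = tree.add("library")
for _ in range(1000):
    book_index = tree.add("book", root_index)
    tree.add("title", book_index)

root = NodeView(tree, root_index)
title = root.child("book").child("title")
print(title.name, NodeView.created)  # title 3
```

The tree holds 2001 nodes, yet only three wrapper objects exist after the lookup. An eager design would have allocated one Python object per node up front.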

Installation

pip install pygixml   # no dependencies, ~430 KB

Quick start

import pygixml

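# Sample document (the root element name "library" is illustrative)
xml = """
<library>
    <book id="1">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
    <book id="2">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
</library>
"""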

doc  = pygixml.parse_string(xml)
root = doc.root

# Access children
book = root.child("book")
print(book.name)                      # book
print(book.attribute("id").value)     # 1
print(book.child("title").text())     # The Great Gatsby

API philosophy

Simple properties for common access: node.name, node.value, node.type.
Methods for hierarchical operations: node.child(name), node.text().
No surprises – the API stays intentionally minimal.

XPath support

# All fiction books
fiction = root.select_nodes("book[@category='fiction']")
print(f"Found {len(fiction)} fiction books")

# Single match
match = root.select_node("book[@id='2']")
if match:
    print(match.node.child("title").text())   # 1984

# Pre‑compile for repeated use
query = pygixml.XPathQuery("book[year > 1950]")
recent = query.evaluate_node_set(root)

# Scalar evaluations
avg = pygixml.XPathQuery(
    "sum(book/price) div count(book)"
).evaluate_number(root)
print(f"Average price: ${avg:.2f}")

has_orwell = pygixml.XPathQuery(
    "book[author='George Orwell']"
).evaluate_boolean(root)
print(f"Has Orwell: {has_orwell}")
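
For comparison, here is a sketch of the same kind of queries with the standard library's ElementTree, which supports only a subset of XPath. The sample document, including its category and price values, is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data (not from the article)
xml = """
<library>
    <book id="1" category="fiction"><title>The Great Gatsby</title><price>10.99</price></book>
    <book id="2" category="fiction"><title>1984</title><price>8.99</price></book>
    <book id="3" category="science"><title>Cosmos</title><price>15.50</price></book>
</library>
"""
root = ET.fromstring(xml)

# Attribute predicates are part of ElementTree's XPath subset.
fiction = root.findall("book[@category='fiction']")
print(f"Found {len(fiction)} fiction books")

# XPath functions such as sum() and count() are not available,
# so the aggregation happens in Python instead.
prices = [float(p.text) for p in root.findall("book/price")]
avg = sum(prices) / len(prices)
print(f"Average price: ${avg:.2f}")
```

The difference is where the work happens: a full XPath 1.0 engine evaluates the expression inside the parser, while the ElementTree version pulls values into Python first.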

Building XML documents

doc  = pygixml.XMLDocument()
root = doc.append_child("catalog")
item = root.append_child("product")
item.append_child("name").set_value("Laptop")
item.append_child("price").set_value("999.99")

doc.save_file("catalog.xml")

Modifying an existing document

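# "person" is an illustrative root element name
doc  = pygixml.parse_string("<person><name>John</name></person>")
root = doc.root

root.child("name").set_value("Jane")
root.child("name").name = "full_name"
root.append_child("age").set_value("30")

print(root.xml)
# <person>
#   <full_name>Jane</full_name>
#   <age>30</age>
# </person>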

Parse flags – fine‑grained control

The ParseFlags enum provides 18 options to tailor parsing exactly to your needs.

# Fastest possible parse – skip everything optional
doc = pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)

# Pick exactly what you need (e.g., keep comments and CDATA)
flags = pygixml.ParseFlags.COMMENTS | pygixml.ParseFlags.CDATA
doc = pygixml.parse_string(xml, flags)

ParseFlags.MINIMAL skips escape processing, EOL normalization, and entity expansion (&amp;, &lt;, …), giving a noticeable speed boost.
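
To make "entity expansion" concrete, the sketch below uses the standard library to show what an expanding parse returns, and emulates the unexpanded text by slicing the source string. The second half is an illustration of the idea only, not actual pygixml output.

```python
import xml.etree.ElementTree as ET

raw = "<note>Tom &amp; Jerry &lt;3</note>"

# A regular parse expands the predefined entities in text content.
expanded = ET.fromstring(raw).text
print(expanded)      # Tom & Jerry <3

# With escape processing skipped, a parser can hand back the characters
# between the tags untouched. Emulated here by slicing the source string.
unexpanded = raw[len("<note>"):-len("</note>")]
print(unexpanded)    # Tom &amp; Jerry &lt;3
```

Skipping this step saves work per text node, but downstream code then has to cope with raw escapes, so it suits inputs where text content is known to be escape‑free or is never read.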

Feature comparison

Feature	pygixml	lxml	ElementTree
Parse speed	Fastest	Fast	Slowest
Memory	Low	Low	High (≈ 7×)
Package size	0.43 MB	5.48 MB	Built‑in
XPath	1.0	1.0	Limited subset
XSLT	No	Yes	No
Schema validation	No	Yes	No
Dependencies	None	libxml2, libxslt	None

Running the benchmarks yourself

git clone https://github.com/MohammadRaziei/pygixml.git
cd pygixml

The project uses CMake; benchmarks are built‑in targets:

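# Configure once (standard CMake step; add any project-specific options as needed)
cmake -S . -B build

# Full suite: parsing (6 sizes), memory, package size
cmake --build build --target run_full_benchmarks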

# Legacy parsing‑only benchmark
cmake --build build --target run_benchmarks

# Or directly with Python
python benchmarks/full_benchmark.py

Sample output (recent run)

=====================================================================
PARSING PERFORMANCE
=====================================================================
    Size | Library      |    Avg (s) |    Min (s) |  Speedup vs ET
----------------------------------------------------------------------
     100 | pygixml      |   0.000008 |   0.000008 |          14.4x
     100 | lxml         |   0.000094 |   0.000088 |           1.2x
     100 | elementtree  |   0.000112 |   0.000108 |           1.0x
----------------------------------------------------------------------
     500 | pygixml      |   0.000097 |   0.000096 |           5.8x
     500 | lxml         |   0.000394 |   0.000385 |           1.4x
     500 | elementtree  |   0.000558 |   0.000542 |           1.0x
----------------------------------------------------------------------
    1000 | pygixml      |   0.000147 |   0.000143 |           7.8x
    1000 | lxml         |   0.001127 |   0.001052 |           1.2x
    1000 | elementtree  |   0.001215 |   0.001200 |           1.0x
...

Bottom line: If you need blazing‑fast XML parsing with a minimal footprint and a clean, Pythonic API, give pygixml a try. 🚀

Benchmark Results

Elements	Library	Parse (s)	Serialize (s)	Speedup
1 000	elementtree	0.001146	0.001114	1.0×
----------	--------------	----------	---------------	--------
5 000	pygixml	0.000883	0.000880	8.6×
5 000	lxml	0.004108	0.003907	1.9×
5 000	elementtree	0.007614	0.006634	1.0×
----------	--------------	----------	---------------	--------
10 000	pygixml	0.001649	0.001635	9.8×
10 000	lxml	0.009095	0.008174	1.8×
10 000	elementtree	0.016108	0.013917	1.0×
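
The Speedup column is the ElementTree parse time divided by each library's parse time. A short check using the figures from the table above (the speedups helper is just a name chosen for this sketch):

```python
# Parse times in seconds, copied from the table above.
parse_times = {
    5_000: {"pygixml": 0.000883, "lxml": 0.004108, "elementtree": 0.007614},
    10_000: {"pygixml": 0.001649, "lxml": 0.009095, "elementtree": 0.016108},
}


def speedups(times: dict) -> dict:
    """Speed-up of each library relative to the ElementTree baseline, to one decimal."""
    baseline = times["elementtree"]
    return {lib: round(baseline / t, 1) for lib, t in times.items()}


for size, times in parse_times.items():
    print(size, speedups(times))
```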

Memory usage (tracemalloc peak)

Size (elements)	pygixml	lxml	ElementTree
1 000	0.13 MB	0.13 MB	1.01 MB
5 000	0.67 MB	0.67 MB	4.84 MB
10 000	1.34 MB	1.34 MB	9.68 MB
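
A tracemalloc peak can be captured along these lines. This is a minimal sketch with a hypothetical helper name (peak_traced_mb) and a made‑up test document, not the script from the repository.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def peak_traced_mb(parse, xml_text: str) -> float:
    """Return the tracemalloc peak in MB observed while parse(xml_text) runs."""
    tracemalloc.start()
    try:
        tree = parse(xml_text)  # hold a reference so the tree is alive at the peak
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)


xml_text = "<root>" + "".join(f"<item>{i}</item>" for i in range(1000)) + "</root>"
peak_mb = peak_traced_mb(ET.fromstring, xml_text)
print(f"ElementTree peak: {peak_mb:.2f} MB")
```

One thing to keep in mind when reading such numbers: tracemalloc only sees allocations made through Python's memory allocators. Memory that a C or C++ library allocates natively is not counted unless the extension reports it, so figures for natively backed parsers may understate their real footprint.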

XML isn’t going anywhere. The tools we use to process it matter more than the format itself. pygixml brings one of the fastest C++ XML parsers to Python with minimal overhead.

If you try it out, I’d love to hear about your use case. And if the project interests you, feel free to explore the links below.

GitHub: pygixml on GitHub
Documentation: pygixml on PyPI
Underlying library: pugixml

Have a different XML‑parsing strategy? Drop it in the comments — I’m curious to see what works for you!
