How to Parse XML Fast in 2026 (Python)

Published: (April 8, 2026 at 07:53 PM EDT)
6 min read
Source: Dev.to

JSON won the internet. We all know that. But XML never left — it just moved.
Reliability matters more than trendiness.

If you work with Maven configs, Android manifests, Office Open XML (.docx/.xlsx), or Python’s standard library (xml.etree.ElementTree), you already know XML works.
The usual answer for speed is lxml, which wraps libxml2 in C.

What if you need the fastest possible parse with a tiny footprint?
That question led me to build pygixml – a Python wrapper around pugixml, one of the fastest XML parsers.


Benchmarks

Parse time (5 000‑element document)

Library      | Parse time | Speed-up vs. ElementTree
----------------------------------------------------
pygixml      | 0.0009 s   | 8.6× faster
lxml         | 0.0041 s   | 1.9× faster
ElementTree  | 0.0076 s   | 1.0× (baseline)

Memory usage (same parse)

Library      | Peak memory
--------------------------
pygixml      | 0.67 MB
lxml         | 0.67 MB
ElementTree  | 4.84 MB

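ElementTree's measured peak is ~7× higher because it materializes every node as a Python object. Keep in mind that tracemalloc only tracks allocations made through Python's memory allocators, so memory that pygixml and lxml allocate natively in C/C++ may not be fully reflected in these figures.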
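For readers who want to reproduce this kind of measurement, here is a minimal sketch of a peak-memory probe built on the standard library's tracemalloc. The helper names (make_xml, peak_python_memory_mb) and the shape of the synthetic document are assumptions for illustration, not the project's actual benchmark code, and only ElementTree is measured so the snippet runs without extra installs.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def make_xml(n_items: int) -> str:
    """Build a synthetic document with n_items <item> children (hypothetical shape)."""
    items = "".join(
        f'<item id="{i}"><name>Item {i}</name></item>' for i in range(n_items)
    )
    return f"<root>{items}</root>"


def peak_python_memory_mb(parse, xml: str) -> float:
    """Return the peak Python-level allocation (in MB) while parse(xml) runs."""
    tracemalloc.start()
    try:
        tree = parse(xml)  # keep a reference so the tree is alive at the peak
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / (1024 * 1024)


xml_doc = make_xml(5000)
et_peak = peak_python_memory_mb(ET.fromstring, xml_doc)
print(f"ElementTree peak: {et_peak:.2f} MB")
```

Passing a different parse callable (for example pygixml.parse_string) should give the comparable figure for another library, subject to the tracemalloc limitation on native allocations.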

Package size

Package  | Size
------------------
pygixml  | 0.43 MB
lxml     | 5.48 MB

That is a difference of more than 12× (5.48 / 0.43 ≈ 12.7), which is a big win for Docker images, AWS Lambda layers, etc.
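One way to check an installed package's on-disk footprint yourself is to sum the file sizes under its import location. This is a rough sketch: the helper name package_size_mb is hypothetical, and it is an assumption that "package size" here means the installed directory size (the article's figures may have been measured differently, for example from the wheel).

```python
import importlib.util
from pathlib import Path


def package_size_mb(package_name: str) -> float:
    """Sum the sizes of all files belonging to an importable package, in MB."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        raise ValueError(f"{package_name!r} is not installed")

    if spec.submodule_search_locations:
        # Regular package: walk every file under the package directory.
        root = Path(next(iter(spec.submodule_search_locations)))
        total = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    elif spec.origin and Path(spec.origin).is_file():
        # Single-file module or extension: just that file.
        total = Path(spec.origin).stat().st_size
    else:
        raise ValueError(f"cannot locate files for {package_name!r}")

    return total / (1024 * 1024)


# Demonstrated on a standard-library package so the snippet runs anywhere;
# swap in "pygixml" or "lxml" once they are installed.
json_size = package_size_mb("json")
print(f"json: {json_size:.2f} MB")
```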

All benchmarks were run on the same machine using time.perf_counter() across 5 runs (see the benchmarks/ directory).
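The methodology described above can be sketched roughly as follows. This is not the script from the benchmarks/ directory; the helper name time_parse and the synthetic document are assumptions for illustration, and ElementTree is used as the parser so the snippet runs with the standard library alone.

```python
import statistics
import time
import xml.etree.ElementTree as ET


def time_parse(parse, xml: str, runs: int = 5) -> tuple[float, float]:
    """Return (average, minimum) wall-clock seconds over `runs` calls to parse(xml)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        parse(xml)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), min(samples)


# Hypothetical 5 000-element document.
xml_doc = "<root>" + "".join(f"<item id='{i}'>v{i}</item>" for i in range(5000)) + "</root>"

avg, best = time_parse(ET.fromstring, xml_doc)
print(f"ElementTree: avg {avg:.6f} s, min {best:.6f} s")
```

Reporting the minimum alongside the average helps because the minimum is less sensitive to background noise on the machine.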


Why pygixml is fast

  1. No Python object per node – the whole tree lives in C++ memory.
  2. Lazy Python wrappers – a wrapper is created only when you explicitly access a node.
  3. Zero‑copy Cython bridge – data isn’t copied between C++ and Python; strings are encoded in‑place.
  4. pugixml’s custom allocator – a block‑based memory pool reduces syscalls and improves cache locality.
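To make points 1 and 2 concrete, here is a toy, pure-Python model of the idea: node data sits in a compact store, and a wrapper object is only created when a node is actually accessed. This is purely an illustration under that assumption; FlatTree and NodeView are hypothetical names and have nothing to do with pygixml's real Cython internals.

```python
class FlatTree:
    """Toy stand-in for a native tree: node data lives in parallel lists."""

    def __init__(self) -> None:
        self.names: list[str] = []
        self.parents: list[int] = []
        self.wrappers_created = 0  # counts NodeView objects handed out

    def add(self, name: str, parent: int = -1) -> int:
        """Store a node and return its index; no wrapper object is created."""
        self.names.append(name)
        self.parents.append(parent)
        return len(self.names) - 1

    def view(self, index: int) -> "NodeView":
        """Create a lightweight wrapper only when the caller asks for one."""
        self.wrappers_created += 1
        return NodeView(self, index)


class NodeView:
    """Thin handle: a reference to the tree plus an index, nothing else."""

    __slots__ = ("_tree", "_index")

    def __init__(self, tree: FlatTree, index: int) -> None:
        self._tree = tree
        self._index = index

    @property
    def name(self) -> str:
        return self._tree.names[self._index]

    def child(self, name: str) -> "NodeView | None":
        """Return a view of the first child with the given name, if any."""
        tree = self._tree
        for idx, (node_name, parent) in enumerate(zip(tree.names, tree.parents)):
            if parent == self._index and node_name == name:
                return tree.view(idx)
        return None


tree = FlatTree()
root_idx = tree.add("library")
for _ in range(1000):
    tree.add("book", parent=root_idx)

root = tree.view(root_idx)
first_book = root.child("book")
print(first_book.name, tree.wrappers_created)
```

Although the tree holds 1 001 nodes, only the two nodes that were touched ever received a wrapper object.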

Installation

pip install pygixml   # dependency-free install, ~430 KB

Quick start

import pygixml

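# Sample document (the root element name is illustrative)
xml = """
<library>
    <book id="1">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
    <book id="2">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
</library>
"""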

doc  = pygixml.parse_string(xml)
root = doc.root

# Access children
book = root.child("book")
print(book.name)                      # book
print(book.attribute("id").value)     # 1
print(book.child("title").text())     # The Great Gatsby
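For comparison, the same lookups with the standard library's xml.etree.ElementTree look roughly like this. The sample document is an assumption (in particular the root element name library is made up for illustration); only the book, title, author, and year names and the id attribute come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the root element name is an assumption.
xml_doc = """
<library>
    <book id="1">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
    <book id="2">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
</library>
"""

et_root = ET.fromstring(xml_doc)

# Access children
et_book = et_root.find("book")       # first <book> child
print(et_book.tag)                   # element name
print(et_book.get("id"))             # attribute value
print(et_book.findtext("title"))     # text of the <title> child
```

The access pattern is similar; the difference the article measures is in how the tree is stored, not in how it is queried.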

API philosophy

  • Simple properties for common access: node.name, node.value, node.type.
  • Methods for hierarchical operations: node.child(name), node.text().
  • No surprises – the API stays intentionally minimal.

XPath support

# All fiction books
fiction = root.select_nodes("book[@category='fiction']")
print(f"Found {len(fiction)} fiction books")

# Single match
match = root.select_node("book[@id='2']")
if match:
    print(match.node.child("title").text())   # 1984

# Pre‑compile for repeated use
query = pygixml.XPathQuery("book[year > 1950]")
recent = query.evaluate_node_set(root)

# Scalar evaluations
avg = pygixml.XPathQuery(
    "sum(book/price) div count(book)"
).evaluate_number(root)
print(f"Average price: ${avg:.2f}")

has_orwell = pygixml.XPathQuery(
    "book[author='George Orwell']"
).evaluate_boolean(root)
print(f"Has Orwell: {has_orwell}")
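To show what "limited" XPath means in practice, here is a sketch of the same queries using only xml.etree.ElementTree. Attribute and child-text predicates work, but numeric comparisons and functions such as sum() do not, so those parts fall back to plain Python. The document below is hypothetical: the category and price values and the third book are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog; category, price, and the third book are assumptions.
xml_doc = """
<library>
    <book id="1" category="fiction">
        <title>The Great Gatsby</title><author>F. Scott Fitzgerald</author>
        <year>1925</year><price>10.00</price>
    </book>
    <book id="2" category="fiction">
        <title>1984</title><author>George Orwell</author>
        <year>1949</year><price>8.00</price>
    </book>
    <book id="3" category="science">
        <title>Silent Spring</title><author>Rachel Carson</author>
        <year>1962</year><price>12.00</price>
    </book>
</library>
"""
et_root = ET.fromstring(xml_doc)

# Attribute predicates are supported by ElementTree.
fiction = et_root.findall("book[@category='fiction']")
match = et_root.find("book[@id='2']")

# Child-text equality is supported too.
has_orwell = et_root.find("book[author='George Orwell']") is not None

# Numeric comparison is NOT supported, so filter in Python instead.
recent = [b for b in et_root.findall("book") if int(b.findtext("year")) > 1950]

# XPath functions such as sum() and count() are not available either.
prices = [float(b.findtext("price")) for b in et_root.findall("book")]
avg_price = sum(prices) / len(prices)

print(len(fiction), match.findtext("title"), has_orwell)
print([b.findtext("title") for b in recent], f"{avg_price:.2f}")
```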

Building XML documents

doc  = pygixml.XMLDocument()
root = doc.append_child("catalog")
item = root.append_child("product")
item.append_child("name").set_value("Laptop")
item.append_child("price").set_value("999.99")

doc.save_file("catalog.xml")

Modifying an existing document

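# The root element name is illustrative
doc  = pygixml.parse_string("<person><name>John</name></person>")
root = doc.root

root.child("name").set_value("Jane")
root.child("name").name = "full_name"
root.append_child("age").set_value("30")

print(root.xml)
# <person>
#   <full_name>Jane</full_name>
#   <age>30</age>
# </person>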

Parse flags – fine‑grained control

The ParseFlags enum provides 18 options to tailor parsing exactly to your needs.

# Fastest possible parse – skip everything optional
doc = pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)

# Pick exactly what you need (e.g., keep comments and CDATA)
flags = pygixml.ParseFlags.COMMENTS | pygixml.ParseFlags.CDATA
doc = pygixml.parse_string(xml, flags)

ParseFlags.MINIMAL skips escape processing, EOL normalization, and entity expansion (&amp;, &lt;, …), giving a noticeable speed boost.
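If bit-flag composition is unfamiliar, the following toy model shows how combining options with | works. ToyParseFlags is a hypothetical stand-in built on enum.IntFlag; the member names and numeric values are assumptions loosely modelled on pugixml's C++ parse_* constants and are not pygixml's actual enum.

```python
from enum import IntFlag


class ToyParseFlags(IntFlag):
    """Hypothetical flag set; names and values are illustrative only."""

    MINIMAL = 0x0000   # nothing optional enabled
    COMMENTS = 0x0002  # keep comment nodes
    CDATA = 0x0004     # keep CDATA sections
    ESCAPES = 0x0010   # expand character and entity references
    EOL = 0x0020       # normalize line endings


# Each option is one bit, so OR-ing options builds a set of enabled features.
flags = ToyParseFlags.COMMENTS | ToyParseFlags.CDATA

print(int(flags))                      # combined bit pattern as an integer
print(ToyParseFlags.COMMENTS in flags)
print(ToyParseFlags.ESCAPES in flags)
```

Under this model, MINIMAL is simply the empty set of options, which is why starting from it and adding only what you need keeps per-character work to a minimum.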


Feature comparison

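Feature            | pygixml | lxml             | ElementTree
---------------------------------------------------------------
Parse speed        | Fastest | Fast             | Slowest
Memory             | Low     | Low              | High (≈ 7×)
Package size       | 0.43 MB | 5.48 MB          | Built-in
XPath              | 1.0     | 1.0              | Limited
XSLT               | No      | Yes              | No
Schema validation  | No      | Yes              | No
Dependencies       | None    | libxml2, libxslt | None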

Running the benchmarks yourself

git clone https://github.com/MohammadRaziei/pygixml.git
cd pygixml

The project uses CMake; benchmarks are built‑in targets:

# Full suite: parsing (6 sizes), memory, package size
cmake --build build --target run_full_benchmarks

# Legacy parsing‑only benchmark
cmake --build build --target run_benchmarks

# Or directly with Python
python benchmarks/full_benchmark.py

Sample output (recent run)

=====================================================================
PARSING PERFORMANCE
=====================================================================
    Size | Library      |    Avg (s) |    Min (s) |  Speedup vs ET
----------------------------------------------------------------------
     100 | pygixml      |   0.000008 |   0.000008 |          14.4x
     100 | lxml         |   0.000094 |   0.000088 |           1.2x
     100 | elementtree  |   0.000112 |   0.000108 |           1.0x
----------------------------------------------------------------------
     500 | pygixml      |   0.000097 |   0.000096 |           5.8x
     500 | lxml         |   0.000394 |   0.000385 |           1.4x
     500 | elementtree  |   0.000558 |   0.000542 |           1.0x
----------------------------------------------------------------------
    1000 | pygixml      |   0.000147 |   0.000143 |           7.8x
    1000 | lxml         |   0.001127 |   0.001052 |           1.2x
    1000 | elementtree  |   0.001215 |   0.001200 |           1.0x
...

Bottom line: If you need blazing‑fast XML parsing with a minimal footprint and a clean, Pythonic API, give pygixml a try. 🚀

Benchmark Results

Elements | Library      | Parse (s) | Serialize (s) | Speedup
---------------------------------------------------------------
   1 000 | elementtree  | 0.001146  | 0.001114      | 1.0×
---------------------------------------------------------------
   5 000 | pygixml      | 0.000883  | 0.000880      | 8.6×
   5 000 | lxml         | 0.004108  | 0.003907      | 1.9×
   5 000 | elementtree  | 0.007614  | 0.006634      | 1.0×
---------------------------------------------------------------
  10 000 | pygixml      | 0.001649  | 0.001635      | 9.8×
  10 000 | lxml         | 0.009095  | 0.008174      | 1.8×
  10 000 | elementtree  | 0.016108  | 0.013917      | 1.0×
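The Speedup column appears to be the ElementTree parse time divided by each library's parse time. A quick check of that arithmetic, using the parse times from the table (the dictionary layout and the speedups helper are just for illustration):

```python
# Parse times in seconds, copied from the results table.
parse_times = {
    5000: {"pygixml": 0.000883, "lxml": 0.004108, "elementtree": 0.007614},
    10000: {"pygixml": 0.001649, "lxml": 0.009095, "elementtree": 0.016108},
}


def speedups(times: dict[str, float]) -> dict[str, float]:
    """Speedup of each library relative to the ElementTree baseline."""
    baseline = times["elementtree"]
    return {lib: baseline / t for lib, t in times.items()}


for size, times in parse_times.items():
    for lib, factor in speedups(times).items():
        print(f"{size:>6} | {lib:<12} | {factor:.1f}x")
```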

Memory usage (tracemalloc peak)

Size (elements)pygixmllxmlElementTree
1 0000.13 MB0.13 MB1.01 MB
5 0000.67 MB0.67 MB4.84 MB
10 0001.34 MB1.34 MB9.68 MB

XML isn’t going anywhere. The tools we use to process it matter more than the format itself. pygixml brings one of the fastest C++ XML parsers to Python with minimal overhead.

If you try it out, I’d love to hear about your use case. And if the project interests you, feel free to explore the links below.

Have a different XML‑parsing strategy? Drop it in the comments — I’m curious to see what works for you!
