How to Parse XML Fast in 2026 (Python)
Source: Dev.to
JSON won the internet. We all know that. But XML never left — it just moved.
Reliability matters more than trendiness.
If you work with Maven configs, Android manifests, Office Open XML (.docx/.xlsx), or Python’s standard library (xml.etree.ElementTree), you already know XML works.
The usual answer for speed is lxml, a Cython binding to the C library libxml2.
What if you need the fastest possible parse with a tiny footprint?
That question led me to build pygixml – a Python wrapper around pugixml, one of the fastest XML parsers.
Benchmarks
Parse time (5 000‑element document)
| Library | Parse time | Speed‑up vs. ElementTree |
|---|---|---|
| pygixml | 0.0009 s | 8.6× faster |
| lxml | 0.0041 s | 1.9× faster |
| ElementTree | 0.0076 s | 1.0× (baseline) |
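The speed-up column is the ElementTree time divided by each library's time. As a quick check, the sketch below recomputes it from the unrounded 5 000-element timings listed in the full results table later in this post; the rounded times shown in the table above would give slightly different ratios.

```python
# Unrounded 5 000-element parse times in seconds, copied from the full results table.
parse_times = {
    "pygixml": 0.000883,
    "lxml": 0.004108,
    "elementtree": 0.007614,
}

baseline = parse_times["elementtree"]

# Speed-up = baseline time / library time, so the baseline itself comes out at 1.0.
speedups = {name: baseline / seconds for name, seconds in parse_times.items()}

for name, factor in speedups.items():
    print(f"{name:12s} {factor:.1f}x")
```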
Memory usage (same parse)
| Library | Peak memory |
|---|---|
| pygixml | 0.67 MB |
| lxml | 0.67 MB |
| ElementTree | 4.84 MB |
ElementTree uses ~7× more memory because it materializes every node as a Python object.
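For readers who want to reproduce this kind of number, here is a minimal sketch of a peak-memory measurement with tracemalloc, shown for ElementTree only. The `make_xml` helper and its document shape are hypothetical test data, not the project's benchmark input. Keep in mind that tracemalloc only sees allocations made through Python's allocators, so memory held natively by a C or C++ parser may not show up in figures gathered this way.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def make_xml(n_items):
    """Build a synthetic document with n_items <item> elements (hypothetical data)."""
    rows = "".join(
        f'<item id="{i}"><name>item {i}</name></item>' for i in range(n_items)
    )
    return f"<root>{rows}</root>"


def parse_with_peak(xml_text):
    """Parse xml_text and return (peak traced memory in MB, parsed root element)."""
    tracemalloc.start()
    try:
        root = ET.fromstring(xml_text)  # keep a reference so the tree stays alive
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak_bytes / (1024 * 1024), root


peak_mb, root = parse_with_peak(make_xml(5000))
print(f"ElementTree peak: {peak_mb:.2f} MB for {len(root)} elements")
```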
Package size
| Package | Size |
|---|---|
| pygixml | 0.43 MB |
| lxml | 5.48 MB |
That is roughly a 12.7× difference (5.48 MB vs. 0.43 MB) – a big win for Docker images, AWS Lambda layers, and other size‑sensitive deployments.
All benchmarks were run on the same machine using time.perf_counter() across 5 runs (see the benchmarks/ directory).
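The methodology described above can be sketched roughly as follows. This is not the project's benchmark script: `make_xml` and `time_parse` are hypothetical helpers, and only the stdlib parser is timed so the snippet runs anywhere. To compare other parsers, pass a different callable, for example `pygixml.parse_string` or `lxml.etree.fromstring`.

```python
import statistics
import time
import xml.etree.ElementTree as ET


def make_xml(n_items):
    """Build a synthetic document with n_items <item> elements (hypothetical data)."""
    rows = "".join(
        f'<item id="{i}"><name>item {i}</name></item>' for i in range(n_items)
    )
    return f"<root>{rows}</root>"


def time_parse(parse, xml_text, runs=5):
    """Call parse(xml_text) `runs` times and return (average, minimum) in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        parse(xml_text)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), min(samples)


xml_text = make_xml(5000)
avg, best = time_parse(ET.fromstring, xml_text)
print(f"elementtree  avg={avg:.6f}s  min={best:.6f}s")
```

Reporting the minimum alongside the average helps separate the parser's own cost from scheduling noise on the machine.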
Why pygixml is fast
- No Python object per node – the whole tree lives in C++ memory.
- Lazy Python wrappers – a wrapper is created only when you explicitly access a node.
- Zero‑copy Cython bridge – data isn’t copied between C++ and Python; strings are encoded in‑place.
- pugixml’s custom allocator – a block‑based memory pool reduces syscalls and improves cache locality.
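To make the "lazy wrapper" idea concrete, here is a toy, pure-Python illustration. It is not pygixml's implementation: `FlatTree` stands in for the native C++ tree, and `NodeView` is a hypothetical handle type. The point is only that a 2 001-node tree can be navigated while creating just the handful of wrapper objects that were actually requested.

```python
class FlatTree:
    """Toy stand-in for a native tree: nodes live in parallel lists, not as objects."""

    def __init__(self):
        self.names = []     # element name per node
        self.texts = []     # text content per node
        self.children = []  # child indices per node

    def add(self, name, text="", parent=None):
        index = len(self.names)
        self.names.append(name)
        self.texts.append(text)
        self.children.append([])
        if parent is not None:
            self.children[parent].append(index)
        return index


class NodeView:
    """Lightweight handle that is built only when a node is accessed."""

    __slots__ = ("_tree", "_index")
    created = 0  # class-level counter of wrappers built so far

    def __init__(self, tree, index):
        self._tree = tree
        self._index = index
        NodeView.created += 1

    @property
    def name(self):
        return self._tree.names[self._index]

    def text(self):
        return self._tree.texts[self._index]

    def child(self, name):
        """Return a view of the first child called `name`, or None."""
        for i in self._tree.children[self._index]:
            if self._tree.names[i] == name:
                return NodeView(self._tree, i)
        return None


tree = FlatTree()
root_index = tree.add("library")
for n in range(1000):
    book_index = tree.add("book", parent=root_index)
    tree.add("title", text=f"Book {n}", parent=book_index)

root = NodeView(tree, root_index)
title = root.child("book").child("title")
print(title.text(), "| nodes:", len(tree.names), "| wrappers:", NodeView.created)
```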
Installation
pip install pygixml # dependency‑free install, ~430 KB
Quick start
import pygixml
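# Markup reconstructed from the accessors used below; the <library> root name is an assumption.
xml = """
<library>
    <book id="1">
        <title>The Great Gatsby</title>
        <author>F. Scott Fitzgerald</author>
        <year>1925</year>
    </book>
    <book id="2">
        <title>1984</title>
        <author>George Orwell</author>
        <year>1949</year>
    </book>
</library>
"""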
doc = pygixml.parse_string(xml)
root = doc.root
# Access children
book = root.child("book")
print(book.name) # book
print(book.attribute("id").value) # 1
print(book.child("title").text()) # The Great Gatsby
API philosophy
- Simple properties for common access: node.name, node.value, node.type.
- Methods for hierarchical operations: node.child(name), node.text().
- No surprises – the API stays intentionally minimal.
XPath support
# All fiction books
fiction = root.select_nodes("book[@category='fiction']")
print(f"Found {len(fiction)} fiction books")
# Single match
match = root.select_node("book[@id='2']")
if match:
    print(match.node.child("title").text()) # 1984
# Pre‑compile for repeated use
query = pygixml.XPathQuery("book[year > 1950]")
recent = query.evaluate_node_set(root)
# Scalar evaluations
avg = pygixml.XPathQuery(
"sum(book/price) div count(book)"
).evaluate_number(root)
print(f"Average price: ${avg:.2f}")
has_orwell = pygixml.XPathQuery(
"book[author='George Orwell']"
).evaluate_boolean(root)
print(f"Has Orwell: {has_orwell}")
Building XML documents
doc = pygixml.XMLDocument()
root = doc.append_child("catalog")
item = root.append_child("product")
item.append_child("name").set_value("Laptop")
item.append_child("price").set_value("999.99")
doc.save_file("catalog.xml")
Modifying an existing document
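# Markup reconstructed from the calls below; the <person> root name is an assumption.
doc = pygixml.parse_string("<person><name>John</name></person>")
root = doc.root
root.child("name").set_value("Jane")
root.child("name").name = "full_name"
root.append_child("age").set_value("30")
print(root.xml)
# <person>
#   <full_name>Jane</full_name>
#   <age>30</age>
# </person>
Parse flags – fine‑grained control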
The ParseFlags enum provides 18 options to tailor parsing to your needs.
# Fastest possible parse – skip everything optional
doc = pygixml.parse_string(xml, pygixml.ParseFlags.MINIMAL)
# Pick exactly what you need (e.g., keep comments and CDATA)
flags = pygixml.ParseFlags.COMMENTS | pygixml.ParseFlags.CDATA
doc = pygixml.parse_string(xml, flags)
ParseFlags.MINIMAL skips escape processing, EOL normalization, and entity expansion (&amp;, &lt;, …), giving a noticeable speed boost.
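Combining flags with `|` works the way bit-flag enums usually do. The sketch below uses the standard library's `enum.IntFlag` as a stand-in, assuming `ParseFlags` behaves like an ordinary bit-flag enum. `DemoParseFlags` is hypothetical: only the names MINIMAL, COMMENTS, and CDATA come from this post, while ESCAPES, EOL, and all numeric values are made up for illustration.

```python
from enum import IntFlag


class DemoParseFlags(IntFlag):
    """Hypothetical stand-in for a bit-flag enum; the values are arbitrary."""

    MINIMAL = 0   # no optional processing enabled
    COMMENTS = 1
    CDATA = 2
    ESCAPES = 4   # hypothetical member, not taken from pygixml
    EOL = 8       # hypothetical member, not taken from pygixml


# Each option owns one bit, so | switches on exactly the chosen features.
flags = DemoParseFlags.COMMENTS | DemoParseFlags.CDATA

print(DemoParseFlags.CDATA in flags)    # membership test for an enabled option
print(DemoParseFlags.ESCAPES in flags)  # an option that was not requested
print(int(flags))                       # the combined bit pattern as an integer
```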
Feature comparison
| Feature | pygixml | lxml | ElementTree |
|---|---|---|---|
| Parse speed | Fastest | Fast | Slowest |
| Memory | Low | Low | High (≈ 7×) |
| Package size | 0.43 MB | 5.48 MB | Built‑in |
| XPath | 1.0 | 1.0 | Limited subset |
| XSLT | No | Yes | No |
| Schema validation | No | Yes | No |
| Dependencies | None | libxml2, libxslt | None |
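To show what "Limited" means for ElementTree in the XPath row, here is a small sketch using only the standard library. The document and its `<library>` root are hypothetical sample data. Attribute predicates are part of ElementTree's supported subset, while a numeric comparison such as `book[year > 1930]` is rejected, so that filter has to be written in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document for illustration.
root = ET.fromstring(
    "<library>"
    '<book id="1"><title>The Great Gatsby</title><year>1925</year></book>'
    '<book id="2"><title>1984</title><year>1949</year></book>'
    "</library>"
)

# Attribute predicates are supported by ElementTree's XPath subset.
match = root.find("book[@id='2']")
print(match.findtext("title"))

# Numeric comparisons are not; ElementTree raises SyntaxError for this predicate.
try:
    root.findall("book[year > 1930]")
    comparison_supported = True
except SyntaxError:
    comparison_supported = False

# Fallback: apply the comparison in plain Python.
recent = [
    book.findtext("title")
    for book in root.iter("book")
    if int(book.findtext("year")) > 1930
]
print(comparison_supported, recent)
```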
Running the benchmarks yourself
git clone https://github.com/MohammadRaziei/pygixml.git
cd pygixml
The project uses CMake; benchmarks are built‑in targets:
# Full suite: parsing (6 sizes), memory, package size
cmake --build build --target run_full_benchmarks
# Legacy parsing‑only benchmark
cmake --build build --target run_benchmarks
# Or directly with Python
python benchmarks/full_benchmark.py
Sample output (recent run)
=====================================================================
PARSING PERFORMANCE
=====================================================================
Size | Library | Avg (s) | Min (s) | Speedup vs ET
----------------------------------------------------------------------
100 | pygixml | 0.000008 | 0.000008 | 14.4x
100 | lxml | 0.000094 | 0.000088 | 1.2x
100 | elementtree | 0.000112 | 0.000108 | 1.0x
----------------------------------------------------------------------
500 | pygixml | 0.000097 | 0.000096 | 5.8x
500 | lxml | 0.000394 | 0.000385 | 1.4x
500 | elementtree | 0.000558 | 0.000542 | 1.0x
----------------------------------------------------------------------
1000 | pygixml | 0.000147 | 0.000143 | 7.8x
1000 | lxml | 0.001127 | 0.001052 | 1.2x
1000 | elementtree | 0.001215 | 0.001200 | 1.0x
...
Bottom line: If you need blazing‑fast XML parsing with a minimal footprint and a clean, Pythonic API, give pygixml a try. 🚀
Benchmark Results
| Elements | Library | Parse (s) | Serialize (s) | Speedup |
|---|---|---|---|---|
| 1 000 | elementtree | 0.001146 | 0.001114 | 1.0× |
| 5 000 | pygixml | 0.000883 | 0.000880 | 8.6× |
| 5 000 | lxml | 0.004108 | 0.003907 | 1.9× |
| 5 000 | elementtree | 0.007614 | 0.006634 | 1.0× |
| 10 000 | pygixml | 0.001649 | 0.001635 | 9.8× |
| 10 000 | lxml | 0.009095 | 0.008174 | 1.8× |
| 10 000 | elementtree | 0.016108 | 0.013917 | 1.0× |
Memory usage (tracemalloc peak)
| Size (elements) | pygixml | lxml | ElementTree |
|---|---|---|---|
| 1 000 | 0.13 MB | 0.13 MB | 1.01 MB |
| 5 000 | 0.67 MB | 0.67 MB | 4.84 MB |
| 10 000 | 1.34 MB | 1.34 MB | 9.68 MB |
XML isn’t going anywhere. The tools we use to process it matter more than the format itself. pygixml brings one of the fastest C++ XML parsers to Python with minimal overhead.
If you try it out, I’d love to hear about your use case. And if the project interests you, feel free to explore the links below.
- GitHub: https://github.com/MohammadRaziei/pygixml
- Documentation: pygixml on PyPI
- Underlying library: pugixml
Have a different XML‑parsing strategy? Drop it in the comments — I’m curious to see what works for you!