JSON Parsing for Large Payloads: Balancing Speed, Memory, and Scalability
Source: Dev.to
Introduction
Imagine that the marketing campaign you set up for Black Friday was a massive success, and customers start pouring into your website. Your Mixpanel setup, which would usually have around 1 000 customer events an hour, ends up receiving millions of events within the same hour. Consequently, your data pipeline is now tasked with parsing vast amounts of JSON data and storing it in your database.
Your standard JSON parsing library can’t keep up with the sudden data growth, and your near‑real‑time analytics reports fall behind. This is when you realize the importance of an efficient JSON parsing library. In addition to handling large payloads, a good library should be able to serialize and deserialize highly nested JSON structures.
In this article we explore Python parsing libraries for large payloads. We specifically look at the capabilities of ujson, orjson, and ijson, and we benchmark the standard library (json), ujson, and orjson for serialization and deserialization performance.
Serialization = converting Python objects to a JSON string.
Deserialization = rebuilding Python objects from a JSON string.
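As a quick illustration with the standard library (the event record here is made up), both directions look like this:

import json

record = {"event": "page_view", "user_id": 42}

# Serialization: Python dict -> JSON string
payload = json.dumps(record)

# Deserialization: JSON string -> Python dict
restored = json.loads(payload)
assert restored == record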
A decision‑flow diagram (shown later) helps you choose the right parser for your workflow. We also cover NDJSON and libraries that can parse NDJSON payloads. Let’s get started.
Stdlib json
The standard library supports serialization for all basic Python data types (dicts, lists, tuples, etc.). When you call json.loads(), the entire JSON document is loaded into memory at once. This works fine for small payloads, but for large payloads it can cause:
- Out‑of‑memory errors
- Choking of downstream workflows
import json
with open("large_payload.json", "r") as f:
    json_data = json.load(f)  # reads and parses the entire file into memory at once
ijson
For payloads in the hundreds of megabytes range, ijson (short for iterative json) reads files one token at a time, avoiding the memory overhead of loading the whole document.
import ijson
with open("json_data.json", "r") as f:
    # for a document like {"items": [...]}, yield one element of the array at a time
    for record in ijson.items(f, "items.item"):
        process(record)  # each record arrives as a fully built Python dict
ijson therefore streams each element, converts it to a Python dict, and hands it to your processing function (process(record)).
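If you need even finer control, ijson also exposes a lower-level event stream via ijson.parse(), which yields (prefix, event, value) tuples instead of fully built objects. A minimal sketch (the "email" field name is only an example of what you might filter on):

import ijson

with open("json_data.json", "r") as f:
    # parse() yields (prefix, event, value) tuples, one token at a time
    for prefix, event, value in ijson.parse(f):
        if prefix.endswith(".email"):  # react to individual fields without building whole dicts
            print(value)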

ujson

ujson has long been a popular choice for large JSON payloads because it is a C‑based implementation with Python bindings, making it considerably faster than the pure‑Python json module.
Note: The maintainers have placed ujson in maintenance-only mode, so new projects typically prefer orjson.
import ujson
taxonomy_data = (
    '{"id":1, "genus":"Thylacinus", "species":"cynocephalus", "extinct": true}'
)

# Deserialize
data_dict = ujson.loads(taxonomy_data)

# Serialize to a file
with open("taxonomy_data.json", "w") as fh:
    ujson.dump(data_dict, fh)

# Deserialize again from the file
with open("taxonomy_data.json", "r") as fh:
    data = ujson.load(fh)

print(data)
orjson
orjson is written in Rust, giving it both speed and memory‑safety guarantees that C‑based libraries (like ujson) lack. It also supports serializing additional Python types such as dataclass and datetime.
A key difference: orjson.dumps() returns bytes, whereas the other libraries return a string. Returning bytes eliminates an extra encoding step, contributing to orjson’s high throughput.
import json
import orjson
# Example payload
book_payload = (
    '{"id":1,"name":"The Great Gatsby","author":"F. Scott Fitzgerald"}'
)
# Serialize to bytes
json_bytes = orjson.dumps(json.loads(book_payload))
# Deserialize back to a Python object
obj = orjson.loads(json_bytes)
print(obj)
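orjson.dumps() also accepts an option bitmask for common tweaks; for example, OPT_INDENT_2 pretty-prints the output and OPT_SORT_KEYS sorts keys (both are orjson flags, combined with the bitwise OR operator):

import orjson

pretty = orjson.dumps(
    {"name": "The Great Gatsby", "id": 1},
    option=orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS,  # combine options with |
)
print(pretty.decode())  # dumps() returns bytes, so decode for display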
Decision Flow Diagram
Below is a simplified flow to help you pick the right parser:
Payload size?
- Fits comfortably in memory (up to roughly 100 MB) – use orjson for the best speed; the stdlib json is fine for small payloads.
- Larger than roughly 100 MB, or an unbounded stream – stream with ijson.
NDJSON (Newline‑Delimited JSON)
When dealing with log‑style data, NDJSON is often a better fit because each line is a valid JSON document. You can parse NDJSON with:
- Standard json – read line by line and call json.loads(line).
- orjson – fast line-by-line deserialization (orjson.loads(line)).
- ijson – also works (with multiple_values=True), but the line-by-line approach is usually simpler.
import orjson
with open("events.ndjson", "r") as f:
for line in f:
event = orjson.loads(line)
process(event)
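Writing NDJSON is just as simple: serialize each record and append a newline. Because orjson returns bytes, open the file in binary mode (a small sketch with made-up events):

import orjson

events = [
    {"id": 1, "type": "click"},
    {"id": 2, "type": "purchase"},
]

with open("events.ndjson", "wb") as f:  # binary mode because orjson.dumps() returns bytes
    for event in events:
        f.write(orjson.dumps(event) + b"\n")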
Summary
| Library | Language | Speed | Memory usage | Streaming support | Extra features |
|---|---|---|---|---|---|
| json (stdlib) | Python (C accelerator) | Baseline | High (loads whole doc) | No | None |
| ujson | C | Fast | High (loads whole doc) | No | Maintenance-only |
| orjson | Rust | Fastest | High (loads whole doc; bytes output saves one encoding copy) | No | dataclass, datetime, UUID, etc. |
| ijson | Python (C backend) | Moderate (streaming) | Very low | Yes | Event-based parsing |
For most new projects:
- Use orjson for speed and extra type support when the payload fits in memory.
- Switch to ijson for truly massive payloads or when you need to process data incrementally (see the size-based sketch below).
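To make that rule of thumb concrete, here is a hedged sketch of a small helper that picks a strategy based on file size. The 100 MB threshold and the function name iter_records are illustrative, not a hard rule:

import os
import orjson
import ijson

SIZE_THRESHOLD = 100 * 1024 * 1024  # ~100 MB; adjust for your memory budget

def iter_records(path):
    """Yield records from a top-level JSON array, streaming only when the file is large."""
    if os.path.getsize(path) < SIZE_THRESHOLD:
        with open(path, "rb") as f:
            yield from orjson.loads(f.read())  # whole document fits in memory
    else:
        with open(path, "rb") as f:
            yield from ijson.items(f, "item")  # stream one element at a time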
Happy parsing!
JSON Parsing and Serialization with json, ujson, and orjson
import json
import ujson
import orjson
# Sample JSON payload
book_payload = '{"Title":"The Great Gatsby","Author":"F. Scott Fitzgerald","Publishing House":"Charles Scribner\'s Sons"}'
# Deserialize with orjson
data_dict = orjson.loads(book_payload)
print(data_dict)
# Serialize to a file
with open("book_data.json", "wb") as f:
f.write(orjson.dumps(data_dict)) # Returns a bytes object
# Deserialize from the file
with open("book_data.json", "rb") as f:
book_data = orjson.loads(f.read())
print(book_data)
Testing Serialization Capabilities of json, ujson, and orjson
We create a sample dataclass object that contains an integer, a string, and a datetime value.
from dataclasses import dataclass
from datetime import datetime
@dataclass
class User:
    id: int
    name: str
    created: datetime

u = User(id=1, name="Thomas", created=datetime.now())
1. Standard Library json
import json
try:
    print("json:", json.dumps(u))
except TypeError as e:
    print("json error:", e)
Result: json raises a TypeError because it cannot serialize dataclass instances or datetime objects.
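The stdlib can still cope with a little help: convert the dataclass to a plain dict with dataclasses.asdict() and supply a default= hook for the datetime (both are standard-library features; this reuses the u instance defined above):

import json
from dataclasses import asdict

# asdict() flattens the dataclass; default= converts the remaining datetime
print(json.dumps(asdict(u), default=lambda o: o.isoformat()))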
2. ujson
import ujson
try:
    print("ujson:", ujson.dumps(u))
except TypeError as e:
    print("ujson error:", e)
Result: ujson also fails to serialize the dataclass and the datetime value.
3. orjson
import orjson
try:
    print("orjson:", orjson.dumps(u))
except TypeError as e:
    print("orjson error:", e)
Result: orjson successfully serializes both the dataclass and the datetime object.
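For types orjson does not handle natively (decimal.Decimal, for example), orjson.dumps() accepts a default= hook, much like the stdlib; a minimal sketch:

import orjson
from decimal import Decimal

price = {"amount": Decimal("19.99")}
# Decimal is not natively supported, so the default hook converts it to a string
print(orjson.dumps(price, default=str))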
Working with NDJSON (Newline‑Delimited JSON)
NDJSON is a format where each line is a separate JSON object, e.g.:
{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}
It is commonly used for logs and streaming data. Below are three approaches to handling NDJSON in Python.
NDJSON with the Standard Library json
import json
ndjson_payload = """{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}"""
# Write the payload to a file
with open("json_lib.ndjson", "w", encoding="utf-8") as fh:
for line in ndjson_payload.splitlines():
fh.write(line.strip() + "\n")
# Read and process line‑by‑line
with open("json_lib.ndjson", "r", encoding="utf-8") as fh:
for line in fh:
if line.strip(): # Skip empty lines
item = json.loads(line) # Deserialize
print(item) # Or pass to a caller function
NDJSON with ijson (streaming parser)
import ijson
ndjson_payload = """{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}"""
# Write the payload to a file
with open("ijson_lib.ndjson", "w", encoding="utf-8") as fh:
fh.write(ndjson_payload)
# Parse iteratively
with open("ijson_lib.ndjson", "r", encoding="utf-8") as fh:
for item in ijson.items(fh, "", multiple_values=True):
print(item)
Explanation: ijson.items(fh, "", multiple_values=True) treats each root element (each line) as a separate JSON object and yields them one at a time.
NDJSON with the Dedicated ndjson Library
import ndjson
ndjson_payload = """{"id": "A13434", "name": "Ella"}
{"id": "A13455", "name": "Charmont"}
{"id": "B32434", "name": "Areida"}"""
# Write the payload to a file
with open("ndjson_lib.ndjson", "w", encoding="utf-8") as fh:
fh.write(ndjson_payload)
# Load the file – returns a list of dictionaries
with open("ndjson_lib.ndjson", "r", encoding="utf-8") as fh:
ndjson_data = ndjson.load(fh)
print(ndjson_data)
Takeaways
- For small-to-moderate NDJSON payloads, the standard json module works fine when you read line by line.
- For very large payloads, ijson is the best choice because it streams data and uses minimal memory.
- If you need to generate NDJSON from Python objects, the ndjson library is convenient (ndjson.dumps() handles the conversion automatically, as shown below).
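For example, a minimal round trip with ndjson.dumps() (the record contents are made up):

import ndjson

records = [
    {"id": "A13434", "name": "Ella"},
    {"id": "A13455", "name": "Charmont"},
]

# dumps() emits one JSON document per line
text = ndjson.dumps(records)
print(text)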
Why ijson Is Not Included in Benchmarking
ijson is a streaming parser, fundamentally different from the bulk parsers (json, ujson, orjson) we benchmarked. Comparing a streaming parser with bulk parsers would be an “apples‑to‑oranges” comparison:
- Bulk parsers load the entire JSON document into memory, optimizing for speed.
- ijson processes the document incrementally, optimizing for memory efficiency.
Including ijson in a speed‑only benchmark would misleadingly label it as the slowest, ignoring its primary advantage—low memory consumption for massive JSON streams. Therefore, ijson is evaluated separately when memory usage is the primary concern.
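If you want to quantify that advantage yourself, tracemalloc from the standard library gives a rough picture of peak Python-level allocations. This is a sketch, not a rigorous benchmark (process-level RSS measured with a tool such as memory_profiler would be more complete), and it assumes the synthetic payload generated in the next section:

import tracemalloc
import json
import ijson

def peak_memory(fn):
    """Return the peak traced allocation (in MB) while fn() runs."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / (1024 * 1024)

def bulk_parse():
    with open("large_payload.json", "r") as f:
        json.load(f)  # whole document in memory

def stream_parse():
    with open("large_payload.json", "rb") as f:
        for _ in ijson.items(f, "item"):  # one array element at a time
            pass

print(f"json.load peak:   {peak_memory(bulk_parse):.1f} MB")
print(f"ijson.items peak: {peak_memory(stream_parse):.1f} MB")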
Generating a Synthetic JSON Payload for Benchmarking Purposes
We generate a large synthetic JSON payload containing 1 million records using the library mimesis. This data can be used to benchmark JSON libraries. The code below creates the payload; the resulting file is roughly 100 – 150 MB, which is large enough for meaningful performance tests.
from mimesis import Person, Address
import json
person_name = Person("en")
complete_address = Address("en")
with open("large_payload.json", "w") as fh: # Streaming to a file
fh.write("[") # JSON array start
for i in range(1_000_000):
payload = {
"id": person_name.identifier(),
"name": person_name.full_name(),
"email": person_name.email(),
"address": {
"street": complete_address.street_name(),
"city": complete_address.city(),
"postal_code": complete_address.postal_code()
}
}
json.dump(payload, fh)
# Add a comma after every element except the last one
if i < 999_999:
fh.write(",")
fh.write("]") # JSON array end
Sample Output
[
{
"id": "8177",
"name": "Willia Hays",
"email": "showers1819@yandex.com",
"address": {
"street": "Emerald Cove",
"city": "Crown Point",
"postal_code": "58293"
}
},
{
"id": "5931",
"name": "Quinn Greer",
"email": "professional2038@outlook.com",
"address": {
"street": "Ohlone",
"city": "Bridgeport",
"postal_code": "92982"
}
}
]
Let’s Start with Benchmarking
Benchmarking Prerequisites
We read the JSON file into a string and then use each library’s loads() function to deserialize it.
with open("large_payload1.json", "r") as fh:
payload_str = fh.read() # raw JSON text
A helper function runs a given loads implementation three times and returns the total elapsed time.
import time
def benchmark_load(func, payload_str):
    start = time.perf_counter()
    for _ in range(3):
        func(payload_str)
    end = time.perf_counter()
    return end - start
Benchmarking Deserialization Speed
import json, ujson, orjson
results = {
    "json.loads": benchmark_load(json.loads, payload_str),
    "ujson.loads": benchmark_load(ujson.loads, payload_str),
    "orjson.loads": benchmark_load(orjson.loads, payload_str),
}

for lib, t in results.items():
    print(f"{lib}: {t:.4f} seconds")
Result: orjson is the fastest for deserialization.
Benchmarking Serialization Speed
import json, ujson, orjson
def benchmark_dump(func, obj):
    start = time.perf_counter()
    for _ in range(3):
        func(obj)
    end = time.perf_counter()
    return end - start

# Example object (already loaded)
example_obj = json.loads(payload_str)

ser_results = {
    "json.dumps": benchmark_dump(json.dumps, example_obj),
    "ujson.dumps": benchmark_dump(ujson.dumps, example_obj),
    "orjson.dumps": benchmark_dump(orjson.dumps, example_obj),
}

for lib, t in ser_results.items():
    print(f"{lib}: {t:.4f} seconds")