Three memory-leak patterns in long-running scrapers (and how I caught them after 968 Trustpilot runs)
Source: Dev.to
Introduction
Memory leaks in scrapers do not crash the run.
They quietly bump the Apify memory limit from 1 GB → 2 GB → 4 GB, double the per‑run cost, and are often only spotted weeks later on a compute‑unit invoice.
After 968 Trustpilot runs (~80–300 review pages each, ~150 k page hits cumulative) I started sampling RSS every 1 000 pages. The growth pattern told a different story than the logs. Below are the three patterns that account for ~90 % of the leaks I have seen across my 32 published Apify actors.
1. The unbounded asyncio queue
The most common pattern. A producer coroutine fetches URLs faster than the consumer parses them, so the in‑memory queue grows linearly with runtime.
# leaks at high concurrency
import asyncio

queue = asyncio.Queue()  # no maxsize, so the queue is unbounded

async def producer():
    async for url in source:  # source: async iterator of URLs, defined elsewhere
        await queue.put(url)  # never blocks on an unbounded queue

async def consumer():
    while True:
        url = await queue.get()
        await process(url)  # slower than source
If process() is slower than source (true for most JS‑rendered sites), the queue accumulates. On a Trustpilot run that fetched a company with 12 000 reviews, the queue held ~9 500 URLs at peak — about 380 MB of byte strings.
Fix
queue = asyncio.Queue(maxsize=200) # producer blocks at 200
A bounded queue forces the producer to wait. Memory stays flat; throughput drops modestly.
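To make the backpressure effect concrete, here is a minimal, self-contained sketch that measures peak queue depth with and without a bound. The function name run_pipeline, the example.com URLs, and the asyncio.sleep(0) stand-in for parsing are all illustrative assumptions, not code from the Trustpilot actor.

```python
import asyncio


async def run_pipeline(maxsize: int, n_urls: int = 500) -> int:
    """Return the peak queue depth seen while a fast producer feeds a slow consumer."""
    # maxsize=0 means "unbounded" for asyncio.Queue.
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    peak = 0

    async def producer() -> None:
        nonlocal peak
        for i in range(n_urls):
            # Hypothetical URL. put() only waits when a bounded queue is full.
            await queue.put(f"https://example.com/page/{i}")
            peak = max(peak, queue.qsize())

    async def consumer() -> None:
        while True:
            await queue.get()
            await asyncio.sleep(0)  # stand-in for slow parsing: yields to the loop
            queue.task_done()

    worker = asyncio.create_task(consumer())
    await producer()
    await queue.join()  # wait until every queued URL has been processed
    worker.cancel()     # the consumer loops forever, so stop it explicitly
    await asyncio.gather(worker, return_exceptions=True)
    return peak


if __name__ == "__main__":
    print("unbounded peak depth:", asyncio.run(run_pipeline(maxsize=0)))
    print("bounded peak depth:  ", asyncio.run(run_pipeline(maxsize=20)))
```

In the unbounded case the producer never has to wait, so the queue fills with every URL before the consumer makes progress. With a bound, the depth cannot exceed maxsize, which is what keeps memory flat.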
# Example: detect growth per 1 k pages
# _samples is assumed to hold (pages_done, rss_mb) tuples, oldest first
first, last = _samples[-3], _samples[-1]
growth_per_1k = (last[1] - first[1]) / ((last[0] - first[0]) / 1000)
if growth_per_1k > 50:  # >50 MB per 1 000 pages
    print(f"LEAK ALERT: +{growth_per_1k:.1f} MB/1k pages")
The threshold of 50 MB per 1 000 pages is conservative — anything above 20 MB on a steady‑state run is worth investigating. The output gets piped to Apify’s dataset, so I can grep across runs.
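The snippet above reads from a _samples list without showing how it gets filled. A minimal sketch of one way to populate it follows, assuming a Linux container (it reads /proc/self/statm and falls back to peak RSS from the Unix-only resource module). The helper names read_rss_mb, record_sample, and growth_per_1k_pages are hypothetical, and the (pages_done, rss_mb) tuple layout is inferred from the indexing in the snippet.

```python
import os
import resource
import sys
from typing import List, Optional, Tuple

# (pages_done, rss_mb) pairs, oldest first. The layout is an assumption.
_samples: List[Tuple[int, float]] = []


def read_rss_mb() -> float:
    """Current RSS in MB on Linux; falls back to peak RSS on other Unix systems."""
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])  # second field: resident pages
        return resident_pages * os.sysconf("SC_PAGE_SIZE") / 1e6
    except (OSError, ValueError, IndexError):
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
        return peak / 1e6 if sys.platform == "darwin" else peak / 1e3


def record_sample(pages_done: int, rss_mb: Optional[float] = None) -> None:
    """Append one sample; pass rss_mb explicitly to replay recorded values."""
    _samples.append((pages_done, read_rss_mb() if rss_mb is None else rss_mb))


def growth_per_1k_pages(samples: List[Tuple[int, float]], window: int = 3) -> Optional[float]:
    """MB of RSS growth per 1 000 pages over the last `window` samples, or None."""
    if len(samples) < window:
        return None  # not enough data yet
    first, last = samples[-window], samples[-1]
    pages = last[0] - first[0]
    if pages <= 0:
        return None  # avoid dividing by zero on duplicate page counts
    return (last[1] - first[1]) / (pages / 1000)
```

Calling record_sample(pages_done) every 1 000 pages inside the crawl loop, then comparing growth_per_1k_pages(_samples) against the threshold, reproduces the check above while guarding against the first few samples, where _samples[-3] would raise an IndexError.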
The cost angle nobody mentions
Memory leaks rarely crash a scraper. What they do is force you to bump the actor’s memory configuration:
- 1 GB → 2 GB: doubles compute‑unit consumption per second
- 2 GB → 4 GB: quadruples it versus the 1 GB baseline
On Apify pricing, a 4 GB run at $0.0004 /CU‑second costs ~4× a properly‑tuned 1 GB run for the same wall‑clock time. Across 968 Trustpilot runs that would have been an extra ≈ $120 / year for nothing — pure operational waste because nobody profiled RSS.
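The 4× figure follows from how compute units scale with memory. A small sketch of the arithmetic, assuming a compute unit is 1 GB of memory held for one hour. The function names, the 15-minute run length, and the unit price of 1.0 are placeholders for illustration, not quoted Apify rates.

```python
def compute_units(memory_gb: float, run_seconds: float) -> float:
    """Compute units for one run, assuming 1 CU = 1 GB of memory for 1 hour."""
    return memory_gb * run_seconds / 3600


def run_cost(memory_gb: float, run_seconds: float, price_per_cu: float) -> float:
    """Cost of one run at a given (hypothetical) price per compute unit."""
    return compute_units(memory_gb, run_seconds) * price_per_cu


if __name__ == "__main__":
    run_seconds = 15 * 60  # hypothetical 15-minute run
    price = 1.0            # arbitrary unit price; the ratio does not depend on it
    baseline = run_cost(1, run_seconds, price)
    leaky = run_cost(4, run_seconds, price)
    print(f"4 GB vs 1 GB cost ratio: {leaky / baseline:.1f}x")
```

Because memory is a plain multiplier, the ratio stays at 4× for any run length and any price per compute unit. Only the absolute overspend depends on those two inputs.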
Conclusion
The three patterns above cover ~90 % of leaks I have hit in production. Add the RSS probe to every long‑running scraper, set the leak threshold at 50 MB / 1k pages, and you will catch the next one in the first dev cycle instead of the next billing cycle.
More production scraping notes: t.me/scraping_ai. Originally published at blog.spinov.online.