Three memory-leak patterns in long-running scrapers (and how I caught them after 968 Trustpilot runs)

Published: (May 17, 2026 at 10:52 PM EDT)
3 min read
Source: Dev.to

Alex Spinov

Introduction

Memory leaks in scrapers do not crash the run.
They quietly push you to raise the Apify memory limit from 1 GB → 2 GB → 4 GB, doubling the per-run cost at each step, and are often spotted only weeks later on a compute-unit invoice.

After 968 Trustpilot runs (~80–300 review pages each, ~150 k page hits cumulative) I started sampling RSS every 1 000 pages. The growth pattern told a different story than the logs. Below are the three patterns that account for ~90 % of the leaks I have seen across my 32 published Apify actors.


1. The unbounded asyncio queue

The most common pattern. A producer coroutine fetches URLs faster than the consumer parses them, so the in‑memory queue grows linearly with runtime.

# leaks at high concurrency
import asyncio

queue = asyncio.Queue()          # no maxsize, so the queue is unbounded

async def producer():
    async for url in source:     # source: any async iterator of URLs
        await queue.put(url)     # never blocks

async def consumer():
    while True:
        url = await queue.get()
        await process(url)       # slower than source

If process() is slower than source (true for most JS-rendered sites), the queue accumulates. On a Trustpilot run that scraped a company with 12 000 reviews, the queue held ~9 500 URLs at peak, about 380 MB of byte strings.

Fix

queue = asyncio.Queue(maxsize=200)   # producer blocks at 200

A bounded queue forces the producer to wait. Memory stays flat; throughput drops modestly.
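
To make the difference concrete, here is a minimal, self-contained sketch that measures peak queue depth with and without a bound. It is not the actor's real code: run_pipeline, the example.com URLs, the item count, and the sleep standing in for slow parsing are all assumptions chosen for illustration.

```python
import asyncio


async def run_pipeline(maxsize, n_items=300, work_delay=0.001):
    """Return the peak queue depth seen while a fast producer feeds a slow consumer."""
    queue = asyncio.Queue(maxsize=maxsize)  # maxsize=0 means unbounded
    peak = 0

    async def producer():
        nonlocal peak
        for i in range(n_items):
            # Hypothetical URLs; put() only suspends when the queue is full.
            await queue.put(f"https://example.com/page/{i}")
            peak = max(peak, queue.qsize())

    async def consumer():
        while True:
            await queue.get()
            await asyncio.sleep(work_delay)  # stand-in for slow parsing
            queue.task_done()

    worker = asyncio.create_task(consumer())
    await producer()
    await queue.join()  # wait until every queued item has been processed
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return peak


unbounded_peak = asyncio.run(run_pipeline(maxsize=0))
bounded_peak = asyncio.run(run_pipeline(maxsize=20))
print(f"unbounded peak: {unbounded_peak}, bounded peak: {bounded_peak}")
```

In the unbounded run, put() never suspends, so the producer enqueues every item before the consumer gets a turn and the peak equals n_items. With maxsize=20 the peak cannot exceed 20, which is the flat memory profile described above.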

# Example: detect growth per 1 k pages
# _samples is read here as a list of (pages_done, rss_mb) tuples
if len(_samples) >= 3:
    first, last = _samples[-3], _samples[-1]
    pages = last[0] - first[0]
    growth_per_1k = (last[1] - first[1]) / (pages / 1000) if pages else 0.0
    if growth_per_1k > 50:  # >50 MB per 1 000 pages
        print(f"LEAK ALERT: +{growth_per_1k:.1f} MB/1k pages")

The threshold of 50 MB per 1 000 pages is conservative; anything above 20 MB on a steady-state run is worth investigating. The output gets piped to Apify's dataset, so I can grep across runs.
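
For the sampling side, the sketch below shows one way to collect (pages_done, rss_mb) pairs every 1 000 pages. It is an assumption about how such a probe could look, not the probe used for these runs: RssProbe, read_rss_mb, and maybe_sample are hypothetical names, and the reader is Unix-only (it reads /proc/self/statm on Linux and falls back to peak RSS from the resource module elsewhere).

```python
import os
import resource
import sys


def read_rss_mb():
    """Best-effort resident set size in MB (Unix-only sketch)."""
    try:
        # Linux: the second field of /proc/self/statm is resident pages.
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 * 1024)
    except (OSError, ValueError, IndexError):
        # Fallback: peak (not current) RSS. ru_maxrss is KB on Linux, bytes on macOS.
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
        return peak / divisor


class RssProbe:
    """Records (pages_done, rss_mb) once every `every` pages."""

    def __init__(self, every=1000, reader=read_rss_mb):
        self.every = every
        self.reader = reader  # injectable so the probe can be tested with fake readings
        self.samples = []

    def maybe_sample(self, pages_done):
        """Take a sample when pages_done hits a multiple of `every`; return True if taken."""
        if pages_done > 0 and pages_done % self.every == 0:
            self.samples.append((pages_done, self.reader()))
            return True
        return False


probe = RssProbe(every=1000)
for pages_done in range(1, 3001):  # stand-in for the scraper's page loop
    probe.maybe_sample(pages_done)
print(probe.samples)
```

Calling maybe_sample once per processed page keeps the overhead to a modulo check on most pages, with one small file read per 1 000 pages.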

The cost angle nobody mentions

Memory leaks rarely crash a scraper. What they do is force you to bump the actor’s memory configuration:

  • 1 GB → 2 GB: doubles compute-unit consumption per second
  • 2 GB → 4 GB: doubles it again, so the run costs four times the 1 GB baseline

On Apify pricing, a 4 GB run at $0.0004 /CU-second costs ~4× a properly tuned 1 GB run for the same wall-clock time. Across 968 Trustpilot runs that would have been an extra ≈ $120 / year for nothing: pure operational waste because nobody profiled RSS.
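
The scaling is easy to check with a few lines of arithmetic. This is a rough sketch that assumes a compute unit corresponds to 1 GB of memory held for one hour and that cost is linear in memory × time; PRICE_PER_CU and RUNTIME_S are hypothetical values for illustration, not the rate or run length quoted in this post.

```python
def compute_units(memory_gb, runtime_seconds):
    """Compute units consumed, assuming 1 CU = 1 GB of memory for 1 hour."""
    return memory_gb * runtime_seconds / 3600


def run_cost(memory_gb, runtime_seconds, price_per_cu):
    """Cost of one run under a flat per-CU price."""
    return compute_units(memory_gb, runtime_seconds) * price_per_cu


PRICE_PER_CU = 0.30  # hypothetical USD per CU, for illustration only
RUNTIME_S = 600      # hypothetical 10-minute run

baseline = run_cost(1, RUNTIME_S, PRICE_PER_CU)
for gb in (1, 2, 4):
    cost = run_cost(gb, RUNTIME_S, PRICE_PER_CU)
    print(f"{gb} GB: ${cost:.4f} per run ({cost / baseline:.0f}x baseline)")
```

Whatever the actual price, the ratio between memory tiers stays the same as long as wall-clock time does not change, which is why a leak that forces a higher tier shows up directly on the invoice.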

Conclusion

The three patterns above cover ~90 % of leaks I have hit in production. Add the RSS probe to every long‑running scraper, set the leak threshold at 50 MB / 1k pages, and you will catch the next one in the first dev cycle instead of the next billing cycle.
