Thundering Herds: The Scalability Killer

Published: January 1, 2026 at 03:00 AM EST
8 min read
Source: Dev.to

What Is a Thundering Herd?

At its core, the Thundering Herd occurs when a large number of processes are waiting for an event, and when the event happens they all wake up at once—yet only one can actually handle the event.

The term originated in OS kernel scheduling, but modern web engineers most frequently encounter it as a Cache Stampede.

The Golden State

  1. You have a high‑traffic endpoint cached in Redis. Everything is fast.
  2. The Expiry: The cache TTL (time‑to‑live) hits zero.
  3. The Stampede: 5 000 concurrent users refresh the page. They all see a cache miss.

The Collapse

All 5 000 requests hit your database simultaneously to re‑generate the same data. Your database experiences a surge in load, latency skyrockets, and the service goes down.

Note: 5 000 is an arbitrary number. The actual number depends on your system’s capacity.

Other Manifestations of the Herd

  • Auth service outage: The primary Auth service goes down for 5 minutes. Every other service retries. When Auth comes back up, it is immediately hammered by a flood of retry requests → a Retry Storm.
  • Shared access token: 50 microservices share a machine‑to‑machine JWT. The token expires at the exact same second, causing all services to “thunder” toward the Identity Provider for a new token.
  • Cron job at midnight: A heavy cleanup job is scheduled at 0 0 * * * across 100 server nodes. At 12:00 AM all nodes hit the database or shared storage simultaneously.
  • Mobile app rollout: You deploy a new 500 MB mobile binary and notify 1 million users to download it immediately. The first thousands of requests miss the CDN edge and hit the origin server at once, potentially melting your storage layer.

Spotting the Herd

Look for these “herd signatures” in your monitoring dashboard:

  • Correlation of Cache Misses and Latency – A sudden spike in cache‑miss rates that perfectly aligns with a surge in p99 database latency.
  • Connection‑Pool Exhaustion – Database connection pool hitting its max limit within milliseconds.
  • CPU Context Switching – Massive spike in “System CPU” or context switches, indicating thousands of threads fighting for the same locks.
  • Error Logs – Thousands of “lock wait timeout” or “connection refused” errors occurring in a tight cluster.

Request Collapsing (Promise Memoization)

Request collapsing ensures that for any given resource only one upstream request is active at a time.

  • If Request A is already fetching user_data_123 from the database, Requests B, C, D should subscribe to the result of A instead of starting their own fetches.

The Busy‑Waiting Pitfall

A naïve lock can create a secondary herd: if 4 999 requests are waiting for that one DB call, they might poll “Is it ready yet?” every 10 ms, consuming CPU and creating a new herd in application memory.
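The anti‑pattern looks something like this (a sketch; `wait_for_cache_polling` is an illustrative name, not a real API):

```python
import asyncio

# Anti-pattern: each waiting request wakes up every 10 ms to re-check.
# With thousands of waiters, this polling becomes its own in-memory herd.
async def wait_for_cache_polling(cache, key, interval=0.01):
    while key not in cache:
        await asyncio.sleep(interval)   # wake, check, go back to sleep
    return cache[key]
```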

Move to Event‑Based Notification

Switch from a polling (pull) model to a notification (push) model:

  • Instead of asking “Is it done?”, waiting requests should sleep and be woken up when the data is ready.

In Python or Node.js this is often handled natively by Promises or Futures. In other languages you might use Condition Variables or Channels.
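As a minimal sketch of the Future‑based flavor (the `get_once` helper and its `fetch` callback are illustrative, not a library API), followers simply await the leader’s Future, sleeping until the result is set:

```python
import asyncio

_inflight = {}   # key -> Future of the fetch currently in progress

async def get_once(key, fetch):
    # Follower: a fetch for this key is already running; await its Future
    if key in _inflight:
        return await _inflight[key]

    # Leader: register a Future, do the work, then resolve it
    fut = asyncio.get_running_loop().create_future()
    _inflight[key] = fut
    try:
        result = await fetch(key)
        fut.set_result(result)
        return result
    except Exception as exc:
        fut.set_exception(exc)   # waiters see the same failure
        raise
    finally:
        del _inflight[key]
```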

Python Example Using asyncio

Below is a minimal example that demonstrates request collapsing on a single server. The “followers” simply await an Event, consuming zero CPU while they wait for the “leader” to finish the work.

import asyncio

class RequestCollapser:
    def __init__(self):
        # Stores the events for keys currently being fetched
        self.inflight_events = {}
        self.cache = {}

    async def get_data(self, key):
        # 1️⃣ Check if data is already in cache
        if key in self.cache:
            return self.cache[key]

        # 2️⃣ Check if someone else is already fetching it
        if key in self.inflight_events:
            print(f"Request for {key} joining the herd (waiting)...")
            event = self.inflight_events[key]
            await event.wait()               # Zero CPU usage while waiting
            return self.cache.get(key)

        # 3️⃣ Be the "Leader"
        print(f"Request for {key} is the LEADER. Fetching from DB...")
        event = asyncio.Event()
        self.inflight_events[key] = event

        try:
            # Simulate DB fetch
            await asyncio.sleep(1)
            data = "Fresh Data"
            self.cache[key] = data
            return data
        finally:
            # 4️⃣ Notify the herd
            event.set()                       # Wakes up all waiters instantly
            del self.inflight_events[key]

The example works perfectly for a single server. But what if you have 100 app servers? You still have 100 “leaders” potentially hammering the database.

Scaling Request Collapsing Across Multiple Instances

To extend collapsing across a fleet you need a distributed coordination layer (e.g., Redis, etcd, Zookeeper, or a dedicated lock service). The pattern stays the same:

  1. Attempt to acquire a distributed lock for the key.
  2. If you acquire the lock → you are the leader; fetch the data, store it, then release the lock and publish a notification (e.g., via Pub/Sub).
  3. If you fail to acquire the lock → subscribe to the notification channel and await the result.

Below is a high‑level sketch using Redis and its Pub/Sub feature (via redis-py’s asyncio API; fetch_from_db stands in for your own DB call):

import asyncio
import redis.asyncio as redis

REDIS_LOCK_TTL = 30          # seconds
CHANNEL_PREFIX = "herd:"

async def get_data_distributed(r, key):
    cache_key = f"cache:{key}"
    lock_key  = f"lock:{key}"
    channel   = f"{CHANNEL_PREFIX}{key}"

    # 1️⃣ Try cache first
    cached = await r.get(cache_key)
    if cached is not None:
        return cached

    # 2️⃣ Try to become leader by acquiring a lock
    is_leader = await r.set(lock_key, "1", nx=True, ex=REDIS_LOCK_TTL)
    if is_leader:
        # Leader: fetch from DB, populate cache, notify herd
        try:
            data = await fetch_from_db(key)          # your DB call
            await r.set(cache_key, data, ex=300)     # cache for 5 min
            await r.publish(channel, data)           # wake up followers
            return data
        finally:
            await r.delete(lock_key)                 # release lock

    # 3️⃣ Follower: subscribe, then re-check the cache to close the race
    # where the leader published before our subscription was active
    pubsub = r.pubsub()
    await pubsub.subscribe(channel)
    try:
        cached = await r.get(cache_key)
        if cached is not None:
            return cached
        async for message in pubsub.listen():
            # (production code should also time out here in case the leader dies)
            if message["type"] == "message":
                return message["data"]
    finally:
        await pubsub.unsubscribe(channel)

Key points

  • The lock guarantees only one leader per key across the whole fleet.
  • Followers block on Pub/Sub, consuming no CPU while they wait.
  • The lock has a TTL to avoid dead‑locks if the leader crashes.
  • The notification channel can be a simple string; the payload can be the fresh data or a “ready” flag prompting followers to read from the cache.

Adding Jitter to Prevent Synchronized Retries

Even with request collapsing, retry storms can still happen when a service becomes temporarily unavailable. Adding random jitter to back‑off intervals spreads retries over time.

import asyncio
import random

async def retry_with_jitter(make_call, max_attempts=5, base_delay=0.5):
    # make_call is a zero-argument callable that returns a fresh coroutine
    # on each attempt (a coroutine object can only be awaited once)
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential back-off + jitter
            jitter = random.uniform(0, base_delay)
            delay = (2 ** (attempt - 1)) * base_delay + jitter
            await asyncio.sleep(delay)

  • Exponential back‑off prevents overwhelming the target.
  • Jitter (randomness) ensures that retries from many clients don’t line up again.

Takeaways

  1. Identify herd signatures early (cache misses, connection‑pool spikes, CPU context switches, bursty errors).
  2. Collapse duplicate work with in‑process or distributed request collapsing.
  3. Use event‑driven notifications (Futures, Promises, Condition Variables, Pub/Sub) instead of busy‑waiting.
  4. Add jitter to any retry logic to avoid synchronized storms.
  5. Test at scale – simulate thousands of concurrent requests to verify that your mitigation holds under load.

By combining request collapsing with jittered retries, you can turn a potential Thundering Herd into a well‑behaved, resilient system. 🚀

Distributed Locks, Jitter, and Request Collapsing

When many nodes try to become the leader for a specific key, you can end up hitting your database all at once. Depending on the database, this may be a serious problem. To protect your system from this edge case, use distributed locks so that only one node in the entire cluster becomes the leader for a given key.

Solutions at Scale

  • Distributed Locks (Redis/etcd): Use a library like Redlock to guarantee a single leader per key across the cluster. Use when you need strict coordination across multiple instances.
  • Singleflight Pattern (Go): The golang.org/x/sync/singleflight package collapses concurrent calls locally. Combine it with a distributed lock to protect both app memory and the database. Use for Go services with high‑traffic single keys.
  • Jitter: Introduce intentional randomness to stagger retries and TTL expirations. Use to avoid thundering‑herd spikes.
  • X‑Fetch (Probabilistic Refresh): Refresh a cache entry slightly before it expires, using a random “dice roll” to decide which request performs the refresh. Use for mission‑critical low‑latency data.

Jitter: Staggering Execution

When a request discovers that a resource is already being collapsed (another request is fetching it), don’t retry on a fixed interval.

  • Bad: Retry every 50 ms.
  • Good: Retry every 50 ms + random(0, 20 ms).
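A one‑line sketch of the “good” variant (`poll_delay` is an illustrative name):

```python
import random

def poll_delay(base=0.05, jitter=0.02):
    # 50 ms base plus up to 20 ms of randomness,
    # so waiting requests don't re-check in lockstep
    return base + random.uniform(0, jitter)
```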

TTL Example

Never set an identical hard TTL on a large batch of keys.

  • Problem: Updating 10 000 products and setting each to expire in exactly 1 hour schedules a disaster for exactly 60 minutes from now.
  • Solution:
ttl = 3600 + random.uniform(0, 120)   # spreads expirations over a 2‑minute window

X‑Fetch: Probabilistic Cache Refresh

Instead of waiting for the cache to expire, use jitter to trigger a refresh just before expiration.

  1. As the TTL approaches zero, each request performs a “dice roll.”
  2. If the roll is low, that request becomes the leader, re‑fetches the data, and resets the cache.
  3. All other requests continue to receive the stale‑but‑safe data.

Sample Python Implementation

import math
import random
import time

BETA = 1.0  # > 1 favors earlier refreshes

async def get_resilient_data(key):
    # cache.get is assumed to return None or an entry with .value,
    # .expiry (unix time), and .delta (how long the last recompute took)
    cached = await cache.get(key)

    # 1️⃣  Cache miss → we must refresh
    should_refresh = cached is None

    if not should_refresh:
        # 2️⃣  X-Fetch "dice roll": log(random.random()) is negative, so this
        # pushes "now" forward by a random amount. The closer we are to expiry
        # (and the longer the recompute takes), the likelier an early refresh.
        early_now = time.time() - cached.delta * BETA * math.log(random.random())
        if early_now >= cached.expiry:
            should_refresh = True

    if should_refresh:
        # 3️⃣  This request is the leader: re-fetch and reset the cache
        start = time.time()
        value = await fetch_from_db(key)                       # your DB call
        await cache.set(key, value, delta=time.time() - start, ttl=300)
        return value

    # All other requests keep serving the stale-but-safe value
    return cached.value

Next time you set a cache TTL, ask:
“What happens if 10 000 people ask for this at the same time?”
If the answer is “they all wait for the DB,” it’s time to add some jitter.

Want More Resilient Distributed Systems?

If you enjoyed this deep dive, follow for more insights on building robust, scalable architectures.

Aonnis Valkey Operator – Deploy and manage high‑performance Valkey‑compatible clusters on Kubernetes with built‑in best practices for reliability and scale.

🚀 Surprise: It’s free for a limited time.

🔗 Visit www.aonnis.com to learn more.
