Thundering Herds: The Scalability Killer
Source: Dev.to
What Is a Thundering Herd?
At its core, the Thundering Herd occurs when a large number of processes are waiting for an event, and when the event happens they all wake up at once—yet only one can actually handle the event.
The term originated in OS kernel scheduling, but modern web engineers most frequently encounter it as a Cache Stampede.
Anatomy of a Cache Stampede
1. The Golden State: You have a high‑traffic endpoint cached in Redis. Everything is fast.
2. The Expiry: The cache TTL (time‑to‑live) hits zero.
3. The Stampede: 5 000 concurrent users refresh the page, and they all see a cache miss.
4. The Collapse: All 5 000 requests hit your database simultaneously to re‑generate the same data. Load surges, latency skyrockets, and the service goes down.
Note: 5 000 is an arbitrary number. The actual number depends on your system’s capacity.
Other Manifestations of the Herd
| Scenario | What Happens |
|---|---|
| Auth service outage | Primary Auth service goes down for 5 minutes. Every other service retries. When Auth comes back up, it is immediately hammered by a flood of retry requests → Retry Storm. |
| Shared access token | 50 microservices share a machine‑to‑machine JWT. The token expires at the exact same second, causing all services to “thunder” toward the Identity Provider for a new token. |
| Cron job at midnight | A heavy cleanup job is scheduled at 0 0 * * * across 100 server nodes. At 12:00 AM all nodes hit the database or shared storage simultaneously. |
| Mobile app rollout | You deploy a new 500 MB mobile binary. You notify 1 million users to download it immediately. The first thousands of requests miss the CDN edge and hit the origin server at once, potentially melting your storage layer. |
Spotting the Herd
Look for these “herd signatures” in your monitoring dashboard:
- Correlation of Cache Misses and Latency – A sudden spike in cache‑miss rates that perfectly aligns with a surge in p99 database latency.
- Connection‑Pool Exhaustion – Database connection pool hitting its max limit within milliseconds.
- CPU Context Switching – Massive spike in “System CPU” or context switches, indicating thousands of threads fighting for the same locks.
- Error Logs – Thousands of “lock wait timeout” or “connection refused” errors occurring in a tight cluster.
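As a rough illustration of the first signature, a check like the following flags the time buckets where a cache‑miss spike coincides with a p99 latency spike (the threshold values here are arbitrary and purely for illustration):

```python
def herd_signature(miss_rates, p99_latencies, miss_thresh=0.5, lat_thresh=1.0):
    """Return indices of time buckets where a cache-miss spike
    coincides with a p99 latency spike — a classic herd signature."""
    return [
        i
        for i, (miss, p99) in enumerate(zip(miss_rates, p99_latencies))
        if miss > miss_thresh and p99 > lat_thresh
    ]
```

Feeding it per‑minute miss rates and p99 latencies from your metrics store highlights exactly when the herd struck.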
Request Collapsing (Promise Memoization)
Request collapsing ensures that for any given resource only one upstream request is active at a time.
- If Request A is already fetching `user_data_123` from the database, Requests B, C, and D should subscribe to the result of A instead of starting their own fetches.
The Busy‑Waiting Pitfall
A naïve lock can create a secondary herd: if 4 999 requests are waiting for that one DB call, they might poll “Is it ready yet?” every 10 ms, consuming CPU and creating a new herd in application memory.
Move to Event‑Based Notification
Switch from a polling model to a notification model:
- Instead of repeatedly asking “Is it done?”, waiting requests should sleep and be woken up when the data is ready.
In Python or Node.js this is often handled natively by Promises or Futures. In other languages you might use Condition Variables or Channels.
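For synchronous (threaded) Python, the same sleep‑until‑notified idea can be sketched with `threading.Event`; this is a minimal illustration, and the names here are made up for the example:

```python
import threading
import time

result = {}
ready = threading.Event()

def leader():
    # One worker does the expensive fetch...
    time.sleep(0.1)  # simulate a DB call
    result["user_data_123"] = "Fresh Data"
    ready.set()  # ...then wakes every waiter at once

def follower():
    ready.wait()  # sleeps here — no polling, no CPU burn
    return result["user_data_123"]

threading.Thread(target=leader).start()
values = [follower() for _ in range(3)]
```

All three followers block on the same event and receive the leader's result without ever polling.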
Python Example Using asyncio
Below is a minimal example that demonstrates request collapsing on a single server. The “followers” simply await an Event, consuming zero CPU while they wait for the “leader” to finish the work.
```python
import asyncio

class RequestCollapser:
    def __init__(self):
        # Stores the events for keys currently being fetched
        self.inflight_events = {}
        self.cache = {}

    async def get_data(self, key):
        # 1️⃣ Check if data is already in cache
        if key in self.cache:
            return self.cache[key]

        # 2️⃣ Check if someone else is already fetching it
        if key in self.inflight_events:
            print(f"Request for {key} joining the herd (waiting)...")
            event = self.inflight_events[key]
            await event.wait()  # Zero CPU usage while waiting
            return self.cache.get(key)

        # 3️⃣ Be the "Leader"
        print(f"Request for {key} is the LEADER. Fetching from DB...")
        event = asyncio.Event()
        self.inflight_events[key] = event
        try:
            # Simulate DB fetch
            await asyncio.sleep(1)
            data = "Fresh Data"
            self.cache[key] = data
            return data
        finally:
            # 4️⃣ Notify the herd
            event.set()  # Wakes up all waiters instantly
            del self.inflight_events[key]
```
The example works perfectly for a single server. But what if you have 100 app servers? You still have 100 “leaders” potentially hammering the database.
Scaling Request Collapsing Across Multiple Instances
To extend collapsing across a fleet you need a distributed coordination layer (e.g., Redis, etcd, Zookeeper, or a dedicated lock service). The pattern stays the same:
- Attempt to acquire a distributed lock for the key.
- If you acquire the lock → you are the leader; fetch the data, store it, then release the lock and publish a notification (e.g., via Pub/Sub).
- If you fail to acquire the lock → subscribe to the notification channel and await the result.
Below is a high‑level sketch using Redis and its Pub/Sub feature (pseudo‑code for clarity):
```python
import asyncio
import aioredis  # pseudo-code: any async Redis client works

REDIS_LOCK_TTL = 30  # seconds
CHANNEL_PREFIX = "herd:"

async def get_data_distributed(redis, key):
    cache_key = f"cache:{key}"
    lock_key = f"lock:{key}"
    channel = f"{CHANNEL_PREFIX}{key}"

    # 1️⃣ Try cache first
    cached = await redis.get(cache_key)
    if cached:
        return cached

    # 2️⃣ Try to become leader by acquiring a lock
    is_leader = await redis.set(lock_key, "1", nx=True, ex=REDIS_LOCK_TTL)
    if is_leader:
        # Leader: fetch from DB, populate cache, notify herd
        try:
            data = await fetch_from_db(key)           # your DB call
            await redis.set(cache_key, data, ex=300)  # cache for 5 min
            await redis.publish(channel, data)        # wake up followers
            return data
        finally:
            await redis.delete(lock_key)  # release lock
    else:
        # Follower: wait for notification
        pubsub = redis.pubsub()
        await pubsub.subscribe(channel)
        # Re-check the cache: the leader may have published between
        # our cache miss and the subscribe call.
        cached = await redis.get(cache_key)
        if cached:
            return cached
        async for message in pubsub.listen():
            if message["type"] == "message":
                return message["data"]
```
Key points
- The lock guarantees only one leader per key across the whole fleet.
- Followers block on Pub/Sub, consuming no CPU while they wait.
- The lock has a TTL to avoid dead‑locks if the leader crashes.
- The notification channel can be a simple string; the payload can be the fresh data or a “ready” flag prompting followers to read from the cache.
Adding Jitter to Prevent Synchronized Retries
Even with request collapsing, retry storms can still happen when a service becomes temporarily unavailable. Adding random jitter to back‑off intervals spreads retries over time.
```python
import asyncio
import random

async def retry_with_jitter(make_coro, max_attempts=5, base_delay=0.5):
    # make_coro is a zero-argument callable that returns a fresh coroutine,
    # because a coroutine object can only be awaited once.
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_coro()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential back-off + jitter
            jitter = random.uniform(0, base_delay)
            delay = (2 ** (attempt - 1)) * base_delay + jitter
            await asyncio.sleep(delay)
```
- Exponential back‑off prevents overwhelming the target.
- Jitter (randomness) ensures that retries from many clients don’t line up again.
Takeaways
- Identify herd signatures early (cache misses, connection‑pool spikes, CPU context switches, bursty errors).
- Collapse duplicate work with in‑process or distributed request collapsing.
- Use event‑driven notifications (Futures, Promises, Condition Variables, Pub/Sub) instead of busy‑waiting.
- Add jitter to any retry logic to avoid synchronized storms.
- Test at scale – simulate thousands of concurrent requests to verify that your mitigation holds under load.
By combining request collapsing with jittered retries, you can turn a potential Thundering Herd into a well‑behaved, resilient system. 🚀
Distributed Locks, Jitter, and Request Collapsing
When many nodes try to become the leader for a specific key, you can end up hitting your database all at once. Depending on the database, this may be a serious problem. To protect your system from this edge case, use distributed locks so that only one node in the entire cluster becomes the leader for a given key.
Solutions at Scale
| Technique | How It Works | When to Use |
|---|---|---|
| Distributed Locks (Redis/Etcd) | Use a library like Redlock to guarantee a single leader per key across the cluster. | When you need strict coordination across multiple instances. |
| Singleflight Pattern (Go) | The golang.org/x/sync/singleflight package collapses concurrent calls locally. Combine it with a distributed lock to protect both app memory and the database. | Go services with high‑traffic single keys. |
| Jitter | Introduce intentional randomness to stagger retries and TTL expirations. | To avoid thundering‑herd spikes. |
| X‑Fetch (Probabilistic Refresh) | Refresh a cache entry slightly before it expires, using a random “dice roll” to decide which request performs the refresh. | Mission‑critical low‑latency data. |
Jitter: Staggering Execution
When a request discovers that another request is already fetching a resource, don’t retry on a fixed interval.
- Bad: Retry every 50 ms.
- Good: Retry every 50 ms + random(0, 20 ms).
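A minimal sketch of such a jittered poll loop (the `cache` dict and the timing values are illustrative only):

```python
import random
import time

def wait_for_value(cache, key, base=0.05, jitter=0.02, max_attempts=20):
    # Poll with a randomized interval so waiters don't wake in lockstep.
    for _ in range(max_attempts):
        value = cache.get(key)
        if value is not None:
            return value
        time.sleep(base + random.uniform(0, jitter))
    return None
```

Each waiter sleeps 50–70 ms between checks, so thousands of waiters re‑check at slightly different moments instead of all at once.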
TTL Example
Never set a hard TTL on a large batch of keys.
- Problem: Updating 10 000 products and setting each to expire in exactly 1 hour schedules a disaster for exactly 60 minutes from now.
- Solution:

```
TTL = 3600 + (rand() * 120)  // spreads expirations over a 2-minute window
```
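In Python, the same idea might look like this when caching a batch (the base TTL and the 2‑minute spread are just for illustration):

```python
import random

BASE_TTL = 3600  # 1 hour
SPREAD = 120     # 2-minute window

def jittered_ttl(base=BASE_TTL, spread=SPREAD):
    # Each key gets its own expiry somewhere inside the window,
    # so 10 000 keys no longer expire in the same instant.
    return base + random.uniform(0, spread)

ttls = [jittered_ttl() for _ in range(10_000)]
```

Pass the per‑key value to your cache client (e.g., as the `ex` argument of a Redis `SET`) instead of a single shared constant.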
X‑Fetch: Probabilistic Cache Refresh
Instead of waiting for the cache to expire, use jitter to trigger a refresh just before expiration.
- As the TTL approaches zero, each request performs a “dice roll.”
- If the roll is low, that request becomes the leader, re‑fetches the data, and resets the cache.
- All other requests continue to receive the stale‑but‑safe data.
Sample Python Implementation
A sketch of the idea, assuming each cache entry carries its expiry timestamp (`expiry`), the duration of the last recomputation (`delta`), and the payload (`value`); `cache` and `fetch_from_db` are stand‑ins for your own helpers:

```python
import math
import random
import time

BETA = 1.0  # >1 refreshes more eagerly, <1 less eagerly

async def get_resilient_data(key):
    cached = await cache.get(key)

    # 1️⃣ Cache miss: we must fetch
    if cached is None:
        should_refresh = True
    else:
        # 2️⃣ Time remaining until expiry
        time_remaining = cached.expiry - time.time()
        # 3️⃣ X-Fetch "dice roll": the closer to expiry, the more likely
        # this request wins and refreshes early. Also true if already expired.
        should_refresh = (
            cached.delta * BETA * -math.log(random.random()) >= time_remaining
        )

    if should_refresh:
        data = await fetch_from_db(key)
        await cache.set(key, data)
        return data
    return cached.value
```
Next time you set a cache TTL, ask:
“What happens if 10 000 people ask for this at the same time?”
If the answer is “they all wait for the DB,” it’s time to add some jitter.