How a Cache Invalidation Bug Nearly Took Down Our System - And What We Changed After

Published: December 4, 2025 at 08:57 PM EST
4 min read
Source: Dev.to

🎬 The Setup

The night before the incident, we upgraded our Aurora MySQL engine version.
Everything looked good—no alarms, no red flags.

The next morning around 8 AM, our daily job kicked in—the one responsible for:

  • Deleting the stale “master data” cache
  • Refetching fresh master data from the DB
  • Storing it back in cache

The application depends on this master dataset to function correctly, so if the cache isn’t warm, the DB gets hammered.


💥 The Explosion

Right after the engine upgrade, a specific query in the Lambda suddenly started taking 30+ seconds.
Our Lambda had a 30‑second timeout, so the cacheInvalidate → cacheRebuild flow failed:

  • The cache remained empty.
  • Every user request resulted in a cache miss.
  • All those requests hit the DB directly.
  • Aurora CPU spiked to 99%.
  • Application responses stalled across the board.

Classic cache stampede.

We eventually triggered a failover, and luckily the same query ran in ~28.7 seconds on the new writer—just under the Lambda timeout—buying us a few minutes to stabilize.

Later that night we discovered the real culprit: the query needed a new index, and the upgrade changed its execution plan. We created the index via a hotfix, and the DB stabilized. The deeper problem was our cache invalidation approach.

🧹 Our Original Cache Invalidation: Delete First, Hope Later

Our initial flow was:

  1. Delete the existing cache key
  2. Fetch fresh data from the DB
  3. Save it back to cache

If step 2 fails, everything collapses. In our case, the Lambda failed to fetch fresh data, so the cache stayed empty.
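
Here’s a minimal sketch of that original flow (deleteCache, fetchMasterDataFromDB, and setCache are illustrative placeholders, not our actual code):

// Original flow (fragile): the cache is deleted before we know the refresh will succeed
async function refreshMasterDataOld() {
  await deleteCache("Master-Data");           // 1. delete first
  const data = await fetchMasterDataFromDB(); // 2. if this fails, the cache stays empty
  await setCache("Master-Data", data);        // 3. never reached on failure
}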

🔧 What We Changed (and Recommend)

1. Never delete the cache before you have fresh data

We inverted the flow:

  • Fetch → Validate → Update cache
  • Only delete if we already have fresh data ready

This eliminates the “empty cache” window.
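
A sketch of the inverted flow, using the same hypothetical helpers plus an illustrative isValidMasterData check:

// Inverted flow: only touch the existing cache entry once fresh data is in hand
async function refreshMasterDataNew() {
  const data = await fetchMasterDataFromDB();  // 1. fetch first
  if (!isValidMasterData(data)) {              // 2. validate before touching the cache
    throw new Error("Refresh produced invalid master data");
  }
  await setCache("Master-Data", data);         // 3. overwrite in place, no empty-cache window
}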

2. Use “stale rollover” instead of blunt deletion

If the refresh job fails, we now:

  1. Rename the key
    "Master-Data""Master-Data-Stale"
  2. Keep the old value available
  3. Add an internal notification so the team can investigate

This ensures that even if the DB is slow or down, the system still has something to serve. It’s not ideal, but it prevents a meltdown.
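
A sketch of that rollover path, reusing the shared Redis client from the redis.js module shown under point 4 (notifyTeam is a hypothetical alert hook):

// On refresh failure, keep the last good value under a "-Stale" key instead of deleting it
async function rollOverToStale() {
  if (await redis.exists("Master-Data")) {
    await redis.rename("Master-Data", "Master-Data-Stale"); // preserve the last good copy
  }
  await notifyTeam("Master data refresh failed; serving stale copy"); // hypothetical notification hook
}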

3. API layer now returns stale data when fresh data is unavailable

The API logic became:

  1. Try to read "Master-Data"
  2. If not found, attempt to rebuild (only if allowed)
  3. If rebuild fails → return stale data

This avoids cascading failures.
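
A sketch of that read path, assuming master data is cached as a JSON string (rebuildAllowed is a hypothetical feature flag; refreshMasterData is the lock-protected rebuild shown under point 4):

// Read path: prefer fresh data, then a rebuild, and only then the stale copy
async function getMasterData() {
  const fresh = await redis.get("Master-Data");
  if (fresh) return JSON.parse(fresh);

  if (rebuildAllowed) {
    try {
      return await refreshMasterData(); // may itself return stale data if another node holds the lock
    } catch (err) {
      console.error("Rebuild failed, falling back to stale data", err);
    }
  }

  const stale = await redis.get("Master-Data-Stale");
  return stale ? JSON.parse(stale) : null;
}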

4. Add a Redis distributed lock to prevent cache stampede

Without this, multiple API nodes or Lambdas could all try to rebuild simultaneously, hammering the DB again. With a Redis lock:

  • Only one request gets the lock and rebuilds.
  • Others do not hit the DB; they simply return stale data or wait for the winner to repopulate the cache.

Node.js – Acquire Distributed Lock (Redis)

// redis.js
const { createClient } = require("redis");

// Single shared client, configured via the REDIS_URL environment variable
const redis = createClient({
  url: process.env.REDIS_URL
});

redis.on("error", (err) => console.error("Redis client error", err));

// connect() is async; start it at module load and surface failures in the logs
redis.connect().catch((err) => console.error("Redis connection failed", err));

module.exports = redis;

Acquiring and Releasing the Lock

// lock.js
const redis = require("./redis");
const { randomUUID } = require("crypto");

const LOCK_KEY = "lock:master-data-refresh";
const LOCK_TTL = 10000; // lock auto-expires after 10 seconds

// Try to take the lock. Returns a unique lock ID on success, or null if
// another process already holds it.
async function acquireLock() {
  const lockId = randomUUID();

  // SET ... NX PX: only set if the key does not already exist, with a TTL
  // so a crashed holder can't keep the lock forever
  const result = await redis.set(LOCK_KEY, lockId, {
    NX: true,
    PX: LOCK_TTL
  });

  if (result === "OK") {
    return lockId; // lock acquired
  }

  return null; // lock not acquired
}

// Release the lock only if we still own it (the stored ID matches ours).
// Note: GET + DEL is not atomic; for stricter guarantees use a Lua script.
async function releaseLock(lockId) {
  const current = await redis.get(LOCK_KEY);

  if (current === lockId) {
    await redis.del(LOCK_KEY);
  }
}

module.exports = { acquireLock, releaseLock };

Usage

const { acquireLock, releaseLock } = require("./lock");

// fetchFromDB, saveToCache, and getStaleData are application-specific helpers
async function refreshMasterData() {
  const lockId = await acquireLock();

  if (!lockId) {
    // Another node already holds the lock; don't hit the DB again
    console.log("Another request is refreshing. Returning stale data.");
    return getStaleData();
  }

  try {
    const newData = await fetchFromDB();
    await saveToCache(newData);
    return newData;
  } finally {
    await releaseLock(lockId); // always release, even if the refresh throws
  }
}

5. Add observability around refresh times

We now record:

  • Query execution time
  • Cache refresh duration
  • Lock acquisition metrics
  • Alerts when a refresh exceeds a threshold

The goal is to catch slowdowns before a timeout happens.
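
A minimal sketch of what that timing can look like around the refresh (emitMetric and alertTeam are hypothetical hooks into your monitoring stack; the threshold is illustrative):

const REFRESH_ALERT_THRESHOLD_MS = 20000; // alert well before the 30-second Lambda timeout

async function timedRefreshMasterData() {
  const start = Date.now();
  try {
    return await refreshMasterData();
  } finally {
    const durationMs = Date.now() - start;
    await emitMetric("master_data.refresh_duration_ms", durationMs);
    if (durationMs > REFRESH_ALERT_THRESHOLD_MS) {
      await alertTeam(`Master data refresh took ${durationMs} ms`);
    }
  }
}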

📝 Key Takeaways

  • Engine upgrades can change execution plans dramatically. Always benchmark critical queries after major DB changes.
  • Cache invalidation strategies must assume that refresh can fail.
  • Serving stale‑but‑valid data is often better than serving errors.
  • Distributed locks are essential in preventing cache stampedes.

🚀 Final Thoughts

The incident was stressful, but the learnings were worth it. Caching problems rarely show up during normal traffic—they appear right when your system is busiest. If you have a similar “delete‑then‑refresh” pattern somewhere in your application, review it before it reviews you.
