The Case for Leaky Locks: Redis TTL as Failure Cooldown for Expensive AI Jobs

Published: 1 month ago (March 13, 2026 at 12:25 PM EDT)

Source: Dev.to

The Problem I Didn’t See Coming: How Releasing Locks Cost Me Money

My job queue was simple: a user submits a document → an AI evaluates it → the result is stored.
AI calls can fail; rarely, but they do. A model might ignore the expected output format, or a rate limit might kick in. I caught the exception, logged it, marked the job as failed, and released the lock in the finally block.

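Because the lock was gone the moment the job failed, nothing stopped the user from retrying immediately. The retry hit the same failure, and the result was a retry storm that hammered the LLM API and racked up real costs.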

Using Lock Expiration as a Cooldown Mechanism

# Wrapped in a coroutine so that `await` and the early `return` are valid
async def run_ai_job(job_id, data):
    # Acquire a lock with a 5‑minute TTL
    lock_key = f"ai_job_lock:{job_id}"
    # nx=True → set only if the key does not already exist
    acquired = redis.set(lock_key, "1", nx=True, ex=300)

    # If the lock already exists, the job is either running or cooling down
    if not acquired:
        return

    try:
        result = await call_llm_api(data)
        save_result(result)

        # Release lock only on success
        redis.delete(lock_key)

    except Exception as e:
        log.error(f"Failed: {e}")
        # Do NOT release the lock; the TTL provides the cooldown window
The lock simply stays for five minutes, after which Redis expires it automatically. This isn’t a memory leak; it’s a self‑destructing key.

Without TTL:  fail → retry → retry → retry → 20 calls in 60 s
With TTL:     fail → blocked → blocked → retry at t=5 min
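The timeline above can be checked with a small simulation. This is only a sketch under stated assumptions: every attempt fails, the user retries once every 3 seconds (an assumed spacing that yields the 20 attempts in 60 s), and `locked_until` stands in for the Redis key's expiry. No Redis is involved, and `count_llm_calls` is a hypothetical helper.

```python
def count_llm_calls(retry_times, cooldown_seconds):
    """Count attempts that would reach the LLM when every call fails."""
    calls = 0
    locked_until = None  # stands in for the expiry time of the Redis lock key
    for t in retry_times:
        if locked_until is not None and t < locked_until:
            continue  # lock still present: the attempt exits early
        calls += 1
        if cooldown_seconds > 0:
            # The failed attempt leaves the lock in place until the TTL runs out
            locked_until = t + cooldown_seconds
    return calls


# Assumed retry pattern: one attempt every 3 s for a minute (20 attempts)
retries = list(range(0, 60, 3))

print(count_llm_calls(retries, cooldown_seconds=0))    # 20
print(count_llm_calls(retries, cooldown_seconds=300))  # 1
```

With the lock released on failure, every attempt reaches the API. With the five‑minute cooldown, only the first one does, and the next billable call can happen no earlier than t=300 s.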

Why This Works

Rate limiting – immediate retries will hit the same limit.
Network hiccups – often resolve within minutes, not instantly.
Prompt problems – won’t fix themselves regardless of retry frequency.

When the system is already under strain, adding more expensive LLM calls only worsens the situation. A TTL‑based cooldown gives the external service time to recover, and retries made during the window return immediately instead of waiting on another call that is likely to fail.

The Part I Actually Like

There’s no complex retry logic, exponential backoff, or state machine—just time.

# All the code you need
acquired = redis.set(lock_key, "1", nx=True, ex=300)

# On failure, just let it ride; the lock expires naturally

If a worker crashes mid‑job, the lock still expires, allowing a recovery service to pick up the job later. During the cooldown window, any new retry request simply fails to acquire the lock and exits early without invoking the LLM.

When This Doesn’t Apply

Cheap operations where retries are essentially free.
Jobs that must be retried immediately.
Scenarios where users expect a synchronous, instant response.

Risks Worth Mentioning

Lock duration mismatch

If a job runs longer than the TTL, the lock may expire while the job is still active, allowing another worker to pick it up. Ensure the TTL exceeds the worst‑case runtime or implement a heartbeat that refreshes the lock.
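One way the heartbeat could look is sketched below. It is not part of the original setup: `heartbeat` and `run_with_heartbeat` are hypothetical helpers, `client` is assumed to behave like a redis-py `Redis` instance (only `expire` is used), and `job` is assumed to be a coroutine function that does the actual work. The 60‑second interval is an arbitrary choice.

```python
import asyncio
import contextlib


async def heartbeat(client, lock_key, ttl_seconds, interval_seconds):
    """Reset the lock's TTL every `interval_seconds` until cancelled."""
    while True:
        await asyncio.sleep(interval_seconds)
        # EXPIRE re-arms the TTL; a falsy reply means the key no longer exists
        if not client.expire(lock_key, ttl_seconds):
            return


async def run_with_heartbeat(client, lock_key, job, ttl_seconds=300, interval_seconds=60):
    """Await `job()` while a background task keeps the lock alive."""
    beat = asyncio.create_task(
        heartbeat(client, lock_key, ttl_seconds, interval_seconds)
    )
    try:
        return await job()
    finally:
        # Stop refreshing whether the job succeeded or failed. On failure the
        # lock is left in place, so the remaining TTL still acts as the cooldown.
        beat.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await beat
```

The caller is still responsible for acquiring the lock first and deleting it on success. After a failure, the cooldown is whatever TTL remains since the last refresh, so it falls somewhere between `ttl_seconds - interval_seconds` and `ttl_seconds`. With a synchronous client, the `expire` call briefly blocks the event loop; an async client would be awaited instead.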

Deterministic failures

The cooldown helps with transient issues (rate limits, network glitches) but not with permanent problems like bad input or a broken prompt. Classify such failures as permanently failed rather than letting them loop every five minutes.
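A minimal sketch of that classification, under several assumptions: `PermanentJobError` is a hypothetical exception the application raises for bad input or a broken prompt, the `ai_job_failed:` key name is invented for illustration, `call_model` is a hypothetical callable that performs the LLM call, and `client` is assumed to behave like a redis-py `Redis` instance.

```python
class PermanentJobError(Exception):
    """Hypothetical marker for failures a retry cannot fix (bad input, broken prompt)."""


def run_job(client, job_id, call_model, ttl_seconds=300):
    """Run one attempt and report what happened to the job."""
    failed_key = f"ai_job_failed:{job_id}"  # hypothetical key for the terminal state
    if client.exists(failed_key):
        return "failed_permanently"  # never call the LLM again for this job

    lock_key = f"ai_job_lock:{job_id}"
    if not client.set(lock_key, "1", nx=True, ex=ttl_seconds):
        return "blocked"  # running or cooling down

    try:
        call_model()
    except PermanentJobError as exc:
        # Retrying cannot help: record a terminal marker (no TTL) and free
        # the lock, since a cooldown would only delay the same failure.
        client.set(failed_key, str(exc))
        client.delete(lock_key)
        return "failed_permanently"
    except Exception:
        # Possibly transient: keep the lock so the TTL acts as the cooldown.
        return "cooling_down"

    client.delete(lock_key)  # success: release immediately
    return "done"
```

The terminal marker is checked before the lock, so a job with a broken prompt costs one LLM call in total instead of one every five minutes. Which exceptions count as permanent depends on the application and is not something the lock can decide.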

User‑facing feedback

If a user retries and sees no progress, it feels broken. Provide feedback such as “retry available in X minutes” or a job status indicator so users know the system is waiting rather than silently stuck.
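The remaining cooldown can be read back from Redis with the `TTL` command, which redis-py exposes as `ttl()`. The sketch below assumes `client` behaves like a redis-py `Redis` instance; `cooldown_message` and its wording are hypothetical. Because the same key covers both a running job and a cooling‑down one, the message cannot tell those two states apart without extra job status.

```python
import math


def cooldown_message(client, job_id):
    """Turn the lock's remaining TTL into a user-facing hint."""
    lock_key = f"ai_job_lock:{job_id}"
    remaining = client.ttl(lock_key)  # seconds left; -2 = no such key, -1 = no expiry

    if remaining == -2:
        return "Ready to retry."
    if remaining == -1:
        # Should not happen when the lock is always set with ex=...
        return "Job is locked without an expiry and needs manual attention."

    minutes = max(1, math.ceil(remaining / 60))
    return f"Retry available in about {minutes} minute(s)."
```

The retry endpoint could return this string instead of exiting silently when it fails to acquire the lock.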

Wrapping Up

We often spend a lot of effort building complex retry mechanisms, circuit breakers, and fallback systems. Sometimes the simplest answer is: let it fail, wait a bit, then try again later. A Redis TTL gives you that “wait a bit” automatically, with virtually no extra code, and any retry that goes through the lock cannot skip the cooldown.

Curious if anyone else has tried this or has a better approach—feel free to share!
