The Case for Leaky Locks: Redis TTL as Failure Cooldown for Expensive AI Jobs
Source: Dev.to
The Problem I Didn’t See Coming: How Releasing Locks Cost Me Money
My job queue was simple: a user submits a document → an AI evaluates it → the result is stored.
AI calls can fail—rarely, but they do. A model might ignore the expected output format or a rate limit can kick in. I caught the exception, logged it, marked the job as failed, and released the lock in the finally block.
The lock was gone the moment the job failed, so nothing stopped the user from retrying immediately. The AI failed again, and again, and the resulting retry storm hammered the LLM API and racked up real costs.
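For reference, the pattern described above looks roughly like this. It is a minimal sketch reconstructed from the description, not the original code: `client` is assumed to be a redis-py-compatible client, and `evaluate` is a hypothetical stand-in for the LLM call.

```python
def run_job_releasing_lock(client, job_id, evaluate):
    """Anti-pattern: the lock is released in `finally`, even after a failure."""
    lock_key = f"ai_job_lock:{job_id}"
    # SET with NX and EX: take the lock only if nobody else holds it
    if not client.set(lock_key, "1", nx=True, ex=300):
        return "skipped"
    try:
        evaluate(job_id)  # hypothetical stand-in for the expensive LLM call
        return "done"
    except Exception:
        return "failed"
    finally:
        # Runs on success AND on failure, so the very next retry gets through
        client.delete(lock_key)
```

With this shape, three quick retries of a failing job mean three paid LLM calls, because the lock never survives a failure.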
Using Lock Expiration as a Cooldown Mechanism
async def process_job(job_id, data):
    # Acquire a lock with a 5-minute TTL
    lock_key = f"ai_job_lock:{job_id}"
    acquired = redis.set(lock_key, "1", nx=True, ex=300)  # nx=True → set only if the key does not exist
    # If the lock already exists, the job is either running or cooling down
    if not acquired:
        return
    try:
        result = await call_llm_api(data)
        save_result(result)
        # Release the lock only on success
        redis.delete(lock_key)
    except Exception as e:
        log.error(f"Failed: {e}")
        # Do NOT release the lock; the TTL provides the cooldown window
The lock simply stays for five minutes, after which Redis deletes it automatically when the TTL expires. This isn't a memory leak; it's a self-destructing key.
Without TTL: fail → retry → retry → retry → 20 calls in 60 s
With TTL: fail → blocked → blocked → retry at t=5 min
Why This Works
- Rate limiting – immediate retries will hit the same limit.
- Network hiccups – often resolve within minutes, not instantly.
- Prompt problems – won’t fix themselves regardless of retry frequency.
When the system is already under strain, adding more expensive LLM calls only worsens the situation. A TTL-based cooldown gives the external service time to recover, and retries blocked during the window return immediately instead of waiting on another call that is likely to fail.
The Part I Actually Like
There’s no complex retry logic, exponential backoff, or state machine—just time.
# All the code you need
acquired = redis.set(lock_key, "1", nx=True, ex=300)
# On failure, just let it ride; the lock expires naturally
If a worker crashes mid-job, the lock still expires, allowing a recovery service to pick up the job later. During the cooldown window, any new retry request simply fails to acquire the lock and exits early without invoking the LLM.
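A recovery service could find jobs that are eligible again by checking whether their lock key still exists. This is only a sketch under assumptions: `jobs_ready_for_retry` and the list of failed job IDs are hypothetical, and `client` is assumed to be a redis-py-compatible client.

```python
def jobs_ready_for_retry(client, failed_job_ids):
    """Return the failed jobs whose lock key no longer exists."""
    return [
        job_id
        for job_id in failed_job_ids
        # EXISTS returns the number of given keys that are present (0 or 1 here)
        if not client.exists(f"ai_job_lock:{job_id}")
    ]
```

The check is advisory only. The retry itself should still go through the same `SET ... NX` acquisition, so two recovery workers cannot both pick up the same job.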
When This Doesn’t Apply
- Cheap operations where retries are essentially free.
- Jobs that must be retried immediately.
- Scenarios where users expect a synchronous, instant response.
Risks Worth Mentioning
Lock duration mismatch
If a job runs longer than the TTL, the lock may expire while the job is still active, allowing another worker to pick it up. Ensure the TTL exceeds the worst‑case runtime or implement a heartbeat that refreshes the lock.
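One way to implement the heartbeat is a background thread that keeps resetting the TTL while the job runs. The sketch below assumes a redis-py-compatible `client` that is safe to share across threads; `lock_heartbeat` is a hypothetical helper, and it does not verify lock ownership, which a production version probably should.

```python
import threading
from contextlib import contextmanager


@contextmanager
def lock_heartbeat(client, lock_key, ttl=300, interval=None):
    """Refresh the lock's TTL in the background while the body runs."""
    if interval is None:
        interval = ttl / 3  # refresh well before the key can expire
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once stop is set
        while not stop.wait(interval):
            client.expire(lock_key, ttl)

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        yield
    finally:
        # Stop refreshing; the key keeps whatever TTL it had at the last refresh
        stop.set()
        thread.join()
```

Wrapping the LLM call in `with lock_heartbeat(client, lock_key):` keeps the lock alive for long jobs. If the call raises, the heartbeat stops but the key is not deleted, so the cooldown still applies.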
Deterministic failures
The cooldown helps with transient issues (rate limits, network glitches) but not with permanent problems like bad input or a broken prompt. Classify such failures as permanently failed rather than letting them loop every five minutes.
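A rough sketch of that classification, under assumptions: `PermanentJobError` is a hypothetical exception that your own code would raise for bad input or a broken prompt, `job_store` is a plain dict standing in for wherever job status lives, and `client` is assumed to be a redis-py-compatible client.

```python
class PermanentJobError(Exception):
    """Hypothetical marker for failures that no retry can fix."""


def run_job(client, job_store, job_id, evaluate, ttl=300):
    # Permanently failed jobs never reach the LLM again
    if job_store.get(job_id) == "permanently_failed":
        return "permanently_failed"
    lock_key = f"ai_job_lock:{job_id}"
    if not client.set(lock_key, "1", nx=True, ex=ttl):
        return "skipped"  # already running or cooling down
    try:
        evaluate(job_id)  # hypothetical stand-in for the LLM call
    except PermanentJobError:
        # The stored status blocks future retries, so the lock can go
        job_store[job_id] = "permanently_failed"
        client.delete(lock_key)
        return "permanently_failed"
    except Exception:
        # Transient failure: keep the lock so the TTL acts as the cooldown
        return "cooling_down"
    job_store[job_id] = "done"
    client.delete(lock_key)
    return "done"
```

The difficult part is deciding which errors count as permanent. That depends on the LLM client in use, so it is left to the caller here.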
User‑facing feedback
If a user retries and sees no progress, it feels broken. Provide feedback such as “retry available in X minutes” or a job status indicator so users know the system is waiting rather than silently stuck.
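The remaining cooldown can be read straight from the lock key with Redis's TTL command. A minimal sketch, assuming a redis-py-compatible `client`; `retry_status` and its messages are hypothetical. The lock alone cannot tell a running job from one that is cooling down, so a real status line would probably combine this with the stored job status.

```python
import math


def retry_status(client, job_id):
    """Describe whether a retry would be accepted right now."""
    remaining = client.ttl(f"ai_job_lock:{job_id}")
    # Redis TTL returns -2 if the key is missing and -1 if it has no expiry
    if remaining == -2:
        return "Ready to retry"
    if remaining == -1:
        return "Locked with no expiry; needs manual cleanup"
    minutes = max(1, math.ceil(remaining / 60))
    return f"Busy or cooling down; retry possible in about {minutes} min"
```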
Wrapping Up
We often spend a lot of effort building complex retry mechanisms, circuit breakers, and fallback systems. Sometimes the simplest answer is: let it fail, wait a bit, then try again later. A Redis TTL gives you that “wait a bit” automatically, with virtually no extra code. As long as the TTL outlasts the job, a retry cannot slip past the lock during the cooldown.
Curious if anyone else has tried this or has a better approach—feel free to share!