Why Retry Is One Of The Most Dangerous Keywords In Software
Source: Dev.to
Few lines of code look more innocent than this: retry(3)
It feels responsible. Professional. Resilient. After all, networks fail. Servers become unavailable. Databases occasionally time out. Retrying seems like the obvious solution. And sometimes it is.
But after enough years building production systems, I’ve become convinced of something: Retry is one of the most dangerous keywords in software. Not because retries are bad. Because retries amplify everything. Good systems become more reliable. Bad systems become disasters.
The problem is that many developers treat retries as a reliability feature when they’re actually a distributed systems feature. And distributed systems are where simple ideas go to become complicated.
Imagine: await fetch("/api/users");
The request fails. Maybe it was a network hiccup, a temporary database issue, a load balancer restart, or a service deployment. The operation might succeed if attempted again. So we write: retry(3)
Seems reasonable. And in many cases: It Works
Which is why retries become popular.
Most developers unconsciously assume: Failure = Operation Did Not Execute
Unfortunately that’s not always true. A request can: Execute Successfully ↓ Response Never Arrives
From the client’s perspective: Failure
From the server’s perspective: Success
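To make the gap between the two perspectives concrete, here is a minimal sketch. The names `handleRequest`, `callOverLossyNetwork`, and `serverExecutions` are hypothetical, and the "network" is simulated in memory rather than being a real HTTP call.

```javascript
// Hypothetical server-side handler: counts how many times the work really ran.
let serverExecutions = 0;

async function handleRequest() {
  serverExecutions += 1; // the side effect happens here
  return { ok: true };
}

// Hypothetical network hop that delivers the request but loses the response.
async function callOverLossyNetwork() {
  await handleRequest(); // the server does the work...
  throw new Error("Timed out waiting for response"); // ...but the client never hears back
}

async function demo() {
  let clientSawFailure = false;
  try {
    await callOverLossyNetwork();
  } catch (err) {
    clientSawFailure = true; // all the client knows is "it failed"
  }
  return { clientSawFailure, serverExecutions };
}
```

Calling `demo()` resolves with `clientSawFailure` set to `true` and `serverExecutions` equal to `1`: the client has no way to tell this case apart from a request that never reached the server.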
Now a retry becomes dangerous. Imagine a payment service. await chargeCard(order);
The card processor successfully charges: $100
The response is lost due to a network issue. Client sees: Request Failed
and retries. await chargeCard(order);
again. Now: Charge #1 = Success Charge #2 = Success
The customer paid twice. Nobody wrote bad logic. The retry created the bug.
Consider: await sendWelcomeEmail(user);
Email provider accepts the message. Response times out. Application retries. await sendWelcomeEmail(user);
again. Customer receives: Welcome! Welcome! Welcome! Welcome!
Support ticket created. Marketing team confused. The retry succeeded. Too well.
This is the core issue. A pure operation such as: 2 + 2
can run any number of times. Nothing changes. Side effects are different. Examples: Charge Card, Create Order, Send Email, Book Seat, Reserve Inventory, Send SMS
Each execution changes reality. Retries repeat reality. And reality doesn’t always appreciate repetition.
One failed request isn’t scary. Ten thousand retries are. Imagine: Service A
becomes slow. Clients start retrying. Traffic doubles. Service becomes slower. More retries occur. Traffic doubles again. Eventually: Small Failure ↓ Massive Outage
This pattern is usually called a retry storm, a close relative of the thundering herd problem, and automatic retries are what drive it.
Suppose: Database
is under heavy load. Queries start timing out. Application retries automatically. Now: More Queries ↓ More Load ↓ More Timeouts ↓ More Retries
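A rough way to size this loop is to count how many attempts one logical request turns into. The sketch below is a back-of-the-envelope model, not a measurement: `expectedAttempts` is a hypothetical helper, and it assumes every attempt fails independently with the same probability. Real overloads violate that assumption, because the failure rate climbs as load climbs, so the numbers here understate the feedback effect.

```javascript
// Expected number of attempts per logical request when each attempt fails
// with probability `failureRate` and the client retries up to `maxRetries` times.
// Assumption: attempts fail independently (a simplification).
function expectedAttempts(failureRate, maxRetries) {
  let attempts = 0;
  let probabilityOfReachingAttempt = 1; // the first attempt always happens
  for (let i = 0; i <= maxRetries; i++) {
    attempts += probabilityOfReachingAttempt;
    probabilityOfReachingAttempt *= failureRate; // the next attempt only happens after a failure
  }
  return attempts;
}

expectedAttempts(0, 3);   // → 1      healthy service: no extra load
expectedAttempts(0.5, 3); // → 1.875  half of attempts failing: almost double the queries
expectedAttempts(1, 3);   // → 4      everything timing out: four times the queries
```

With `retry(3)` in every client, a database that is timing out on everything receives four times the traffic that made it time out in the first place.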
You have accidentally built a denial-of-service attack against your own database.
In the previous article we discussed: Idempotency
This is where it becomes critical. Without idempotency: Retry = Repeat Side Effects
With idempotency: Retry = Same Result
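One common way to get that property is an idempotency key: the caller sends the same key on every retry of the same logical operation, and the server replays the stored result instead of repeating the side effect. The following is a minimal sketch under stated assumptions. `chargeOnce`, `completedByKey`, and `chargesExecuted` are hypothetical names, and the in-memory `Map` stands in for what would need to be a durable store with an atomic insert-if-absent in a real service.

```javascript
// Hypothetical store of results, keyed by idempotency key.
// Assumption: a real system would persist this and guard it against concurrent writers.
const completedByKey = new Map();
let chargesExecuted = 0;

async function chargeOnce(idempotencyKey, amount) {
  if (completedByKey.has(idempotencyKey)) {
    // A retry of something already done: return the original result, charge nothing.
    return completedByKey.get(idempotencyKey);
  }
  chargesExecuted += 1; // the real side effect, performed once per key
  const result = { chargeId: `ch_${chargesExecuted}`, amount };
  completedByKey.set(idempotencyKey, result);
  return result;
}

async function demoRetry() {
  const first = await chargeOnce("order-42", 100);
  const retried = await chargeOnce("order-42", 100); // client retries after a lost response
  return { sameCharge: first.chargeId === retried.chargeId, chargesExecuted };
}
```

In `demoRetry`, the second call returns the same `chargeId` and `chargesExecuted` stays at `1`. The key has to identify the logical operation (here, the order), not the individual HTTP attempt, or every retry would look like new work.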
A retry becomes safe. That’s why reliable systems almost always combine: Retries + Idempotency
rather than using retries alone.
A common mistake: retry(3)
for every error. Consider: 400 Bad Request
Retrying won’t help. The request is invalid. Or: 401 Unauthorized
Retrying won’t magically authenticate the user. Good retry policies distinguish between: Transient Failures
and Permanent Failures
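That distinction can be sketched as a small predicate plus a retry loop that consults it. Everything here is illustrative: `isTransient` and `retryTransient` are hypothetical names, errors are assumed to carry a numeric `status` property, and the set of retryable status codes is a common convention rather than a rule, so it should be tuned per API.

```javascript
// Assumption: these HTTP statuses usually indicate a temporary condition.
// 408 Request Timeout, 429 Too Many Requests, 502/503/504 gateway and availability errors.
const RETRYABLE_STATUSES = new Set([408, 429, 502, 503, 504]);

function isTransient(status) {
  return RETRYABLE_STATUSES.has(status);
}

// Retries `operation` up to `maxRetries` extra times, but only for transient failures.
// Errors without a recognised status (including 400 and 401) are rethrown immediately.
// Note: errors with no status at all, such as a dropped connection, are also rethrown
// here; a real policy has to decide explicitly how to treat them.
async function retryTransient(maxRetries, operation) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (!isTransient(err.status) || attempt >= maxRetries) {
        throw err;
      }
    }
  }
}
```

With this in place, a `400 Bad Request` costs exactly one attempt, while a `503 Service Unavailable` is retried until it succeeds or the budget runs out. This loop still retries immediately, which is a separate problem.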
Bad: Retry Immediately ↓ Retry Immediately ↓ Retry Immediately
Better: 1 Second ↓ 2 Seconds ↓ 4 Seconds ↓ 8 Seconds
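A doubling schedule like the one above can be sketched in a few lines. The helper names (`backoffDelayMs`, `retryWithBackoff`), the 1 second base, and the 30 second cap are assumptions chosen for illustration, and the `wait` parameter exists only so the delays can be swapped out in tests.

```javascript
// Delay before the next attempt: base * 2^attempt, capped so it cannot grow forever.
// attempt is 0 for the wait after the first failure, 1 after the second, and so on.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries `operation` up to `maxRetries` extra times, waiting longer after each failure.
async function retryWithBackoff(operation, maxRetries = 3, wait = sleep) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retry budget exhausted: surface the failure
      await wait(backoffDelayMs(attempt));
    }
  }
}

backoffDelayMs(0); // → 1000
backoffDelayMs(1); // → 2000
backoffDelayMs(2); // → 4000
backoffDelayMs(3); // → 8000
```

Many production clients also add random jitter to each delay so that clients which failed together do not all retry at the same instant. That is an addition beyond what this article describes and is left out of the sketch.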
This is: Exponential Backoff
and it helps keep clients from overwhelming services that are already struggling.
Imagine: Reserve Seat
times out. Client retries. Without protection: Seat Reserved Twice
or: Two Different Seats Reserved
Now inventory becomes inconsistent. Airlines spend enormous effort preventing these scenarios, because retries happen constantly.
Webhook providers often retry automatically. For example: Payment Completed
may arrive: 1 Time, 2 Times, or 5 Times
depending on delivery conditions. Systems that assume: Exactly Once
processing usually fail eventually. Systems that expect retries survive.
Kafka. RabbitMQ. SQS. Azure Service Bus. In their usual at-least-once configurations, all assume: Messages May Be Delivered Again
because reliability is more important than uniqueness. Consumers must be designed accordingly.
Retry policies tend to go wrong in the same few ways. Retrying every error ignores that not every failure is recoverable. Retrying immediately often makes outages worse. Retrying operations that are not idempotent creates duplicate side effects. Retrying without a limit eventually becomes infinite damage. And retries should not become a substitute for monitoring.
Used well, retries pay off: transient failures disappear, temporary outages become invisible, systems tolerate instability, many failures self-heal, and network failures become manageable.
They also carry costs: without idempotency they duplicate operations, they can worsen outages, one issue can spread across systems, backoff strategies become necessary, and retries can mask deeper issues.
Most developers think retries exist to make software more reliable. That’s only partially true. Retries don’t eliminate failures. They change failures. Sometimes they transform: Temporary Network Problem
into: Duplicate Payment
Sometimes they transform: Slow Database
into: Full System Outage
That’s why experienced engineers don’t ask: Should We Retry?
They ask: What Happens If This Operation Executes Twice?
Because once retries enter the picture, duplicate execution is no longer an edge case. It’s a certainty. And reliable systems are designed with that reality in mind.
In the next article we’ll discuss: The Myth Of Stateless Systems
Because many systems described as “stateless” are actually storing state somewhere else. And that distinction turns out to be extremely important.
Hi, I’m Amrish Khan. I enjoy building developer tools, exploring software architecture, and writing about the deeper ideas behind everyday programming concepts.