Why Retry Is One Of The Most Dangerous Keywords In Software

Published: June 13, 2026 at 02:27 AM EDT

Source: Dev.to

Few lines of code look more innocent than this: retry(3)

It feels responsible. Professional. Resilient. After all, networks fail. Servers become unavailable. Databases occasionally time out. Retrying seems like the obvious solution. And sometimes it is.

But after enough years building production systems, I’ve become convinced of something: Retry is one of the most dangerous keywords in software. Not because retries are bad. Because retries amplify everything. Good systems become more reliable. Bad systems become disasters.

The problem is that many developers treat retries as a reliability feature when they’re actually a distributed systems feature. And distributed systems are where simple ideas go to become complicated.

Imagine: await fetch("/api/users");

The request fails. Maybe: a network hiccup, a temporary database issue, a load balancer restart, or a service deployment. The operation might succeed if attempted again. So we write: retry(3)

Seems reasonable. And in many cases: It Works

Which is why retries become popular. Most developers unconsciously assume: Failure = Operation Did Not Execute

Unfortunately that’s not always true. A request can: Execute Successfully ↓ Response Never Arrives

From the client’s perspective: Failure

From the server’s perspective: Success
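A minimal sketch of that gap, using a hypothetical in-memory `server` object and a hypothetical `callWithLostResponse` transport (neither comes from a real library). The handler runs to completion, but the client only ever sees an error:

```javascript
// Hypothetical in-memory "server": the side effect is a counter increment.
const server = {
  executions: 0,
  handle() {
    this.executions += 1; // the operation really runs
    return { ok: true };
  },
};

// Hypothetical transport that runs the handler, then loses the response.
async function callWithLostResponse(target) {
  target.handle(); // server-side work completes
  throw new Error("timeout: response never arrived");
}

// The client's view: any thrown error looks like "it did not happen".
async function clientAttempt(target) {
  try {
    await callWithLostResponse(target);
    return "success";
  } catch (err) {
    return "failure";
  }
}
```

After one call to `clientAttempt(server)`, the client holds `"failure"` while `server.executions` is already 1.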

Now a retry becomes dangerous. Imagine a payment service. await chargeCard(order);

The card processor successfully charges: $100

The response is lost due to a network issue. Client sees: Request Failed

and retries. await chargeCard(order);

again. Now: Charge #1 = Success, Charge #2 = Success

The customer paid twice. Nobody wrote bad logic. The retry created the bug.

Consider: await sendWelcomeEmail(user);

Email provider accepts the message. Response times out. Application retries. await sendWelcomeEmail(user);

again. Customer receives: Welcome! Welcome! Welcome! Welcome!

Support ticket created. Marketing team confused. The retry succeeded. Too well.

This is the core issue. Pure operations: 2 + 2

can run forever. Nothing changes. Side effects are different. Examples: Charge Card, Create Order, Send Email, Book Seat, Reserve Inventory, Send SMS

Each execution changes reality. Retries repeat reality. And reality doesn’t always appreciate repetition.

One failed request isn’t scary. Ten thousand retries are. Imagine: Service A

becomes slow. Clients start retrying. Traffic doubles. Service becomes slower. More retries occur. Traffic doubles again. Eventually: Small Failure ↓ Massive Outage

This is commonly called a retry storm, a close relative of: The Thundering Herd Problem

And retries are often the cause. Suppose: Database

is under heavy load. Queries start timing out. Application retries automatically. Now: More Queries ↓ More Load ↓ More Timeouts ↓ More Retries
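As rough arithmetic for that loop, assume each attempt fails independently with probability p and clients make up to `maxRetries` extra attempts. Both are simplifying assumptions, and `expectedAttempts` is a hypothetical helper. The average number of queries per logical request is then 1 + p + p² + ... up to the retry limit:

```javascript
// Expected attempts per logical request: 1 + p + p^2 + ... + p^maxRetries.
// Attempt k+1 only happens if the first k attempts all failed.
function expectedAttempts(failureProbability, maxRetries) {
  let total = 0;
  let reachProbability = 1; // chance that this attempt is made at all
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    total += reachProbability;
    reachProbability *= failureProbability;
  }
  return total;
}

// Healthy database: barely any extra load.
console.log(expectedAttempts(0.01, 3));
// Database timing out on every query: load is multiplied by 4.
console.log(expectedAttempts(1, 3)); // 4
```

In a real system p is not fixed: it rises as load rises, which is what turns this multiplier into a feedback loop.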

You have accidentally built a denial-of-service attack against your own database.

In the previous article we discussed: Idempotency

This is where it becomes critical. Without idempotency: Retry = Repeated Side Effects

With idempotency: Retry = Same Result
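One common way to get the second behavior is an idempotency key: the client sends the same key on every attempt, and the server stores the first result under that key. A minimal in-memory sketch, where `chargeOnce`, `processed`, and the key format are hypothetical names for illustration:

```javascript
const processed = new Map(); // idempotency key -> stored result
let chargesMade = 0; // stands in for the real side effect

function chargeOnce(idempotencyKey, amount) {
  // A retry with a known key gets the original result back.
  if (processed.has(idempotencyKey)) {
    return processed.get(idempotencyKey);
  }
  chargesMade += 1; // the side effect happens only on the first attempt
  const result = { chargeId: chargesMade, amount };
  processed.set(idempotencyKey, result);
  return result;
}

const first = chargeOnce("order-42", 100);
const retry = chargeOnce("order-42", 100); // same key, no second charge
console.log(first === retry, chargesMade); // true 1
```

This sketch is synchronous and keeps everything in memory. A production version would likely need a durable store and an atomic check-and-set so that two concurrent attempts with the same key cannot both run the side effect.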

A retry becomes safe. That’s why reliable systems almost always combine: Retries + Idempotency

rather than using retries alone.

A common mistake: retry(3)

for every error. Consider: 400 Bad Request

Retrying won’t help. The request is invalid. Or: 401 Unauthorized

Retrying won’t magically authenticate the user. Good retry policies distinguish between: Transient Failures

and Permanent Failures
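A sketch of that distinction for HTTP status codes. The exact set is a policy decision: the codes below are an assumption about what is commonly treated as transient, and `isRetryable` is a hypothetical helper:

```javascript
// Status codes this sketch treats as transient:
// 408 Request Timeout, 429 Too Many Requests,
// 502 Bad Gateway, 503 Service Unavailable, 504 Gateway Timeout.
const TRANSIENT_STATUSES = new Set([408, 429, 502, 503, 504]);

function isRetryable(status) {
  return TRANSIENT_STATUSES.has(status);
}

console.log(isRetryable(400)); // false: the request itself is invalid
console.log(isRetryable(401)); // false: retrying will not authenticate
console.log(isRetryable(503)); // true: the service may recover
```

A retryable status only says the failure may be temporary. It does not say the operation is safe to repeat.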

Bad: Retry Immediately ↓ Retry Immediately ↓ Retry Immediately

Better: 1 Second ↓ 2 Seconds ↓ 4 Seconds ↓ 8 Seconds
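The doubling schedule above can be sketched as follows. `backoffDelayMs`, `retryWithBackoff`, and the injectable `wait` parameter are hypothetical names chosen for illustration, not part of any library:

```javascript
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, ...
function backoffDelayMs(attempt, baseMs = 1000) {
  return baseMs * 2 ** attempt;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation`, waiting longer after each failure.
// `maxRetries` is the number of extra attempts after the first one.
async function retryWithBackoff(operation, maxRetries = 3, wait = sleep) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: surface the failure
      await wait(backoffDelayMs(attempt));
    }
  }
}
```

With the defaults, a call that keeps failing is attempted four times in total, with waits of 1, 2, and 4 seconds in between. Passing `wait` as a parameter is only there so the delays can be replaced in tests.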

This is: Exponential Backoff

and it prevents systems from overwhelming already struggling services.

Imagine: Reserve Seat

times out. Client retries. Without protection: Seat Reserved Twice

or: Two Different Seats Reserved

Now inventory becomes inconsistent. Airlines spend enormous effort preventing these scenarios. Because retries happen constantly.

Webhook providers often retry automatically. For example: Payment Completed

may arrive: 1 time, 2 times, or 5 times

depending on delivery conditions. Systems that assume: Exactly Once

processing usually fail eventually. Systems that expect retries survive.

Kafka. RabbitMQ. SQS. Azure Service Bus. In their usual at-least-once configurations, all assume: Messages May Be Delivered Again

because reliability is more important than uniqueness. Consumers must be designed accordingly.

Retries fail in predictable ways. Not every failure is recoverable. Retrying often makes outages worse. It creates duplicate side effects. Left unbounded, it eventually becomes infinite damage. And retries should not become a substitute for monitoring.

When retries work: Transient failures disappear. Temporary outages become invisible. Systems tolerate instability. Many failures self-heal. Network failures become manageable.

When they don’t: Without idempotency, side effects repeat. Retries can worsen outages. One issue spreads across systems. Backoff strategies become necessary. Retries can mask deeper issues.

Most developers think retries exist to make software more reliable. That’s only partially true. Retries don’t eliminate failures. They change failures. Sometimes they transform: Temporary Network Problem

into: Duplicate Payment

Sometimes they transform: Slow Database

into: Full System Outage

That’s why experienced engineers don’t ask: Should We Retry?

They ask: What Happens If This Operation Executes Twice?

Because once retries enter the picture, duplicate execution is no longer an edge case. It’s a certainty. And reliable systems are designed with that reality in mind.

In the next article we’ll discuss: The Myth Of Stateless Systems. Because many systems described as “stateless” are actually storing state somewhere else. And that distinction turns out to be extremely important.

Hi, I’m Amrish Khan. I enjoy building developer tools, exploring software architecture, and writing about the deeper ideas behind everyday programming concepts. I’m also building Aruvix, a growing ecosystem of local-first developer tools designed to process data directly in the browser without unnecessary uploads.

Here’s a detailed blog on Aruvix: https://dev.to/amrishkhan05/aruvix-the-ultimate-offline-first-developer-toolkit-e0i

You can follow my work and thoughts here:
Portfolio: https://www.amrishkhan.dev
LinkedIn: https://www.linkedin.com/in/amrishkhan
GitHub: https://www.github.com/amrishkhan05

If you enjoyed this article, consider following for more deep dives into JavaScript, architecture, local-first software, and performance engineering.
