Retry Strategies Compared: Constant vs Exponential Backoff vs Jitter in Go (With Simulation)

Published: January 31, 2026 at 01:42 PM EST
6 min read
Source: Dev.to

The Thundering Herd Problem

Your server goes down. There are 1 000 clients already waiting. What happens when it comes back?

Imagine a server that can handle 200 requests / second. A dependency goes down for about 10 seconds; every in‑flight request fails, and all clients start retrying.

When the server recovers, all 1 000 clients retry at once. This is called the thundering herd problem, and your retry strategy determines whether the server recovers in seconds or drowns under a wave of redundant requests.

Most engineers know they should use exponential backoff, but fewer realize it can still cause synchronized spikes. Even fewer have seen what decorrelated jitter actually does to the request‑distribution curve.

What I Built

I created a simulation of 1 000 concurrent Go clients, a server with fixed capacity, and four retry strategies fighting for recovery. The simulation mimics a production scenario:

  • Fixed server capacity
  • A hard outage window
  • Clients that keep retrying until they succeed

Every request is tracked per‑second, so you can see the thundering herd form, peak, and (hopefully) dissipate.

Strategy Signature

All strategies share the same signature:

type Strategy func(attempt int, prevDelay time.Duration) time.Duration
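
The strategy snippets below also reference two package-level constants, baseSleep and capSleep, which are not shown in the excerpts. Here is a minimal sketch of what they might look like, assuming the 100 ms base and 10 s cap mentioned in the prose (the real repository may use different names or values):

// Assumed package-level configuration referenced by the snippets below.
// Values follow the prose: 100 ms starting delay, 10 s maximum delay.
const (
    baseSleep = 100 * time.Millisecond // initial backoff delay
    capSleep  = 10 * time.Second       // maximum backoff delay
)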

Retry Strategies

1. Constant Retry (no backoff)

// Fixed delay, no backoff.
func constantRetry(_ int, _ time.Duration) time.Duration {
    return 1 * time.Millisecond
}

Formula: delay = 1 ms (constant)

This is the worst‑case pattern: a tiny fixed delay with no backoff at all. It is what happens when engineers write a quick retry loop without thinking it through.

2. Exponential Backoff

// Double the delay on each attempt, starting at 100 ms and capped at 10 s.
func exponentialBackoff(attempt int, _ time.Duration) time.Duration {
    d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt)))
    if d > capSleep {
        d = capSleep
    }
    return d
}

Formula: delay = min(base * 2^attempt, cap)

Better than constant retry, but there is a catch. All 1 000 clients start at the same time, so they all hit attempt 1 at t = 100 ms, attempt 2 at t = 300 ms, attempt 3 at t = 700 ms, etc.
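
One quick way to see this synchronization is to print the cumulative retry times that every client shares. This is a standalone sketch, assuming the 100 ms base above:

package main

import (
    "fmt"
    "time"
)

// Without jitter, every client computes the same delays, so they all retry
// at the same cumulative instants: 100 ms, 300 ms, 700 ms, 1.5 s, 3.1 s, ...
func main() {
    base := 100 * time.Millisecond
    var elapsed time.Duration
    for attempt := 0; attempt < 6; attempt++ {
        elapsed += base * time.Duration(1<<attempt) // delay doubles each attempt
        fmt.Printf("attempt %d retries at t = %v\n", attempt+1, elapsed)
    }
}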

3. Full Jitter

// Randomize the delay between 0 and the exponential backoff value.
func fullJitter(attempt int, _ time.Duration) time.Duration {
    d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt)))
    if d > capSleep {
        d = capSleep
    }
    return time.Duration(rand.Int63n(int64(d)))
}

Formula: delay = random(0, min(base * 2^attempt, cap))

This strategy is described in the AWS Architecture Blog and typically yields the fewest total calls. Because each client picks a random delay, they no longer retry at the same time. Instead of 1 000 clients all hitting the server at once, they arrive spread out.
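
As a rough illustration of that spread (a standalone sketch, not part of the simulator), you can sample the first-attempt delay for 1 000 clients and bucket the results into 10 ms bins; the counts come out roughly uniform instead of one synchronized spike:

package main

import (
    "fmt"
    "math/rand"
    "time"
)

func main() {
    const (
        baseSleep = 100 * time.Millisecond
        clients   = 1000
        binWidth  = 10 * time.Millisecond
    )
    // The first full-jitter delay is uniform in [0, baseSleep).
    bins := make([]int, int(baseSleep/binWidth))
    for i := 0; i < clients; i++ {
        d := time.Duration(rand.Int63n(int64(baseSleep)))
        bins[int(d/binWidth)]++
    }
    for i, n := range bins {
        lo := i * int(binWidth/time.Millisecond)
        fmt.Printf("%3d-%3d ms: %d clients\n", lo, lo+int(binWidth/time.Millisecond), n)
    }
}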

4. Decorrelated Jitter

// Each delay is random between base and 3 * previous_delay.
func decorrelatedJitter(_ int, prev time.Duration) time.Duration {
    if prev < baseSleep {
        prev = baseSleep
    }
    d := baseSleep + time.Duration(rand.Int63n(int64(prev*3-baseSleep)))
    if d > capSleep {
        d = capSleep
    }
    return d
}

Formula: delay = random(base, prev * 3) (capped)

This also comes from the AWS blog. Instead of using the attempt number, each delay depends on the previous delay. The multiplier (3) is a practical choice from AWS; it controls how fast delays grow. With 3, the average delay grows about 1.5× per retry (the midpoint of random(base, prev*3)). That’s enough to spread clients out without making them wait too long. You could use 2 or 4 and it would still work, just with different trade‑offs.
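
To make the 1.5× figure concrete, here is a back-of-the-envelope sketch (assuming the 100 ms base): the expected value of random(base, 3*prev) is (base + 3*prev) / 2, so the mean delay grows roughly geometrically and approaches 1.5× per retry as delays get larger:

package main

import (
    "fmt"
    "time"
)

// Expected delay after each decorrelated-jitter retry, ignoring the cap:
// E[next] = (base + 3*prev) / 2, which tends towards 1.5x prev once prev >> base.
func main() {
    base := 100 * time.Millisecond
    prev := base
    for attempt := 1; attempt <= 6; attempt++ {
        next := (base + 3*prev) / 2
        fmt.Printf("attempt %d: expected delay %v (%.2fx previous)\n",
            attempt, next, float64(next)/float64(prev))
        prev = next
    }
}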

Server Model

type Server struct {
    mu       sync.Mutex
    capacity int           // requests per second the server can handle
    downFor  time.Duration // outage duration
    start    time.Time
    requests map[int]int // second -> total request count
    accepted map[int]int // second -> accepted request count
}

The server counts how many requests it receives each second. If it is still in the outage window or already at capacity, it rejects the request.
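
The request-handling method itself is not shown in the excerpt. Based on the description above and the clientLoop below, a sketch of it might look like this (the method name Do matches the client loop; the per-second bookkeeping is an assumption):

// Do records one incoming request and reports whether the server accepted it.
// A request is rejected while the outage window is still open or once this
// second's accepted count has reached capacity. Sketch only; the real
// implementation may differ.
func (s *Server) Do() bool {
    s.mu.Lock()
    defer s.mu.Unlock()

    elapsed := time.Since(s.start)
    sec := int(elapsed / time.Second)
    s.requests[sec]++ // every request is counted, accepted or not

    if elapsed < s.downFor {
        return false // still inside the outage window
    }
    if s.accepted[sec] >= s.capacity {
        return false // at capacity for this second
    }
    s.accepted[sec]++
    return true
}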

Client Loop

func clientLoop(srv *Server, strategy Strategy, metrics *Metrics) {
    start := time.Now()
    attempt := 0
    prevDelay := baseSleep

    for {
        if srv.Do() { // success
            metrics.Record(time.Since(start), attempt)
            return
        }
        delay := strategy(attempt, prevDelay)
        prevDelay = delay
        attempt++
        time.Sleep(delay)
    }
}

Each client runs in its own goroutine and keeps retrying until it gets a success.
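
The driver that wires everything together is not shown either. A minimal sketch, using the parameters listed in the next section and two hypothetical constructors (NewServer, NewMetrics) that stand in for the repository's actual setup code:

// Start every client at the same instant and wait for all of them to succeed.
// NewServer, NewMetrics and metrics.Print are assumed names, not necessarily
// the repository's API; fullJitter is hard-coded here for illustration.
func main() {
    srv := NewServer(200 /* req/s */, 10*time.Second /* outage */)
    metrics := NewMetrics()

    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            clientLoop(srv, fullJitter, metrics)
        }()
    }
    wg.Wait()
    metrics.Print()
}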

Simulation Parameters

| Parameter | Value |
| --- | --- |
| Number of clients | 1 000 (all start simultaneously) |
| Server capacity | 200 req/s |
| Outage duration | 10 seconds |
| Request handling | Reject during outage or when at capacity |
| Metrics collected | Requests per second (histogram), total wasted requests, p99 latency |

The simulation draws a bar chart in the terminal so you can literally see the thundering herd.
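
The chart-drawing code is not included in the excerpt; a minimal version (assuming fmt and strings are imported, and that the per-second counts live in the requests map shown earlier) is just a loop that prints one bar per simulated second, scaled to a fixed width:

// printHistogram renders one '#' bar per simulated second, scaled so the
// busiest second fits within `width` characters. Sketch only; the
// repository's chart code may look different.
func printHistogram(requests map[int]int, width int) {
    maxSec, maxCount := 0, 1
    for sec, n := range requests {
        if sec > maxSec {
            maxSec = sec
        }
        if n > maxCount {
            maxCount = n
        }
    }
    for sec := 0; sec <= maxSec; sec++ {
        n := requests[sec]
        bar := strings.Repeat("#", n*width/maxCount)
        fmt.Printf("%3ds | %-*s %d\n", sec, width, bar, n)
    }
}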

Observations

Constant Retry

  • Each client retries every 1 ms.
  • With 1 000 clients, the server receives hundreds of thousands of requests per second, but can only handle 200.
  • Over the test run we observed >8 million wasted requests to serve the 1 000 clients.
  • The flood continues for several seconds after the server recovers; it only stops when each client finally succeeds.

Exponential Backoff

  • Requests arrive in synchronized spikes rather than a continuous flood, and the gaps between spikes grow.
  • Histogram: spikes at seconds 0, 1, 3, 6, 12, 22, 32, 42, 52 with nothing in between. All clients double their delay at the same rate, causing synchronized mini‑herds.
  • Still a noticeable herd effect; total wasted requests drop but remain high.

Full Jitter

  • The histogram shows a smooth curve decreasing from ~5 000 requests at second 0 to ~13 at second 19.
  • No sharp spikes – clients are spread out.
  • Wasted requests: ~8 468 (≈0.1 % of constant‑retry case).
  • p99 latency: 52 seconds (still high because some clients get unlucky long delays).

Decorrelated Jitter

  • Histogram is also smooth, similar to full jitter, with a bit more noise.
  • Wasted requests: ~10 695 (slightly higher than full jitter).
  • Requests over capacity: 137 (due to occasional short delays that cascade).

Why? If a client randomly gets a short delay, the next delay is based on that short value, allowing the client to retry quickly for a few attempts before the jitter widens.

Takeaways

| Strategy | Spikes? | Total Wasted Requests | p99 Latency |
| --- | --- | --- | --- |
| Constant Retry | Yes | > 8 M | > 50 s |
| Exponential Backoff | Yes | ~ 1 M | > 40 s |
| Full Jitter | No | ~ 8.5 k | 52 s |
| Decorrelated Jitter | No | ~ 10.7 k | ~ 45 s |

  • Constant and exponential backoff produce synchronized spikes (the classic thundering herd).
  • Full jitter eliminates spikes but can still give some clients very long delays, inflating tail latency.
  • Decorrelated jitter spreads retries while keeping the average delay growth moderate, offering a good trade‑off between request‑rate smoothing and latency.

References

  • AWS Architecture Blog – “Exponential Backoff & Jitter” (covers full jitter and decorrelated jitter).
  • Google Cloud Blog – “Retry Strategies for Distributed Systems”.

Feel free to adapt the code and parameters to your own environment. The core idea is simple: add randomness to retry intervals to break up synchronized bursts.
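
If you want to reuse the same Strategy type in application code rather than in the simulator, one option (a sketch, not the simulator's API) is a small wrapper that also respects a context and an attempt limit:

// Retry runs op until it succeeds, the context is cancelled, or maxAttempts
// is reached, sleeping according to the supplied Strategy between attempts.
// Sketch only; adapt the error handling to your own codebase.
func Retry(ctx context.Context, maxAttempts int, strategy Strategy, op func() error) error {
    prevDelay := baseSleep
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        if err = op(); err == nil {
            return nil
        }
        if attempt == maxAttempts-1 {
            break // no point sleeping after the final attempt
        }
        delay := strategy(attempt, prevDelay)
        prevDelay = delay
        select {
        case <-time.After(delay):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return err
}

Calling Retry(ctx, 8, fullJitter, doRequest) then gives full-jitter behaviour with a hard stop after eight attempts; doRequest here stands for whatever operation you want to retry.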

Overview

Full jitter outperformed every other strategy in this simulation.
The test, however, represents a worst‑case scenario where all 1,000 clients fail simultaneously – an unlikely pattern in production. In real deployments, client failures are staggered, and decorrelated jitter tends to handle that situation more gracefully because each client determines its own retry rhythm instead of following a single, synchronized schedule.

Strategy Summary

| Strategy | Production Viability | Remarks |
| --- | --- | --- |
| Constant retry | ❌ Never recommended in production | Leads to massive thundering‑herd problems. |
| Exponential backoff (no jitter) | ⚠️ Only for very few clients | Synchronisation issues appear once the herd grows. |
| Full jitter | ✅ Default choice | Best results in the simulation; simplest to implement. |
| Decorrelated jitter | ✅ Recommended for staggered failures | Similar performance to full jitter, but adapts per‑client based on its own history, making it more robust when failures are spread over time. |

How to Run the Simulator

# Clone the repository
git clone https://github.com/RafaelPanisset/retry-strategies-simulator
cd retry-strategies-simulator

# Run each strategy (each execution takes ~15–60 seconds)
go run main.go -strategy=constant
go run main.go -strategy=backoff
go run main.go -strategy=jitter
go run main.go -strategy=decorrelated

During execution the program prints live ASCII histograms that visualise the distribution of retry intervals for the chosen strategy.

Additional References

  • Marc Brooker, “Exponential Backoff and Jitter”
  • Marc Brooker, “Timeouts, retries, and backoff with jitter”
  • Google Cloud, “Retry Strategy”

Takeaway: Use full jitter as the go‑to retry policy, but consider decorrelated jitter when you expect client failures to be spread out over time. This approach minimizes herd effects while keeping the implementation straightforward.
