Retry Strategies Compared: Constant vs Exponential Backoff vs Jitter in Go (With Simulation)
Source: Dev.to
The Thundering Herd Problem
Your server goes down. There are 1 000 clients already waiting. What happens when it comes back?
Imagine a server that can handle 200 requests per second. A dependency goes down for about 10 seconds; every in‑flight request fails, and all clients start retrying.
When the server recovers, all 1 000 clients retry at once. This is the thundering herd problem, and your retry strategy determines whether the server recovers in seconds or drowns under a wave of redundant requests.
Most engineers know they should use exponential backoff, but fewer realize it can still cause synchronized spikes. Even fewer have seen what decorrelated jitter actually does to the request‑distribution curve.
What I Built
I created a simulation of 1 000 concurrent Go clients, a server with fixed capacity, and four retry strategies fighting for recovery. The simulation mimics a production scenario:
- Fixed server capacity
- A hard outage window
- Clients that keep retrying until they succeed
Every request is tracked per‑second, so you can see the thundering herd form, peak, and (hopefully) dissipate.
Strategy Signature
All strategies share the same signature:
type Strategy func(attempt int, prevDelay time.Duration) time.Duration
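All of the snippets below also reference two shared constants that aren't shown here. Judging by the comments (a 100 ms starting delay capped at 10 s), they would look something like this sketch; the exact names and values are assumptions:
const (
    baseSleep = 100 * time.Millisecond // assumed starting delay for backoff
    capSleep  = 10 * time.Second       // assumed upper bound on any single delay
)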
Retry Strategies
1. Constant Retry (no backoff)
// Fixed delay, no backoff.
func constantRetry(_ int, _ time.Duration) time.Duration {
    return 1 * time.Millisecond
}
Formula: delay = 1 ms (constant)
The worst‑case pattern: a tiny fixed delay, repeated forever. It is what you get when an engineer writes a quick retry loop without thinking about backoff at all.
2. Exponential Backoff
// Double the delay on each attempt, starting at 100 ms and capped at 10 s.
func exponentialBackoff(attempt int, _ time.Duration) time.Duration {
    d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt)))
    if d > capSleep {
        d = capSleep
    }
    return d
}
Formula: delay = min(base * 2^attempt, cap)
Better than constant retry, but there is a catch: all 1 000 clients start at the same time, so they all make their first retry at t ≈ 100 ms, their second at ≈ 300 ms, their third at ≈ 700 ms, and so on.
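You can check this arithmetic by summing the capped exponential delays that every client shares. This is a small standalone sketch (reusing the constants assumed above), not code from the repository; the cumulative times it prints line up with the spike locations in the observations further down:
package main
import (
    "fmt"
    "math"
    "time"
)
const (
    baseSleep = 100 * time.Millisecond
    capSleep  = 10 * time.Second
)
// With no jitter, every client's n-th retry lands at the same cumulative
// time, which is exactly where the histogram spikes appear.
func main() {
    total := time.Duration(0)
    for attempt := 0; attempt < 9; attempt++ {
        d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt)))
        if d > capSleep {
            d = capSleep
        }
        total += d
        fmt.Printf("retry %d at t ~ %v\n", attempt+1, total)
    }
}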
3. Full Jitter
// Randomize the delay between 0 and the exponential backoff value.
func fullJitter(attempt int, _ time.Duration) time.Duration {
    d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt)))
    if d > capSleep {
        d = capSleep
    }
    return time.Duration(rand.Int63n(int64(d)))
}
Formula: delay = random(0, min(base * 2^attempt, cap))
This strategy is described in the AWS Architecture Blog and typically yields the fewest total calls. Because each client picks a random delay, they no longer retry at the same time. Instead of 1 000 clients all hitting the server at once, they arrive spread out.
4. Decorrelated Jitter
// Each delay is random between base and 3 * previous_delay.
func decorrelatedJitter(_ int, prev time.Duration) time.Duration {
    if prev < baseSleep {
        prev = baseSleep
    }
    // Random delay in [baseSleep, prev*3), then capped.
    d := baseSleep + time.Duration(rand.Int63n(int64(prev*3-baseSleep)))
    if d > capSleep {
        d = capSleep
    }
    return d
}
Formula: delay = random(base, prev * 3) (capped)
This also comes from the AWS blog. Instead of using the attempt number, each delay depends on the previous delay. The multiplier (3) is a practical choice from AWS; it controls how fast delays grow. With 3, the average delay grows by about 1.5× per retry (the midpoint of random(base, prev*3)). That’s enough to spread clients out without making them wait too long. You could use 2 or 4 and it would still work, just with different trade‑offs.
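If you want to sanity-check the 1.5× figure, you can average the decorrelated delay at each retry over many simulated clients. This is just a Monte Carlo sketch using the assumed constants, not code from the repository:
package main
import (
    "fmt"
    "math/rand"
    "time"
)
const (
    baseSleep = 100 * time.Millisecond
    capSleep  = 10 * time.Second
)
// Averages the decorrelated-jitter delay at each retry across many clients;
// successive averages grow by roughly 1.5x until the cap takes over.
func main() {
    const clients, retries = 100000, 8
    sums := make([]time.Duration, retries)
    for c := 0; c < clients; c++ {
        prev := baseSleep
        for r := 0; r < retries; r++ {
            d := baseSleep + time.Duration(rand.Int63n(int64(prev*3-baseSleep)))
            if d > capSleep {
                d = capSleep
            }
            sums[r] += d
            prev = d
        }
    }
    for r, s := range sums {
        fmt.Printf("retry %d: average delay %v\n", r+1, s/clients)
    }
}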
Server Model
type Server struct {
    mu       sync.Mutex
    capacity int           // requests per second the server can handle
    downFor  time.Duration // outage duration
    start    time.Time
    requests map[int]int   // second -> total request count
    accepted map[int]int   // second -> accepted request count
}
The server counts how many requests it receives each second. If it is still in the outage window or already at capacity, it rejects the request.
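The client loop below calls srv.Do(), which isn't shown in this post. Given the struct and the description above, it plausibly looks like this sketch (internals assumed; the maps are presumed to be initialized by a constructor):
// Do records one request for the current second and reports whether the
// server accepted it. Requests are rejected during the outage window or
// once the current second is already at capacity.
func (s *Server) Do() bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    sec := int(time.Since(s.start).Seconds())
    s.requests[sec]++
    if time.Since(s.start) < s.downFor { // still inside the outage
        return false
    }
    if s.accepted[sec] >= s.capacity { // this second is already full
        return false
    }
    s.accepted[sec]++
    return true
}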
Client Loop
func clientLoop(srv *Server, strategy Strategy, metrics *Metrics) {
    start := time.Now()
    attempt := 0
    prevDelay := baseSleep
    for {
        if srv.Do() { // success
            metrics.Record(time.Since(start), attempt)
            return
        }
        delay := strategy(attempt, prevDelay)
        prevDelay = delay
        attempt++
        time.Sleep(delay)
    }
}
Each client runs in its own goroutine and keeps retrying until it gets a success.
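The driver that starts the clients isn't shown either; a minimal version (function name assumed, Metrics as used above) is just a WaitGroup over 1 000 goroutines:
// Launch one goroutine per client and wait until every client has succeeded.
func runSimulation(srv *Server, strategy Strategy, metrics *Metrics) {
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            clientLoop(srv, strategy, metrics)
        }()
    }
    wg.Wait()
}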
Simulation Parameters
| Parameter | Value |
|---|---|
| Number of clients | 1 000 (all start simultaneously) |
| Server capacity | 200 req/s |
| Outage duration | 10 seconds |
| Request handling | Reject during outage or when at capacity |
| Metrics collected | Requests per second (histogram), total wasted requests, p99 latency |
The simulation draws a bar chart in the terminal so you can literally see the thundering herd.
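The repository handles the chart drawing, but for illustration, an ASCII histogram over the per-second counters can be as simple as this sketch (scaling and formatting are my own choices here, and it assumes the standard fmt and strings packages are imported):
// Print one bar per second, scaled so the busiest second fits in 60 columns.
func printHistogram(requests map[int]int) {
    maxCount, lastSec := 1, 0
    for sec, n := range requests {
        if n > maxCount {
            maxCount = n
        }
        if sec > lastSec {
            lastSec = sec
        }
    }
    for sec := 0; sec <= lastSec; sec++ {
        n := requests[sec]
        bar := strings.Repeat("#", n*60/maxCount)
        fmt.Printf("%3ds |%-60s| %d\n", sec, bar, n)
    }
}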
Observations
Constant Retry
- Each client retries every 1 ms.
- With 1 000 clients, the server receives hundreds of thousands of requests per second, but can only handle 200.
- Over the test run we observed >8 million wasted requests to serve the 1 000 clients.
- The flood continues for several seconds after the server recovers; it only stops when each client finally succeeds.
Exponential Backoff
- Histogram: spikes at seconds 0, 1, 3, 6, 12, 22, 32, 42, 52 with nothing in between. All clients double their delay at the same rate, causing synchronized mini‑herds.
- The gaps between spikes grow and total wasted requests drop compared to constant retry, but they remain high and the herd effect is still unmistakable.
Full Jitter
- The histogram shows a smooth curve decreasing from ~5 000 requests at second 0 to ~13 at second 19.
- No sharp spikes – clients are spread out.
- Wasted requests: ~8 468 (≈0.1 % of constant‑retry case).
- p99 latency: 52 seconds (still high because some clients get unlucky long delays).
Decorrelated Jitter
- Histogram is also smooth, similar to full jitter, with a bit more noise.
- Wasted requests: ~10 695 (slightly higher than full jitter).
- Requests over capacity: 137 (due to occasional short delays that cascade).
Why? If a client randomly gets a short delay, the next delay is based on that short value, allowing the client to retry quickly for a few attempts before the jitter widens.
Takeaways
| Strategy | Spikes? | Total Wasted Requests | p99 Latency |
|---|---|---|---|
| Constant Retry | Yes | > 8 M | > 50 s |
| Exponential Backoff | Yes | ~ 1 M | > 40 s |
| Full Jitter | No | ~ 8.5 k | 52 s |
| Decorrelated Jitter | No | ~ 10.7 k | ~ 45 s |
- Constant and exponential backoff produce synchronized spikes (the classic thundering herd).
- Full jitter eliminates spikes but can still give some clients very long delays, inflating tail latency.
- Decorrelated jitter spreads retries while keeping the average delay growth moderate, offering a good trade‑off between request‑rate smoothing and latency.
References
- AWS Architecture Blog – “Exponential Backoff & Jitter” (covers full jitter and decorrelated jitter).
- Google Cloud Blog – “Retry Strategies for Distributed Systems”.
Feel free to adapt the code and parameters to your own environment. The core idea is simple: add randomness to retry intervals to break up synchronized bursts.
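If you want a drop-in starting point for your own services, here is a compact full-jitter retry wrapper. It is a sketch, not code from the simulator, and it assumes base and maxDelay are positive:
package retry
import (
    "context"
    "math"
    "math/rand"
    "time"
)
// RetryWithFullJitter retries fn with full-jitter exponential backoff until
// it succeeds, maxAttempts is exhausted, or ctx is cancelled.
func RetryWithFullJitter(ctx context.Context, maxAttempts int, base, maxDelay time.Duration, fn func() error) error {
    for attempt := 0; ; attempt++ {
        err := fn()
        if err == nil {
            return nil
        }
        if attempt+1 >= maxAttempts {
            return err
        }
        d := time.Duration(float64(base) * math.Pow(2, float64(attempt)))
        if d > maxDelay {
            d = maxDelay
        }
        d = time.Duration(rand.Int63n(int64(d))) // full jitter: random in [0, d)
        select {
        case <-time.After(d):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}
A call like RetryWithFullJitter(ctx, 5, 100*time.Millisecond, 10*time.Second, doRequest) gives you the full-jitter behavior from the simulation with reasonable defaults.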
Overview
Full jitter outperformed every other strategy in this simulation.
The test, however, represents a worst‑case scenario where all 1 000 clients fail simultaneously – an unlikely pattern in production. In real deployments, client failures are staggered, and decorrelated jitter tends to handle that situation more gracefully because each client determines its own retry rhythm instead of following a single, synchronized schedule.
Strategy Summary
| Strategy | Production Viability | Remarks |
|---|---|---|
| Constant retry | ❌ Never recommended in production | Leads to massive thundering‑herd problems. |
| Exponential backoff (no jitter) | ⚠️ Only for very few clients | Synchronisation issues appear once the herd grows. |
| Full jitter | ✅ Default choice | Best results in the simulation; simplest to implement. |
| Decorrelated jitter | ✅ Recommended for staggered failures | Similar performance to full jitter, but adapts per‑client based on its own history, making it more robust when failures are spread over time. |
How to Run the Simulator
# Clone the repository
git clone https://github.com/RafaelPanisset/retry-strategies-simulator
cd retry-strategies-simulator
# Run each strategy (each execution takes ~15–60 seconds)
go run main.go -strategy=constant
go run main.go -strategy=backoff
go run main.go -strategy=jitter
go run main.go -strategy=decorrelated
During execution the program prints a live ASCII histogram that visualizes the per‑second request distribution for the chosen strategy.
Additional References
- Marc Brooker, “Exponential Backoff and Jitter”
- Marc Brooker, “Timeouts, retries, and backoff with jitter”
- Google Cloud, “Retry Strategy”
Takeaway: Use full jitter as the go‑to retry policy, but consider decorrelated jitter when you expect client failures to be spread out over time. This approach minimizes herd effects while keeping the implementation straightforward.