Circuit Breakers: The Unsung Heroes of Resilient Microservices
Source: Dev.to
Why Circuit Breakers Matter
When you’re running multiple services in production, failures are unavoidable. A downstream service might spike in latency, return 500s, or disappear entirely. Without protection, a single fault can cascade across your system, tying up threads, exhausting connection pools, and eventually taking down dependent services. This is where circuit breakers shine: they let your system degrade gracefully instead of amplifying the failure.
You’ve probably used timeouts and retries, but those alone aren’t enough. Retries can make an overload worse, and a timeout still ties up resources until it fires. A circuit breaker monitors failures, and when they cross a threshold, it short‑circuits the call and fails fast so the caller can return a predefined fallback immediately. This stops your service from spending threads and connections on doomed requests and lets the downstream service recover under reduced load.
Circuit Breaker State Machine
The state machine is simple:
| State | Behavior |
|---|---|
| Closed | Normal operation; every call is passed through. Failures increment a counter. |
| Open | Calls are rejected fast without reaching the remote service. |
| Half‑Open | After a configurable timeout, a few probe requests are allowed. If they succeed, the breaker resets to Closed; otherwise it returns to Open. |
Typical thresholds:
- Failure ratio exceeds a configured limit (e.g., 50 % of the last 10 calls) → transition to Open.
- After the timeout expires → transition to Half‑Open.
- Successful probes → transition back to Closed.
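To make the transitions concrete, here is a minimal hand-rolled sketch of the state machine using only the standard library. Everything in it (`Breaker`, `NewBreaker`, `ErrOpen`, the sliding window of the last N outcomes) is a hypothetical illustration under the thresholds described above, not how any particular library implements it.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// State is one of the three breaker states from the table above.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// ErrOpen is a hypothetical sentinel returned when a call is rejected.
var ErrOpen = errors.New("circuit breaker is open")

// Breaker is an illustrative breaker, not a library type.
type Breaker struct {
	mu          sync.Mutex
	state       State
	window      []bool        // recent outcomes while Closed; true = failure
	windowSize  int           // how many recent calls to consider, e.g. 10
	maxRatio    float64       // trip when the failure ratio exceeds this, e.g. 0.5
	openTimeout time.Duration // how long to stay Open before probing
	probes      int           // probe calls admitted (and required to succeed) in Half-Open
	admitted    int           // probes admitted so far
	successes   int           // probes that have succeeded so far
	openedAt    time.Time
}

func NewBreaker(windowSize int, maxRatio float64, openTimeout time.Duration, probes int) *Breaker {
	return &Breaker{windowSize: windowSize, maxRatio: maxRatio, openTimeout: openTimeout, probes: probes}
}

// State reports the current state, moving Open to Half-Open once the timeout has expired.
func (b *Breaker) State() State {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.currentState()
}

func (b *Breaker) currentState() State {
	if b.state == Open && time.Since(b.openedAt) >= b.openTimeout {
		b.state, b.admitted, b.successes = HalfOpen, 0, 0
	}
	return b.state
}

// Call runs fn unless the breaker rejects it, then records the outcome.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	switch b.currentState() {
	case Open:
		b.mu.Unlock()
		return ErrOpen
	case HalfOpen:
		if b.admitted >= b.probes {
			b.mu.Unlock()
			return ErrOpen
		}
		b.admitted++
	}
	b.mu.Unlock()

	err := fn() // run the protected call without holding the lock

	b.mu.Lock()
	defer b.mu.Unlock()
	b.record(err != nil)
	return err
}

func (b *Breaker) record(failed bool) {
	switch b.state {
	case HalfOpen:
		if failed {
			b.trip() // a failed probe sends the breaker back to Open
			return
		}
		b.successes++
		if b.successes >= b.probes {
			b.state, b.window = Closed, nil
		}
	case Closed:
		b.window = append(b.window, failed)
		if len(b.window) > b.windowSize {
			b.window = b.window[1:]
		}
		failures := 0
		for _, f := range b.window {
			if f {
				failures++
			}
		}
		if len(b.window) == b.windowSize && float64(failures)/float64(len(b.window)) > b.maxRatio {
			b.trip()
		}
	}
}

func (b *Breaker) trip() {
	b.state, b.openedAt, b.window = Open, time.Now(), nil
}
```

With `NewBreaker(10, 0.5, 10*time.Second, 3)` this would trip once more than half of the last 10 calls have failed, reject calls for 10 seconds, then admit 3 probes. A production breaker needs more care (for example around calls that finish after a state change), which is one reason to reach for a library.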
Implementation Example (Go)
Libraries like gobreaker (Go) or resilience4j (Java) abstract the boilerplate. Below is a concise example using gobreaker:
```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/sony/gobreaker"
)

var cb *gobreaker.CircuitBreaker

func init() {
	cb = gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        "user-svc",
		MaxRequests: 3,                // allowed in half‑open state
		Interval:    30 * time.Second, // reset stats interval
		Timeout:     10 * time.Second, // time spent in open state
		ReadyToTrip: func(c gobreaker.Counts) bool {
			// Trip when ≥5 requests and failure ratio > 50 %
			return c.Requests >= 5 && float64(c.TotalFailures)/float64(c.Requests) > 0.5
		},
	})
}

// FetchUser retrieves a user, applying the circuit breaker.
func FetchUser(id string) (string, error) {
	result, err := cb.Execute(func() (interface{}, error) {
		resp, err := http.Get("http://user-service/" + id)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return nil, fmt.Errorf("upstream error: %d", resp.StatusCode)
		}
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, err
		}
		return string(body), nil
	})
	if err != nil {
		return "", err // caller can choose a fallback
	}
	return result.(string), nil
}
```
The snippet tracks failures over 30‑second windows. Once a window has seen at least 5 requests with a failure rate above 50 %, the breaker opens for 10 seconds. During that time, Execute returns immediately with gobreaker.ErrOpenState instead of calling the remote service, preserving your resources. The half‑open state then allows up to 3 probe requests to verify recovery.
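FetchUser leaves the choice of fallback to the caller. One possible shape is a "last known good value" cache, sketched here with the standard library only. `FallbackCache`, `Get`, and `ErrNoFallback` are hypothetical names, and the `fetch` parameter stands in for any breaker-protected function with FetchUser's signature.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNoFallback is a hypothetical sentinel: the fetch failed and nothing is cached.
var ErrNoFallback = errors.New("no fallback value available")

// FallbackCache remembers the last successful value per key (illustrative helper).
type FallbackCache struct {
	mu   sync.RWMutex
	last map[string]string
}

func NewFallbackCache() *FallbackCache {
	return &FallbackCache{last: make(map[string]string)}
}

// Get calls fetch. On success it stores and returns the fresh value.
// On any error, including a rejection by an open breaker, it returns the
// last good value with stale=true, or an error if nothing was cached.
func (c *FallbackCache) Get(id string, fetch func(string) (string, error)) (string, bool, error) {
	value, err := fetch(id)
	if err == nil {
		c.mu.Lock()
		c.last[id] = value
		c.mu.Unlock()
		return value, false, nil
	}

	c.mu.RLock()
	cached, ok := c.last[id]
	c.mu.RUnlock()
	if ok {
		return cached, true, nil
	}
	return "", false, fmt.Errorf("%w (cause: %v)", ErrNoFallback, err)
}
```

A caller would use it as `value, stale, err := cache.Get(id, FetchUser)`. Serving stale data is a design choice that only suits reads where a slightly old answer beats no answer; for other calls a static default or a clear error may be the better fallback.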
Advanced Usage
- Bulkheads – Limit the number of concurrent calls per breaker so a misbehaving service doesn’t exhaust your entire thread pool.
- Selective Retries – Apply retries only for transient errors (e.g., 429 or 503) and keep them out of the breaker’s failure count to avoid premature trips.
- Per‑dependency Settings – Critical services can tolerate more failures than low‑impact endpoints; tune thresholds accordingly.
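The bulkhead idea from the list above can be sketched with a buffered channel used as a semaphore. `Bulkhead`, `NewBulkhead`, and `ErrBulkheadFull` are hypothetical names for illustration; the sketch rejects calls when all slots are taken rather than queueing them, which is one of several reasonable policies.

```go
package main

import "errors"

// ErrBulkheadFull is a hypothetical sentinel returned when no slot is free.
var ErrBulkheadFull = errors.New("bulkhead full")

// Bulkhead caps the number of concurrent calls to one dependency.
type Bulkhead struct {
	slots chan struct{}
}

// NewBulkhead allows at most maxConcurrent calls in flight at once.
func NewBulkhead(maxConcurrent int) *Bulkhead {
	return &Bulkhead{slots: make(chan struct{}, maxConcurrent)}
}

// Do runs fn if a slot is free and rejects immediately otherwise,
// so a slow dependency cannot pile up goroutines behind it.
func (b *Bulkhead) Do(fn func() error) error {
	select {
	case b.slots <- struct{}{}: // acquire a slot
		defer func() { <-b.slots }() // release it when fn returns
		return fn()
	default:
		return ErrBulkheadFull
	}
}
```

One bulkhead per dependency, wrapped around the breaker-protected call, keeps a single misbehaving service from consuming every worker.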
Monitoring and Observability
- Emit logs and metrics on every state change.
- Track:
- Trip rates
- Operation latency
- Fallback invocations
Use the data to adjust thresholds or investigate upstream health. If your breaker trips too often, either the threshold is too low or the upstream service is genuinely broken.
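As a rough sketch of what "emit logs and metrics on every state change" could look like, the hypothetical `BreakerMetrics` type below logs each transition and counts trips and fallbacks in memory. gobreaker's Settings has an OnStateChange hook that receives the breaker name plus the old and new state; this sketch takes the states as strings and assumes the value "open" marks a trip, so adapt the comparison to whatever your library reports. In a real service the counters would likely feed your metrics system instead.

```go
package main

import (
	"log"
	"sync"
)

// BreakerMetrics is an illustrative in-memory recorder, not a library type.
type BreakerMetrics struct {
	mu        sync.Mutex
	trips     map[string]int // transitions into the open state, per breaker
	fallbacks map[string]int // fallback invocations, per breaker
}

func NewBreakerMetrics() *BreakerMetrics {
	return &BreakerMetrics{trips: make(map[string]int), fallbacks: make(map[string]int)}
}

// OnStateChange logs every transition and counts trips.
// Assumption: the open state is reported as the string "open".
func (m *BreakerMetrics) OnStateChange(name, from, to string) {
	log.Printf("circuit breaker %q: %s -> %s", name, from, to)
	if to == "open" {
		m.mu.Lock()
		m.trips[name]++
		m.mu.Unlock()
	}
}

// RecordFallback counts one fallback served for the named breaker.
func (m *BreakerMetrics) RecordFallback(name string) {
	m.mu.Lock()
	m.fallbacks[name]++
	m.mu.Unlock()
}

// Trips returns how many times the named breaker has opened.
func (m *BreakerMetrics) Trips(name string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.trips[name]
}

// Fallbacks returns how many fallbacks were served for the named breaker.
func (m *BreakerMetrics) Fallbacks(name string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.fallbacks[name]
}
```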
Common Pitfalls
- Ignoring half‑open failures – Treat a failed probe like any other failure and send the breaker back to Open; otherwise it can close while the dependency is still unhealthy.
- Coupling retries with the breaker’s failure count – This can cause the breaker to open on retry‑induced errors.
- Setting timeouts too short – May classify healthy latency spikes as failures.
- Not resetting statistics – Without an appropriate Interval, old failures linger in the counts and can trip the breaker long after the dependency has recovered.
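One way to avoid the retry pitfall above is to retry only transient errors and to run the retry loop inside the function handed to the breaker, so the breaker records a single outcome per logical call. The sketch below shows the retry half of that idea. `StatusError`, `isTransient`, and `retryTransient` are hypothetical helpers, and treating only 429 and 503 as transient is an assumption to tune per dependency.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// StatusError carries an upstream HTTP status code (hypothetical type).
type StatusError struct {
	Code int
}

func (e *StatusError) Error() string {
	return fmt.Sprintf("upstream status %d", e.Code)
}

// isTransient reports whether err is worth retrying.
// Assumption: only 429 and 503 count as transient here.
func isTransient(err error) bool {
	var se *StatusError
	if errors.As(err, &se) {
		return se.Code == http.StatusTooManyRequests || se.Code == http.StatusServiceUnavailable
	}
	return false
}

// retryTransient runs op up to attempts times, doubling the backoff between
// tries. It stops at the first success or the first non-transient error.
func retryTransient(attempts int, backoff time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = op()
		if err == nil || !isTransient(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return err
}
```

Calling `retryTransient` from inside the breaker-protected function means three quick 503s followed by a success count as one success, not three failures. If the final attempt still fails, the breaker sees exactly one failure.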