Building a Resilience Engine in Python: Internals of LimitPal (Part 2)
Source: Dev.to
Overview
The executor pipeline, clock abstraction, and circuit‑breaker architecture are the core of LimitPal. The design follows a single execution pipeline where every call flows through the same stages in a fixed order:
- Circuit breaker
- Rate limiter
- Retry loop
- Result recording
This ordering is intentional: it fails fast when an upstream service is down, prevents breaker failures from consuming quota, ensures retries respect rate limits, and keeps burst behavior predictable.
Execution Pipeline
- Fail fast: The circuit breaker runs first. If the upstream service is unavailable, the call is rejected immediately, protecting the caller.
- Rate limiting: Only after the breaker permits execution does the rate limiter apply, guaranteeing that retries stay within the allotted quota.
- Retry inside the limiter window: Retries occur within the rate‑limiting window, treating each retry as a budgeted operation. This keeps the system stable under stress.
- Result recording: Finally, the outcome is recorded for observability and feedback to the breaker.
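As a minimal sketch, the fixed ordering above might look like the following. All names here (`CircuitOpenError`, `breaker.allow()`, `retry_policy`, `run_pipeline`) are illustrative assumptions, not LimitPal's actual API:

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without executing it."""

def run_pipeline(breaker, limiter, retry_policy, fn):
    # 1. Circuit breaker runs first: fail fast before spending any quota.
    if not breaker.allow():
        raise CircuitOpenError("upstream unavailable")
    last_exc = None
    for attempt in range(retry_policy.max_attempts):
        # 2./3. Each retry attempt acquires from the limiter, so retries
        # are budgeted operations inside the rate-limiting window.
        limiter.acquire()
        try:
            result = fn()
        except Exception as exc:
            last_exc = exc
            # 4. Result recording feeds failures back to the breaker.
            breaker.record_failure()
            continue
        breaker.record_success()
        return result
    raise last_exc
```

Note that the breaker check happens once, before any quota is consumed, while recording happens per attempt, which is exactly the feedback loop the ordering is designed to create.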
Why a Single Pipeline?
Individual decorators often own their own time model, retry logic, and failure semantics. Stacking them leads to emergent, unintended behavior. The executor enforces:
- A shared clock
- A shared failure model
- A shared execution lifecycle
This makes the system predictable and easier to reason about.
Clock Abstraction
Time is the hardest dependency in resilience systems. LimitPal replaces direct calls to time.time() with a pluggable clock interface:
from typing import Protocol

class Clock(Protocol):
    def now(self) -> float: ...
    def sleep(self, seconds: float) -> None: ...
    async def sleep_async(self, seconds: float) -> None: ...
All components use this clock, which is based on monotonic time to avoid issues with system clock jumps, NTP adjustments, or container migrations.
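A minimal production implementation of that protocol, assuming the interface above (the class name `MonotonicClock` is illustrative, not necessarily LimitPal's), could be:

```python
import asyncio
import time

class MonotonicClock:
    """Clock backed by time.monotonic(), which is immune to system
    clock jumps, NTP adjustments, and similar wall-clock surprises."""

    def now(self) -> float:
        return time.monotonic()

    def sleep(self, seconds: float) -> None:
        time.sleep(seconds)

    async def sleep_async(self, seconds: float) -> None:
        await asyncio.sleep(seconds)
```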
Testing Benefits
clock.advance(5.0) # fast‑forward 5 seconds without real waiting
Tests become deterministic and fast, allowing simulation of minutes of retry behavior instantly.
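A test double supporting that `advance()` call can be a few lines; this is a hypothetical sketch, and LimitPal's actual fake may differ:

```python
class FakeClock:
    """Virtual clock for tests: time only moves when told to."""

    def __init__(self, start: float = 0.0) -> None:
        self._now = start

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        # Fast-forward virtual time; nothing actually waits.
        self._now += seconds

    def sleep(self, seconds: float) -> None:
        # A "sleep" in tests advances virtual time instantly.
        self._now += seconds

    async def sleep_async(self, seconds: float) -> None:
        self._now += seconds
```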
Circuit Breaker Architecture
The breaker is a state machine:
CLOSED → OPEN → HALF_OPEN → CLOSED
Normal Operation (CLOSED)
- Failures increment a counter.
- When the failure threshold is reached, the breaker transitions to OPEN.
OPEN State
- All calls fail immediately—no retries—providing fast rejection.
HALF_OPEN State
- After a recovery timeout, a limited number of probe calls are allowed.
- If they succeed, the breaker returns to CLOSED; otherwise, it goes back to OPEN.
This discipline prevents retry storms after recovery and acts as a stability regulator.
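The state machine above can be sketched as follows. Threshold values, the probe limit, and all names are assumptions for illustration, not LimitPal's internals:

```python
CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

class CircuitBreaker:
    def __init__(self, clock, failure_threshold: int = 5,
                 recovery_timeout: float = 30.0, probe_limit: int = 1):
        self.clock = clock
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.probe_limit = probe_limit
        self.state = CLOSED
        self.failures = 0
        self.opened_at = 0.0
        self.probes = 0

    def allow(self) -> bool:
        if self.state == OPEN:
            if self.clock.now() - self.opened_at >= self.recovery_timeout:
                # Recovery timeout elapsed: admit a limited number of probes.
                self.state = HALF_OPEN
                self.probes = 0
            else:
                return False  # fast rejection, no retries
        if self.state == HALF_OPEN:
            if self.probes >= self.probe_limit:
                return False
            self.probes += 1
        return True

    def record_success(self) -> None:
        if self.state == HALF_OPEN:
            self.state = CLOSED  # probe succeeded: recover
        self.failures = 0

    def record_failure(self) -> None:
        if self.state == HALF_OPEN:
            self._trip()  # probe failed: back to OPEN
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = OPEN
        self.opened_at = self.clock.now()
        self.failures = 0
```

Because the breaker takes a clock rather than calling the system clock directly, the OPEN-to-HALF_OPEN transition can be tested by advancing a fake clock instead of sleeping.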
Jitter for Exponential Backoff
Without jitter, thousands of clients retrying simultaneously can cause synchronized spikes that overwhelm the service. Adding a small random offset spreads retries over time:
- Without jitter: all retries at t = 1 s
- With jitter: retries in [0.9 s, 1.1 s]
The randomness yields a large stability gain.
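A sketch of exponential backoff with a ±10 % jitter band, as in the example above; the base, cap, and function name are illustrative assumptions:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0,
                        cap: float = 30.0, jitter: float = 0.1) -> float:
    # Exponential delay: base * 2^attempt, capped to avoid unbounded waits.
    delay = min(cap, base * (2 ** attempt))
    # Spread retries uniformly over [delay*(1-jitter), delay*(1+jitter)]
    # so clients desynchronize instead of spiking together.
    return delay * random.uniform(1 - jitter, 1 + jitter)
```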
Rate Limiting
Limiters operate per key (e.g., user:123, tenant:acme, ip:10.0.0.1). Each key gets its own bucket, preventing a single bad actor from exhausting the global quota.
Internally this requires:
- Dynamic bucket allocation
- TTL eviction
- Bounded memory usage
- Optional LRU trimming
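Putting the first three requirements together, a hypothetical per-key token-bucket registry with TTL eviction might look like this (all names and parameters are assumptions, not LimitPal's API):

```python
import time

class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float, now: float):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.updated = now
        self.last_seen = now

    def try_acquire(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class KeyedLimiter:
    def __init__(self, capacity: float = 10, refill_rate: float = 1.0,
                 ttl: float = 300.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.ttl = ttl
        self.clock = clock
        self.buckets = {}

    def allow(self, key: str) -> bool:
        now = self.clock()
        self._evict(now)
        bucket = self.buckets.get(key)
        if bucket is None:
            # Dynamic allocation: each key gets its own bucket on first use.
            bucket = self.buckets[key] = TokenBucket(
                self.capacity, self.refill_rate, now)
        return bucket.try_acquire(now)

    def _evict(self, now: float) -> None:
        # TTL eviction bounds memory use when keys go idle.
        expired = [k for k, b in self.buckets.items()
                   if now - b.last_seen > self.ttl]
        for k in expired:
            del self.buckets[k]
```

Because each key owns its own bucket, `user:123` exhausting its quota has no effect on `tenant:acme`.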
Sync and Async Parity
LimitPal provides a unified API for both synchronous and asynchronous execution:
executor.run(...) # sync
await executor.run(...) # async
There are no hidden behavioral differences, allowing the same mental model across background workers, HTTP servers, and CLI tools.
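One way to keep a single entry point for both worlds is to dispatch on whether the callable is a coroutine function; this is a sketch of the general technique, not necessarily how LimitPal implements it:

```python
import inspect

class Executor:
    def run(self, fn, *args, **kwargs):
        # Same entry point for both worlds: a coroutine function yields
        # an awaitable, a plain function runs synchronously.
        if inspect.iscoroutinefunction(fn):
            return self._run_async(fn, *args, **kwargs)
        return fn(*args, **kwargs)

    async def _run_async(self, fn, *args, **kwargs):
        return await fn(*args, **kwargs)
```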
Planned Work
- Observability hooks
- Adaptive rate limiting
- Redis backend support
- Bulkhead pattern implementation
- Integrations with popular frameworks
Resilience does not end at execution: distributed systems will fail, and the way they fail has to be engineered deliberately.
References
- Documentation:
- Source code:
Feedback is welcome, especially from those interested in deep infrastructure tools.