[Paper] Designing Scalable Rate Limiting Systems: Algorithms, Architecture, and Distributed Solutions
Source: arXiv - 2602.11741v1
Overview
Bo Guan’s paper tackles a problem that every modern API provider wrestles with: how to enforce rate limits that stay accurate, highly available, and horizontally scalable in a distributed environment. By marrying a classic sliding‑window algorithm with Redis’s sorted‑set data structure and Lua scripting, the author delivers a production‑ready design that can be dropped into large‑scale services without sacrificing latency or consistency.
Key Contributions
- Concrete architecture for a distributed rate‑limiting service built on Redis (single‑node and Redis Cluster).
- Quantitative analysis of the trade‑offs among three popular algorithms—Fixed Window, Token Bucket, and Rolling (Sliding) Window—focusing on accuracy vs. memory footprint.
- Lua‑scripted atomic operations that combine cleanup, counting, and insertion, eliminating race conditions in high‑concurrency scenarios.
- Three‑layer rule‑management model that separates rule storage, rule compilation (script hashing), and enforcement, enabling hot‑swap of rate‑limit policies without service restarts.
- CAP‑theoretic justification for choosing Availability + Partition‑tolerance (AP) over strict consistency, with a discussion of how eventual consistency is acceptable for rate limiting.
Methodology
- Algorithm selection – The author implements a Rolling Window algorithm because it balances precision (no burst‑spike “leakage” at window boundaries) with modest per‑client state.
- Data structure choice – Redis Sorted Sets (ZSETs) store timestamps as scores and request identifiers as members. Insertion and counting run in O(log N), and range deletion in O(log N + M) for M expired entries, where N is the number of recent requests for a key.
- Atomic enforcement – A Lua script runs server‑side, performing three steps in one transaction:
- Remove entries older than the window (cleanup).
- Count remaining entries (current usage).
- Insert the new request if the limit isn’t exceeded.
This guarantees that concurrent requests can’t interleave and cause over‑counting.
- Rule management – Rate‑limit policies are stored in a separate Redis hash. When a rule changes, the system re‑hashes the rule parameters and loads a new Lua script version, leaving the existing cached script untouched.
- Scalability testing – The design is deployed on a Redis Cluster with sharding and replica nodes. Experiments measure latency, memory usage, and error rates under varying request loads and cluster sizes.
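The cleanup → count → insert sequence above can be sketched as an in‑process Python class (a minimal illustration, not the paper's implementation: it mimics the ZSET‑plus‑Lua logic with a plain sorted list and per‑key state, and all names are hypothetical; in the real design these three steps run server‑side in one Lua script so no client can observe an intermediate state):

```python
import time
from collections import defaultdict
from typing import Optional


class SlidingWindowLimiter:
    """In-process sketch of the ZSET-based rolling window.

    Mirrors the three steps of the server-side Lua script:
    cleanup -> count -> conditional insert, executed as one unit.
    """

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        # key -> list of request timestamps, kept in ascending order
        # (the role the ZSET scores play in the Redis design).
        self._timestamps = defaultdict(list)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        ts = self._timestamps[key]
        # Step 1: drop entries older than the window (ZREMRANGEBYSCORE).
        cutoff = now - self.window
        while ts and ts[0] <= cutoff:
            ts.pop(0)
        # Step 2: count remaining entries (ZCARD).
        if len(ts) >= self.limit:
            return False
        # Step 3: record the new request (ZADD).
        ts.append(now)
        return True
```

A quick walk‑through: with a limit of 3 per second, the fourth request inside one second is rejected, but once the oldest timestamps fall out of the trailing window, capacity frees up again; because the window slides continuously, there is no boundary at which a burst can double the effective rate.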
Results & Findings
| Metric | Fixed Window | Token Bucket | Rolling Window (proposed) |
|---|---|---|---|
| Max error (requests over limit) | Up to 100 % burst | Up to 50 % burst | < 1 % (practically exact) |
| Memory per client (bytes) | ~8 | ~12 | ~16 (due to timestamp storage) |
| Operation latency (µs) | 30‑40 | 35‑45 | 45‑55 (still sub‑millisecond) |
| Scalability on 6‑node cluster | Linear up to 10 k QPS | Linear up to 12 k QPS | Linear up to 15 k QPS |
Key takeaways: the Rolling Window delivers near‑perfect accuracy with only a modest increase in memory and latency, and it scales cleanly across Redis shards. The Lua‑scripted atomic path eliminates race conditions that plagued earlier implementations.
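The “up to 100 % burst” figure for Fixed Window comes from boundary leakage: a client can spend its full quota at the very end of one window and again at the start of the next, doubling the effective rate across the boundary. A toy simulation (hypothetical parameters, not from the paper's benchmarks) makes this concrete:

```python
def fixed_window_burst(limit: int, window: float, request_times: list) -> int:
    """Count how many requests a fixed-window limiter admits."""
    counts = {}  # window index -> requests admitted in that window
    admitted = 0
    for t in request_times:
        bucket = int(t // window)
        if counts.get(bucket, 0) < limit:
            counts[bucket] = counts.get(bucket, 0) + 1
            admitted += 1
    return admitted


# 10 requests just before the 1 s boundary and 10 just after: all 20 pass a
# 10-per-second limit, i.e. 100 % over the intended rate within that span.
times = [0.95 + i * 0.001 for i in range(10)] + [1.0 + i * 0.001 for i in range(10)]
```

Running `fixed_window_burst(10, 1.0, times)` admits all 20 requests even though they land within roughly 60 ms of each other; the rolling window in the table above would cap the same span at 10.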
Practical Implications
- API Gateways & Edge Services – Plug the Lua script into existing Redis‑backed gateways (e.g., Kong, Envoy) to get precise per‑client throttling without redesigning the request path.
- Microservice Meshes – Because the solution is AP‑oriented, it tolerates network partitions; services continue to enforce limits locally, and eventual convergence prevents “soft‑locks.”
- Cost‑Effective Scaling – Using Redis’s native sharding means you can add nodes to handle higher QPS without rewriting business logic; the same Lua script works across the cluster.
- Dynamic Policy Updates – Ops teams can adjust limits on‑the‑fly (e.g., during a product launch) by updating the rule hash; the system automatically picks up the new script version, avoiding downtime.
- Observability – The design naturally exposes counters (current window size) that can be scraped for dashboards, enabling real‑time alerting on abuse spikes.
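The on‑the‑fly policy update described above can be sketched as a rule store whose entries are versioned by a hash of their parameters (names and structure here are hypothetical; in the paper's design the hash keys a cached Lua script loaded into Redis, not a Python dict):

```python
import hashlib
import json


class RuleStore:
    """Sketch of the three-layer rule model: storage -> compilation -> enforcement."""

    def __init__(self):
        self._rules = {}     # rule name -> parameter dict (storage layer)
        self._compiled = {}  # parameter hash -> compiled rule (compilation layer)

    @staticmethod
    def _hash(params: dict) -> str:
        # Deterministic digest of the parameters, standing in for the SHA
        # that Redis SCRIPT LOAD returns for a cached Lua script.
        return hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()

    def set_rule(self, name: str, limit: int, window: float) -> str:
        params = {"limit": limit, "window": window}
        digest = self._hash(params)
        self._rules[name] = params
        # Compile once per distinct parameter set; earlier versions stay cached,
        # so in-flight requests using the old script are unaffected.
        self._compiled.setdefault(digest, params)
        return digest

    def active_params(self, name: str) -> dict:
        # Enforcement layer resolves the current rule on each request, so an
        # updated rule takes effect on the next request without a restart.
        return self._compiled[self._hash(self._rules[name])]
```

Updating a rule simply writes a new parameter set and returns a new version hash; enforcement follows the latest hash, which is what makes hot‑swapping limits during a product launch a pure data change rather than a deployment.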
Limitations & Future Work
- Eventual Consistency – In rare partition scenarios, a client might briefly exceed its quota before replicas reconcile; the paper notes this as an acceptable risk but suggests tighter sync for high‑value APIs.
- Memory Growth for Hot Keys – Extremely bursty clients can inflate the sorted set size; adaptive bucketization or hybrid token‑bucket fallback could mitigate this.
- Multi‑dimensional Limits – Current design handles a single dimension (e.g., requests per second). Extending to composite limits (IP + user + endpoint) would require additional indexing strategies.
- Benchmark Diversity – Tests focus on synthetic workloads; future work could evaluate real‑world traffic patterns, including long‑tail distributions and mixed read/write workloads.
Bottom line: Guan’s architecture demonstrates that with the right combination of data structures, scripting, and cluster design, you can build a rate limiter that is precise, low‑latency, and ready for the cloud‑native scale that modern developers demand.
Authors
- Bo Guan
Paper Information
- arXiv ID: 2602.11741v1
- Categories: cs.DC, cs.DB, cs.PF, cs.SE
- Published: February 12, 2026