[Paper] Designing Scalable Rate Limiting Systems: Algorithms, Architecture, and Distributed Solutions
Source: arXiv - 2602.11741v1
Overview
Bo Guan’s paper tackles a problem that every modern API provider wrestles with: how to enforce rate limits that stay accurate, highly available, and horizontally scalable in a distributed environment. By marrying a classic sliding‑window algorithm with Redis’s sorted‑set data structure and Lua scripting, the author delivers a production‑ready design that can be dropped into large‑scale services without sacrificing latency or consistency.
Key Contributions
- Concrete architecture for a distributed rate‑limiting service built on Redis (single‑node and Redis Cluster).
- Quantitative analysis of the trade‑offs among three popular algorithms—Fixed Window, Token Bucket, and Rolling (Sliding) Window—focusing on accuracy vs. memory footprint.
- Lua‑scripted atomic operations that combine cleanup, counting, and insertion, eliminating race conditions in high‑concurrency scenarios.
- Three‑layer rule‑management model that separates rule storage, rule compilation (script hashing), and enforcement, enabling hot‑swap of rate‑limit policies without service restarts.
- CAP‑theoretic justification for choosing Availability + Partition‑tolerance (AP) over strict consistency, with a discussion of how eventual consistency is acceptable for rate limiting.
Methodology
- Algorithm selection – The author implements a Rolling Window algorithm because it balances precision (no burst‑spike “leakage” at window boundaries) with modest per‑client state.
- Data structure choice – Redis Sorted Sets (ZSETs) store timestamps as scores and request identifiers as members. Insertion and counting run in O(log N), and range deletion in O(log N + M) for M expired entries, where N is the number of recent requests for a key.
- Atomic enforcement – A Lua script runs server‑side, performing three steps in one transaction:
- Remove entries older than the window (cleanup).
- Count remaining entries (current usage).
- Insert the new request if the limit isn’t exceeded.
This guarantees that concurrent requests can’t interleave and cause over‑counting.
- Rule management – Rate‑limit policies are stored in a separate Redis hash. When a rule changes, the system re‑hashes the rule parameters and loads a new Lua script version, leaving the existing cached script untouched.
- Scalability testing – The design is deployed on a Redis Cluster with sharding and replica nodes. Experiments measure latency, memory usage, and error rates under varying request loads and cluster sizes.
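The cleanup → count → insert sequence above can be sketched as an in‑process Python class (a minimal illustration, not the paper's implementation: it mimics the ZSET‑plus‑Lua logic with a plain sorted list and per‑key state, and all names are hypothetical; in the real design these three steps run server‑side in one Lua script so no client can observe an intermediate state):

```python
import time
from collections import defaultdict
from typing import Optional


class SlidingWindowLimiter:
    """In-process sketch of the ZSET-based rolling window.

    Mirrors the three steps of the server-side Lua script:
    cleanup -> count -> conditional insert, executed as one unit.
    """

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        # key -> list of request timestamps, kept in ascending order
        # (the role the ZSET scores play in the Redis design).
        self._timestamps = defaultdict(list)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        ts = self._timestamps[key]
        # Step 1: drop entries older than the window (ZREMRANGEBYSCORE).
        cutoff = now - self.window
        while ts and ts[0] <= cutoff:
            ts.pop(0)
        # Step 2: count remaining entries (ZCARD).
        if len(ts) >= self.limit:
            return False
        # Step 3: record the new request (ZADD).
        ts.append(now)
        return True
```

A quick walk‑through: with a limit of 3 per second, the fourth request inside one second is rejected, but once the oldest timestamps fall out of the trailing window, capacity frees up again; because the window slides continuously, there is no boundary at which a burst can double the effective rate.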
Results & Findings
| Metric | Fixed Window | Token Bucket | Rolling Window (proposed) |
|---|---|---|---|
| Max error (requests over limit) | Up to 100 % burst | Up to 50 % burst | < 1 % (practically exact) |
| Memory per client (bytes) | ~8 | ~12 | ~16 (due to timestamp storage) |
| Operation latency (µs) | 30‑40 | 35‑45 | 45‑55 (still sub‑millisecond) |
| Scalability on 6‑node cluster | Linear up to 10 k QPS | Linear up to 12 k QPS | Linear up to 15 k QPS |
Key takeaways: the Rolling Window delivers near‑perfect accuracy with only a modest increase in memory and latency, and it scales cleanly across Redis shards. The Lua‑scripted atomic path eliminates race conditions that plagued earlier implementations.
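The “up to 100 % burst” figure for Fixed Window comes from boundary leakage: a client can spend its full quota at the very end of one window and again at the start of the next, doubling the effective rate across the boundary. A toy simulation (hypothetical parameters, not from the paper's benchmarks) makes this concrete:

```python
def fixed_window_burst(limit: int, window: float, request_times: list) -> int:
    """Count how many requests a fixed-window limiter admits."""
    counts = {}  # window index -> requests admitted in that window
    admitted = 0
    for t in request_times:
        bucket = int(t // window)
        if counts.get(bucket, 0) < limit:
            counts[bucket] = counts.get(bucket, 0) + 1
            admitted += 1
    return admitted


# 10 requests just before the 1 s boundary and 10 just after: all 20 pass a
# 10-per-second limit, i.e. 100 % over the intended rate within that span.
times = [0.95 + i * 0.001 for i in range(10)] + [1.0 + i * 0.001 for i in range(10)]
```

Running `fixed_window_burst(10, 1.0, times)` admits all 20 requests even though they land within roughly 60 ms of each other; the rolling window in the table above would cap the same span at 10.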
Practical Implications
- API Gateways & Edge Services – Plug the Lua script into existing Redis‑backed gateways (e.g., Kong, Envoy) to get precise per‑client throttling without redesigning the request path.
- Microservice Meshes – Because the solution is AP‑oriented, it tolerates network partitions; services continue to enforce limits locally, and eventual convergence prevents “soft‑locks.”
- Cost‑Effective Scaling – Using Redis’s native sharding means you can add nodes to handle higher QPS without rewriting business logic; the same Lua script works across the cluster.
- Dynamic Policy Updates – Ops teams can adjust limits on‑the‑fly (e.g., during a product launch) by updating the rule hash; the system automatically picks up the new script version, avoiding downtime.
- Observability – The design naturally exposes counters (current window size) that can be scraped for dashboards, enabling real‑time alerting on abuse spikes.
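The on‑the‑fly policy update described above can be sketched as a rule store whose entries are versioned by a hash of their parameters (names and structure here are hypothetical; in the paper's design the hash keys a cached Lua script loaded into Redis, not a Python dict):

```python
import hashlib
import json


class RuleStore:
    """Sketch of the three-layer rule model: storage -> compilation -> enforcement."""

    def __init__(self):
        self._rules = {}     # rule name -> parameter dict (storage layer)
        self._compiled = {}  # parameter hash -> compiled rule (compilation layer)

    @staticmethod
    def _hash(params: dict) -> str:
        # Deterministic digest of the parameters, standing in for the SHA
        # that Redis SCRIPT LOAD returns for a cached Lua script.
        return hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()

    def set_rule(self, name: str, limit: int, window: float) -> str:
        params = {"limit": limit, "window": window}
        digest = self._hash(params)
        self._rules[name] = params
        # Compile once per distinct parameter set; earlier versions stay cached,
        # so in-flight requests using the old script are unaffected.
        self._compiled.setdefault(digest, params)
        return digest

    def active_params(self, name: str) -> dict:
        # Enforcement layer resolves the current rule on each request, so an
        # updated rule takes effect on the next request without a restart.
        return self._compiled[self._hash(self._rules[name])]
```

Updating a rule simply writes a new parameter set and returns a new version hash; enforcement follows the latest hash, which is what makes hot‑swapping limits during a product launch a pure data change rather than a deployment.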
Limitations & Future Work
- Eventual Consistency – In rare partition scenarios, a client might briefly exceed its quota before replicas reconcile; the paper notes this as an acceptable risk but suggests tighter sync for high‑value APIs.
- Memory Growth for Hot Keys – Extremely bursty clients can inflate the sorted set size; adaptive bucketization or hybrid token‑bucket fallback could mitigate this.
- Multi‑dimensional Limits – Current design handles a single dimension (e.g., requests per second). Extending to composite limits (IP + user + endpoint) would require additional indexing strategies.
- Benchmark Diversity – Tests focus on synthetic workloads; future work could evaluate real‑world traffic patterns, including long‑tail distributions and mixed read/write workloads.
Bottom line: Guan’s architecture demonstrates that with the right combination of data structures, scripting, and cluster design, you can build a rate limiter that is precise, low‑latency, and ready for the cloud‑native scale that modern developers demand.
Authors
- Bo Guan
Paper Information
- arXiv ID: 2602.11741v1
- Categories: cs.DC, cs.DB, cs.PF, cs.SE
- Published: February 12, 2026