[Paper] LeaseGuard: Raft Leases Done Right
Source: arXiv - 2512.15659v1
Overview
The paper presents LeaseGuard, a new lease‑based protocol that lets Raft leaders serve strongly consistent reads without the costly quorum round‑trip that most Raft deployments currently require. By exploiting properties unique to Raft elections, LeaseGuard serves reads locally, with no network round‑trip, while keeping the system safe during leader changes, a long‑standing pain point for distributed databases.
Key Contributions
- A rigorously specified lease algorithm built on Raft’s election guarantees, formalized in TLA+.
- Two availability‑boosting optimizations:
  - Rapid restoration of write throughput after a leader failover.
  - Near‑instant read availability on a newly elected leader.
- Practical implementation in the LogCabin reference Raft codebase, demonstrating real‑world feasibility.
- Comprehensive evaluation (Python simulation + C++ prototype) showing:
  - Consistent reads drop from one network round‑trip to zero.
  - Write throughput climbs from ~1k to ~10k ops/s.
  - 99% of reads succeed immediately after a leader change.
Methodology
- Problem framing – The authors dissect why existing Raft‑based systems either pay a per‑read quorum cost or use loosely defined leader leases that hurt availability.
- LeaseGuard design – They derive a lease invariant directly from Raft’s election safety property: a leader can safely claim a lease only if it knows that no other node can become leader before the lease expires. This eliminates the need for extra “lease‑grant” messages (see the sketch after this list).
- Optimizations –
  - Write‑throughput boost: after a failover, the new leader preemptively claims a lease using its term number, allowing pending writes to flow without waiting for the old leader’s lease to expire.
  - Read‑availability boost: the new leader immediately serves reads for the majority of keys, deferring only reads of keys that might fall in the “lease‑gap” window.
- Formal verification – The entire protocol is encoded in TLA+ and model‑checked for safety (no stale reads) and liveness (reads eventually succeed).
- Empirical evaluation –
  - A Python event‑driven simulator explores a wide range of failure patterns and network latencies.
  - A production‑grade implementation replaces LogCabin’s default quorum‑read path with LeaseGuard, measuring latency, throughput, and read availability during leader churn.
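To make the lease invariant and the term‑based lease claim concrete, here is a minimal Python sketch of how a leader might track a heartbeat‑derived lease. It illustrates the idea described above and is not the authors’ code: the names (`LeaseState`, `ELECTION_TIMEOUT_MIN`, `MAX_DRIFT`) and the votes‑as‑lease shortcut in `on_election_win` are assumptions.

```python
import time

# Assumed parameters, not taken from the paper: followers wait at least
# ELECTION_TIMEOUT_MIN before starting an election, and relative clock
# rates differ by at most MAX_DRIFT.
ELECTION_TIMEOUT_MIN = 0.150   # seconds
MAX_DRIFT = 0.01               # 1% clock-rate skew bound

class LeaseState:
    """Tracks the window in which a leader may safely serve local reads."""

    def __init__(self):
        self.term = 0
        self.lease_expiry = 0.0   # deadline on the local monotonic clock

    def on_quorum_ack(self, term, heartbeat_sent_at):
        # A quorum acked a heartbeat sent at `heartbeat_sent_at` (a
        # time.monotonic() timestamp). Each acking follower reset its
        # election timer after that instant, so no rival can become
        # leader before heartbeat_sent_at + ELECTION_TIMEOUT_MIN.
        # Discount the window to tolerate clock-rate skew.
        if term != self.term:
            return
        expiry = heartbeat_sent_at + ELECTION_TIMEOUT_MIN * (1 - MAX_DRIFT)
        self.lease_expiry = max(self.lease_expiry, expiry)

    def on_election_win(self, new_term, votes_requested_at):
        # Hypothetical reading of the term-based optimization: the quorum
        # of votes that elected this leader doubles as a lease grant, so
        # the new leader need not wait out the old leader's lease.
        self.term = new_term
        self.on_quorum_ack(new_term, votes_requested_at)

    def can_serve_local_read(self):
        # The lease invariant: serve a read locally only while we know
        # no other node can have become leader.
        return time.monotonic() < self.lease_expiry
```

In this reading, reads never leave the leader while the lease holds, and a failed check simply falls back to the usual quorum‑read path.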
Results & Findings
| Metric | Traditional Raft (quorum reads) | LeaseGuard |
|---|---|---|
| Read latency | 1 network RTT (≈1–10 ms) | 0 RTT (local read) |
| Write throughput | ~1k ops/s (limited by read‑write contention) | ~10k ops/s (≈10× boost) |
| Read success after failover | ~0% until the old lease expires (seconds, under classic lease schemes) | ~99% immediately |
| Safety | Proven by Raft’s original proof | Model‑checked in TLA+ (no stale reads) |
The data shows that LeaseGuard eliminates the read‑side bottleneck without compromising Raft’s strong consistency guarantees. Even under rapid leader failures, the system continues to serve reads almost immediately, a dramatic improvement over the “read‑pause” period of classic lease schemes.
Practical Implications
- Lower latency for read‑heavy workloads – Services like configuration stores, feature‑flag systems, or metadata layers can now serve reads locally on the leader, shaving off network latency entirely.
- Higher overall throughput – By decoupling reads from the quorum path, write pipelines stay saturated, which is especially valuable for micro‑service back‑ends that experience bursty write spikes.
- Simpler deployment – LeaseGuard’s specification is concrete and formally verified, reducing the risk of subtle bugs that plague ad‑hoc lease implementations. Teams can adopt it in existing Raft‑based stacks (e.g., etcd, Consul, LogCabin), though the integration effort depends on the codebase (see Limitations & Future Work).
- Improved availability during failover – Cloud‑native operators often worry about “read‑downtime” when a leader crashes; LeaseGuard keeps the service responsive, easing SLA compliance.
- Foundation for hybrid consistency models – Because reads are now cheap, developers can more easily build read‑optimistic caches or combine strong reads with eventually consistent replicas without a separate read‑path shim.
Limitations & Future Work
- Assumes well‑behaved clocks – LeaseGuard’s safety hinges on monotonic clocks with bounded drift; environments with highly variable clock rates may need additional synchronization (see the sketch after this list).
- Focused on single‑leader Raft – The protocol has not been evaluated in multi‑leader or sharded Raft deployments, which could expose new edge cases.
- Simulation‑heavy validation – While the LogCabin prototype shows promising numbers, larger‑scale production experiments (e.g., in geo‑distributed clusters) are needed to confirm scalability.
- Potential integration overhead – Existing Raft libraries may require non‑trivial refactoring to expose the term‑based lease hooks used by LeaseGuard.
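To see why the drift bound matters, consider the arithmetic behind a safe lease window. This is an illustrative calculation under assumed parameters, not a formula from the paper:

```python
def safe_lease_window(election_timeout_min: float, max_drift: float) -> float:
    """Largest local-clock window in which no rival can be elected.

    Illustrative only: if a remote election timer may run up to
    max_drift faster than the leader's clock, the leader must shrink
    its own window by the same factor. Without any drift bound,
    no lease duration is ever safe.
    """
    if max_drift >= 1.0:
        return 0.0
    return election_timeout_min * (1 - max_drift)

# A 150 ms election timeout with a 1% drift bound leaves a 148.5 ms
# safe window; as the drift bound grows, the window shrinks toward zero.
print(safe_lease_window(0.150, 0.01))   # 0.1485
```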
Future research directions include extending LeaseGuard to work with Raft variants that support joint consensus, exploring adaptive lease durations based on observed network latency, and integrating the protocol into widely‑used open‑source Raft implementations (etcd, Consul) for broader community validation.
Authors
- A. Jesse Jiryu Davis
- Murat Demirbas
- Lingzhi Deng
Paper Information
- arXiv ID: 2512.15659v1
- Categories: cs.DC
- Published: December 17, 2025