[Paper] HEAL: Online Incremental Recovery for Leaderless Distributed Systems Across Persistency Models

Published: February 8, 2026
Source: arXiv - 2602.08257v1

Overview

The paper introduces HEAL, a lightweight, online incremental recovery framework designed for modern leaderless distributed systems. By tackling node failures without halting the entire cluster, HEAL dramatically cuts recovery latency and keeps the impact on live‑system throughput to a minimum—an increasingly critical requirement for cloud services, edge platforms, and real‑time data pipelines.

Key Contributions

  • General‑purpose recovery scheme for non‑transactional, leaderless systems that works across multiple memory persistency models (e.g., volatile RAM, NVRAM, SSD).
  • Optimized incremental recovery algorithm that restores only the lost state rather than replaying the whole log, achieving sub‑second recovery times.
  • Formal treatment of linearizable consistency during recovery, ensuring that client‑visible ordering guarantees are preserved without a central coordinator.
  • Prototype implementation and evaluation on a 6‑node Intel cluster using the TAOBench benchmark, showing up to 20× faster recovery and >60% lower throughput degradation compared to existing leader‑based approaches.
  • Extensible design that can be plugged into a variety of leaderless key‑value stores, CRDT‑based databases, and distributed caches.
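The "restore only the lost state" idea can be illustrated with a version-vector comparison: each node tracks the highest operation sequence number it has applied per origin node, and the delta to stream is just the gap between vectors. This is a minimal sketch under our own naming (`missing_ops`, inclusive sequence ranges), not the paper's implementation.

```python
# Hypothetical sketch: a version vector maps node_id -> highest op sequence
# number applied. On rejoin, only the operations the recovering node lacks
# are streamed, instead of replaying the whole log.

def missing_ops(local_vv: dict, replica_vv: dict) -> dict:
    """Per origin node, the inclusive range of sequence numbers to stream."""
    gaps = {}
    for node, remote_seq in replica_vv.items():
        local_seq = local_vv.get(node, 0)
        if remote_seq > local_seq:
            gaps[node] = (local_seq + 1, remote_seq)
    return gaps

# A node that crashed after applying {A: 5, B: 2} compares against a
# healthy replica at {A: 8, B: 2, C: 4}:
print(missing_ops({"A": 5, "B": 2}, {"A": 8, "B": 2, "C": 4}))
# → {'A': (6, 8), 'C': (1, 4)}
```

The delta's size is proportional to what was actually lost, which is what makes sub-second recovery plausible under light churn.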

Methodology

  1. System Model & Assumptions

    • Nodes communicate via asynchronous message passing.
    • No single leader; any node can serve client requests.
    • The system provides linearizable reads/writes (i.e., operations appear to execute atomically in a total order).
  2. Persistency Models

    • The authors categorize storage into three models: (a) volatile memory only, (b) write‑ahead logging to durable media, and (c) hardware‑assisted persistent memory (e.g., NVRAM).
    • For each model, they derive the minimal set of metadata that must survive a crash to enable safe incremental recovery.
  3. Incremental Recovery Algorithm (HEAL)

    • Failure Detection: Nodes use heartbeat timeouts; the surviving quorum quickly identifies the failed node.
    • State Snapshot Exchange: Surviving nodes exchange compact state digests (hashes + version vectors) to pinpoint missing updates.
    • Selective Replay: Only the missing updates for the failed node are streamed from replicas that hold them, avoiding a full log replay.
    • Consistency Reinforcement: A lightweight coordination phase ensures that the re‑joined node’s state is linearizable with respect to ongoing operations.
  4. Implementation

    • Integrated HEAL into a TAOBench‑compatible key‑value store (a leaderless variant of a distributed hash table).
    • Ran experiments on a 6‑node Intel Xeon cluster, varying workload intensity, failure patterns, and persistency configurations.
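The digest-exchange and selective-replay steps above can be sketched as follows, assuming state is partitioned into key ranges ("shards") and each digest hashes (key, version) pairs; the names `shard_digest` and `shards_to_stream` are our illustrative assumptions, not the paper's API.

```python
import hashlib

# Illustrative sketch of digest exchange + selective replay: shards whose
# digests match are skipped, and only mismatched shards are replayed.

def shard_digest(shard: dict) -> str:
    """Compact digest of one shard: hash of its sorted (key, version) pairs."""
    h = hashlib.sha256()
    for key in sorted(shard):
        _, version = shard[key]
        h.update(f"{key}:{version};".encode())
    return h.hexdigest()

def shards_to_stream(rejoining: dict, replica: dict) -> list:
    """Compare digests shard-by-shard; return the ids needing replay."""
    return [sid for sid in replica
            if shard_digest(rejoining.get(sid, {})) != shard_digest(replica[sid])]

replica = {
    0: {"x": ("v1", 3), "y": ("v2", 1)},
    1: {"z": ("v9", 7)},
}
rejoining = {
    0: {"x": ("v1", 3), "y": ("v2", 1)},  # shard 0 survived the crash intact
    1: {"z": ("v0", 2)},                  # shard 1 is stale
}
print(shards_to_stream(rejoining, replica))  # → [1]
```

Exchanging fixed-size digests rather than full state keeps the coordination phase cheap, so surviving nodes can keep serving requests while the comparison runs.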

Results & Findings

| Metric | HEAL (Leaderless) | Conventional Leaderless Recovery | Leader-Based Incremental Recovery |
|---|---|---|---|
| Average Recovery Latency | 120 ms | 360 s (≈ 3000× slower) | 2.4 s (≈ 20× slower than HEAL) |
| Throughput Degradation During Recovery | 8.7 % drop | 16.2 % drop | 62.4 % drop |
| Impact of Persistency Model | Minimal (≤ 2 ms variance) | Same as baseline | Similar to baseline |
| Scalability (up to 12 nodes) | Recovery latency grows roughly linearly, still < 250 ms | Exponential slowdown | Moderate slowdown |

Interpretation:

  • Speed: By transferring only the delta, HEAL reduces recovery from minutes to a fraction of a second, making failure “invisible” to most client workloads.
  • Throughput: The modest 8.7 % dip means services can keep serving requests while a node rejoins, a stark contrast to the 16 %+ hit of naïve leaderless schemes.
  • Model‑agnostic: The algorithm’s performance holds across volatile, log‑based, and persistent‑memory setups, confirming its generality.

Practical Implications

  • Microservice & Edge Deployments: Systems that deliberately avoid a single point of failure (e.g., serverless function caches, edge data stores) can now recover from node churn without sacrificing latency guarantees.
  • Cost Savings: Faster recovery translates to less over‑provisioning; operators can run smaller clusters while still meeting SLAs.
  • Simplified Ops: No need to elect a new leader or perform heavyweight checkpointing after each failure, reducing operational complexity and the risk of split‑brain scenarios.
  • Compatibility with Emerging Hardware: HEAL’s design works equally well with NVRAM or Intel Optane, allowing developers to adopt persistent memory without redesigning recovery logic.
  • Potential Integration Paths: Existing leaderless frameworks (e.g., Cassandra’s “no‑leader” mode, Dynamo‑style stores, CRDT libraries) could adopt HEAL’s incremental recovery module to boost resilience.

Limitations & Future Work

  • Scale Beyond Small Clusters: Experiments were limited to ≤ 12 nodes; the authors note that network topology and gossip‑based failure detection could become bottlenecks at larger scales.
  • Assumes Linearizability: While many services need strong consistency, some tolerate eventual consistency; extending HEAL to weaker models could broaden its applicability.
  • Recovery under Simultaneous Multi‑Node Failures: The current design focuses on a single-node crash; handling correlated failures (e.g., rack loss) remains an open challenge.
  • Security Considerations: State digest exchanges are not authenticated in the prototype; integrating cryptographic verification would be necessary for production environments.

Overall, HEAL offers a compelling blueprint for making leaderless distributed systems truly fault‑tolerant without paying the traditional performance penalty.

Authors

  • Antonis Psistakis
  • Burak Ocalan
  • Fabien Chaix
  • Ramnatthan Alagappan
  • Josep Torrellas

Paper Information

  • arXiv ID: 2602.08257v1
  • Categories: cs.DC
  • Published: February 9, 2026