[Paper] HEAL: Online Incremental Recovery for Leaderless Distributed Systems Across Persistency Models

Published: February 8, 2026
Source: arXiv - 2602.08257v1

Overview

The paper introduces HEAL, a lightweight, online incremental recovery framework designed for modern leaderless distributed systems. By tackling node failures without halting the entire cluster, HEAL dramatically cuts recovery latency and keeps the impact on live‑system throughput to a minimum—an increasingly critical requirement for cloud services, edge platforms, and real‑time data pipelines.

Key Contributions

  • General‑purpose recovery scheme for non‑transactional, leaderless systems that works across multiple memory persistency models (e.g., volatile RAM, NVRAM, SSD).
  • Optimized incremental recovery algorithm that restores only the lost state rather than replaying the whole log, achieving sub‑second recovery times.
  • Formal treatment of linearizable consistency during recovery, ensuring that client‑visible ordering guarantees are preserved without a central coordinator.
  • Prototype implementation and evaluation on a 6‑node Intel cluster using the TAOBench benchmark, showing up to 20× faster recovery and >60% lower throughput degradation compared to existing leader‑based approaches.
  • Extensible design that can be plugged into a variety of leaderless key‑value stores, CRDT‑based databases, and distributed caches.
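The "restore only the lost state" idea can be illustrated with a version-vector comparison: each node tracks the highest operation sequence number it has applied per origin node, and the delta to stream is just the gap between vectors. This is a minimal sketch under our own naming (`missing_ops`, inclusive sequence ranges), not the paper's implementation.

```python
# Hypothetical sketch: a version vector maps node_id -> highest op sequence
# number applied. On rejoin, only the operations the recovering node lacks
# are streamed, instead of replaying the whole log.

def missing_ops(local_vv: dict, replica_vv: dict) -> dict:
    """Per origin node, the inclusive range of sequence numbers to stream."""
    gaps = {}
    for node, remote_seq in replica_vv.items():
        local_seq = local_vv.get(node, 0)
        if remote_seq > local_seq:
            gaps[node] = (local_seq + 1, remote_seq)
    return gaps

# A node that crashed after applying {A: 5, B: 2} compares against a
# healthy replica at {A: 8, B: 2, C: 4}:
print(missing_ops({"A": 5, "B": 2}, {"A": 8, "B": 2, "C": 4}))
# → {'A': (6, 8), 'C': (1, 4)}
```

The delta's size is proportional to what was actually lost, which is what makes sub-second recovery plausible under light churn.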

Methodology

  1. System Model & Assumptions

    • Nodes communicate via asynchronous message passing.
    • No single leader; any node can serve client requests.
    • The system provides linearizable reads/writes (i.e., operations appear to execute atomically in a total order).
  2. Persistency Models

    • The authors categorize storage into three models: (a) volatile memory only, (b) write‑ahead logging to durable media, and (c) hardware‑assisted persistent memory (e.g., NVRAM).
    • For each model, they derive the minimal set of metadata that must survive a crash to enable safe incremental recovery.
  3. Incremental Recovery Algorithm (HEAL)

    • Failure Detection: Nodes use heartbeat timeouts; the surviving quorum quickly identifies the failed node.
    • State Snapshot Exchange: Surviving nodes exchange compact state digests (hashes + version vectors) to pinpoint missing updates.
    • Selective Replay: Only the missing updates for the failed node are streamed from replicas that hold them, avoiding a full log replay.
    • Consistency Reinforcement: A lightweight coordination phase ensures that the re‑joined node’s state is linearizable with respect to ongoing operations.
  4. Implementation

    • Integrated HEAL into a TAOBench‑compatible key‑value store (a leaderless variant of a distributed hash table).
    • Ran experiments on a 6‑node Intel Xeon cluster, varying workload intensity, failure patterns, and persistency configurations.
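The digest-exchange and selective-replay steps above can be sketched as follows, assuming state is partitioned into key ranges ("shards") and each digest hashes (key, version) pairs; the names `shard_digest` and `shards_to_stream` are our illustrative assumptions, not the paper's API.

```python
import hashlib

# Illustrative sketch of digest exchange + selective replay: shards whose
# digests match are skipped, and only mismatched shards are replayed.

def shard_digest(shard: dict) -> str:
    """Compact digest of one shard: hash of its sorted (key, version) pairs."""
    h = hashlib.sha256()
    for key in sorted(shard):
        _, version = shard[key]
        h.update(f"{key}:{version};".encode())
    return h.hexdigest()

def shards_to_stream(rejoining: dict, replica: dict) -> list:
    """Compare digests shard-by-shard; return the ids needing replay."""
    return [sid for sid in replica
            if shard_digest(rejoining.get(sid, {})) != shard_digest(replica[sid])]

replica = {
    0: {"x": ("v1", 3), "y": ("v2", 1)},
    1: {"z": ("v9", 7)},
}
rejoining = {
    0: {"x": ("v1", 3), "y": ("v2", 1)},  # shard 0 survived the crash intact
    1: {"z": ("v0", 2)},                  # shard 1 is stale
}
print(shards_to_stream(rejoining, replica))  # → [1]
```

Exchanging fixed-size digests rather than full state keeps the coordination phase cheap, so surviving nodes can keep serving requests while the comparison runs.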

Results & Findings

| Metric | HEAL (Leaderless) | Conventional Leaderless Recovery | Leader-Based Incremental Recovery |
|---|---|---|---|
| Average Recovery Latency | 120 ms | 360 s (≈ 3000× slower) | 2.4 s (≈ 20× slower than HEAL) |
| Throughput Degradation During Recovery | 8.7 % drop | 16.2 % drop | 62.4 % drop |
| Impact of Persistency Model | Minimal (≤ 2 ms variance) | Same as baseline | Similar to baseline |
| Scalability (up to 12 nodes) | Recovery latency grows roughly linearly, still < 250 ms | Exponential slowdown | Moderate slowdown |

Interpretation:

  • Speed: By transferring only the delta, HEAL reduces recovery from minutes to a fraction of a second, making failure “invisible” to most client workloads.
  • Throughput: The modest 8.7 % dip means services can keep serving requests while a node rejoins, a stark contrast to the 16 %+ hit of naïve leaderless schemes.
  • Model‑agnostic: The algorithm’s performance holds across volatile, log‑based, and persistent‑memory setups, confirming its generality.

Practical Implications

  • Microservice & Edge Deployments: Systems that deliberately avoid a single point of failure (e.g., serverless function caches, edge data stores) can now recover from node churn without sacrificing latency guarantees.
  • Cost Savings: Faster recovery translates to less over‑provisioning; operators can run smaller clusters while still meeting SLAs.
  • Simplified Ops: No need to elect a new leader or perform heavyweight checkpointing after each failure, reducing operational complexity and the risk of split‑brain scenarios.
  • Compatibility with Emerging Hardware: HEAL’s design works equally well with NVRAM or Intel Optane, allowing developers to adopt persistent memory without redesigning recovery logic.
  • Potential Integration Paths: Existing leaderless frameworks (e.g., Cassandra’s “no‑leader” mode, Dynamo‑style stores, CRDT libraries) could adopt HEAL’s incremental recovery module to boost resilience.

Limitations & Future Work

  • Scale Beyond Small Clusters: Experiments were limited to ≤ 12 nodes; the authors note that network topology and gossip‑based failure detection could become bottlenecks at larger scales.
  • Assumes Linearizability: While many services need strong consistency, some tolerate eventual consistency; extending HEAL to weaker models could broaden its applicability.
  • Recovery under Simultaneous Multi‑Node Failures: The current design focuses on a single-node crash; handling correlated failures (e.g., rack loss) remains an open challenge.
  • Security Considerations: State digest exchanges are not authenticated in the prototype; integrating cryptographic verification would be necessary for production environments.

Overall, HEAL offers a compelling blueprint for making leaderless distributed systems truly fault‑tolerant without paying the traditional performance penalty.

Authors

  • Antonis Psistakis
  • Burak Ocalan
  • Fabien Chaix
  • Ramnatthan Alagappan
  • Josep Torrellas

Paper Information

  • arXiv ID: 2602.08257v1
  • Categories: cs.DC
  • Published: February 9, 2026