[Paper] Towards CXL Resilience to CPU Failures
Source: arXiv - 2602.08271v1
Overview
The paper introduces ReCXL, an extension to the Compute Express Link (CXL) 3.0 standard that makes shared‑memory clusters resilient to CPU (node) failures. By adding lightweight replication and hardware logging to the coherence protocol, ReCXL can recover a consistent application state after a node crashes, with only about a 30 % performance penalty compared with a non‑fault‑tolerant system.
Key Contributions
- Resilient Coherence Protocol – Augments each write transaction with a small set of replica nodes that store a copy of the update in a dedicated hardware Logging Unit (LU).
- Hardware Logging Unit Design – Defines a minimal, low‑latency log buffer that can be flushed to main memory periodically, providing durable metadata for recovery.
- Recovery Procedure – Shows how, after a node failure, the remaining nodes use the logs to reconstruct the directory and memory state, restoring the system to a consistent point in time.
- Specification Extension – Proposes concrete changes to the CXL spec (message formats, error‑handling semantics) that enable the above mechanisms without breaking existing CXL 3.0 functionality.
- Performance Evaluation – Demonstrates that the added fault‑tolerance incurs ~30 % slowdown on typical HPC/AI workloads, far lower than software‑only checkpoint/restart approaches.
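To make the spec-extension idea concrete, here is a hypothetical wire layout for the "replication payload" message. The field names and widths are illustrative assumptions, not the formats actually proposed in the paper; only the 64-byte cache-line size is standard CXL.

```python
import struct

# Hypothetical layout: opcode (1 B), replica count (1 B), sequence number (2 B),
# cache-line address (8 B), cache-line payload (64 B). Little-endian, packed.
REPL_HDR = struct.Struct("<BBHQ64s")

def encode_replication(opcode, replica_count, seq, addr, line):
    """Pack one replication-payload message for transmission to a replica's LU."""
    assert len(line) == 64, "one CXL cache line is 64 bytes"
    return REPL_HDR.pack(opcode, replica_count, seq, addr, line)

def decode_replication(buf):
    """Unpack a replication-payload message on the replica side."""
    opcode, replica_count, seq, addr, line = REPL_HDR.unpack(buf)
    return {"opcode": opcode, "replicas": replica_count,
            "seq": seq, "addr": addr, "line": line}
```

A round trip (`decode_replication(encode_replication(...))`) recovers the original fields, which is the property any real format extension would need alongside backward compatibility with existing CXL 3.0 message decoding.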
Methodology
- Protocol Augmentation – For every cache‑line write, the originating node sends the normal coherence message plus a “replication payload” to a configurable subset of peer nodes (the Replicas).
- Logging Unit (LU) – Each replica stores the incoming payload in a small, fast on‑chip log buffer. The LU is designed to survive a node crash (e.g., powered by a separate power domain).
- Periodic Flush – A background daemon on each node triggers the LU to write its accumulated logs to non‑volatile memory (or persistent DRAM) at regular intervals, making the logged updates durable; updates still in the LU between flushes are protected by its separate power domain.
- Failure Detection & Recovery – Upon detecting a node failure (via CXL error signals), the surviving nodes read the persisted logs, replay the updates to rebuild the directory state, and resume execution from the last consistent point.
- Evaluation Setup – The authors implemented ReCXL in a cycle‑accurate CXL simulator and ran a suite of memory‑intensive benchmarks (STREAM, Graph500, deep‑learning training kernels). They measured throughput, latency, and recovery time under injected node failures.
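The replicate–log–flush–replay flow above can be sketched as a small software model. This is a minimal illustration under assumed semantics (the `Cluster`, `Node`, and `LogEntry` names are invented here, and the real mechanism operates on coherence messages in hardware, not Python dictionaries):

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    """One replicated write: sequence number, cache-line address, new data."""
    seq: int
    addr: int
    data: bytes

@dataclass
class Node:
    node_id: int
    memory: dict = field(default_factory=dict)     # addr -> bytes
    log: list = field(default_factory=list)        # stand-in for the on-chip LU
    persisted: list = field(default_factory=list)  # logs flushed to durable memory

    def flush_log(self):
        """Periodic flush: move accumulated LU entries to durable storage."""
        self.persisted.extend(self.log)
        self.log.clear()

class Cluster:
    def __init__(self, n_nodes, replicas_per_write=2):
        self.nodes = [Node(i) for i in range(n_nodes)]
        self.replicas_per_write = replicas_per_write
        self.seq = 0

    def write(self, origin, addr, data):
        """Apply the write locally, then send a replication payload to peer LUs."""
        self.seq += 1
        self.nodes[origin].memory[addr] = data
        peers = [n for n in self.nodes if n.node_id != origin]
        for replica in peers[:self.replicas_per_write]:  # static replica set
            replica.log.append(LogEntry(self.seq, addr, data))

    def recover(self, failed_id):
        """Replay surviving logs to rebuild the failed node's written state,
        keeping only the latest write per address (highest sequence number)."""
        latest = {}
        for n in self.nodes:
            if n.node_id == failed_id:
                continue
            for e in n.persisted + n.log:  # LU assumed to survive the crash
                if e.addr not in latest or e.seq > latest[e.addr].seq:
                    latest[e.addr] = e
        return {e.addr: e.data for e in latest.values()}
```

For example, after `Cluster(4).write(0, 0x100, b"A")`, a crash of node 0 is recoverable because nodes 1 and 2 hold the logged update, and `recover(0)` replays it.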
Results & Findings
| Metric | Baseline (no fault‑tolerance) | ReCXL (with fault‑tolerance) |
|---|---|---|
| Average throughput (GB/s) | 112 | 78 (≈30 % slowdown) |
| Latency per write (ns) | 45 | 58 |
| Recovery time after node crash | N/A (requires full restart) | 1.2 s on average (log replay) |
| Memory overhead for logs | — | 3 % of total DRAM capacity |
- Performance Impact – The extra replication traffic is limited to a small replica set (typically 2‑3 nodes), keeping bandwidth overhead modest.
- Fast Recovery – Because logs are already persisted, the system can resume within seconds, far quicker than traditional checkpoint/restart (which can take minutes).
- Scalability – Experiments with up to 64 nodes show that the fault‑tolerance cost grows only linearly with node count, indicating the approach remains practical as cluster size increases.
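The headline overheads follow directly from the table and can be checked with a few lines of arithmetic:

```python
# Figures taken from the evaluation table above.
baseline_bw, recxl_bw = 112.0, 78.0    # average throughput, GB/s
baseline_lat, recxl_lat = 45.0, 58.0   # latency per write, ns

bw_slowdown = 1.0 - recxl_bw / baseline_bw      # fraction of throughput lost
lat_overhead = recxl_lat / baseline_lat - 1.0   # added per-write latency

print(f"throughput slowdown: {bw_slowdown:.1%}")     # prints "throughput slowdown: 30.4%"
print(f"write-latency overhead: {lat_overhead:.1%}") # prints "write-latency overhead: 28.9%"
```

Both overheads land near the ~30 % figure quoted in the paper's summary.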
Practical Implications
- Higher Availability for Distributed AI/ML – Training jobs that run for days can survive a single node failure without a full restart, reducing wasted compute time and cloud costs.
- Simplified System Software – Operating systems and runtime libraries can rely on hardware‑assisted resilience, lowering the need for heavyweight checkpoint libraries.
- Edge & Fog Deployments – In environments where power loss or CPU crashes are common (e.g., autonomous vehicles, IoT gateways), ReCXL’s hardware logging offers a lightweight way to keep shared state consistent.
- Future CXL‑Based Accelerators – Designers of GPUs, FPGAs, or custom AI ASICs that connect via CXL can adopt the proposed spec extensions to provide built‑in fault tolerance, making heterogeneous clusters more robust.
Limitations & Future Work
- Replica Selection Overhead – The current design uses a static replica set; dynamic selection based on workload or network topology could further reduce latency.
- Log Buffer Size – The LU is sized for typical workloads; extreme write‑intensive applications may require larger buffers or more frequent flushes, impacting performance.
- Power‑Domain Assumptions – The resilience relies on the LU surviving a node power loss; hardware implementations must guarantee this, which may increase silicon cost.
- Broader Failure Modes – The paper focuses on CPU/node crashes; handling network partitions, memory controller failures, or simultaneous multi‑node failures remains open.
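The log-buffer sizing concern can be made concrete with a back-of-envelope bound: every replicated write arriving between two flushes must fit in the LU. This formula and the example numbers are illustrative assumptions, not figures from the paper:

```python
def min_lu_bytes(write_rate_gbps, flush_interval_ms):
    """Smallest LU that can absorb all replicated write traffic arriving
    between two consecutive flushes (illustrative lower bound)."""
    return write_rate_gbps * 1e9 * (flush_interval_ms / 1e3)

# e.g. 78 GB/s of replicated write traffic with a 1 ms flush interval
# needs on the order of 78 MB of log buffer
required = min_lu_bytes(78, 1)
```

The bound makes the trade-off explicit: write-intensive workloads force either a larger (costlier) LU or a shorter flush interval, and shorter intervals steal memory bandwidth from the application.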
Future research directions include adaptive replication strategies, integration with existing checkpoint/restart frameworks for multi‑failure scenarios, and prototyping the design on real CXL‑enabled hardware platforms.
Authors
- Antonis Psistakis
- Burak Ocalan
- Chloe Alverti
- Fabien Chaix
- Ramnatthan Alagappan
- Josep Torrellas
Paper Information
- arXiv ID: 2602.08271v1
- Categories: cs.DC
- Published: February 9, 2026