[Paper] Towards CXL Resilience to CPU Failures
Source: arXiv - 2602.08271v1
Overview
The paper introduces ReCXL, an extension to the Compute Express Link (CXL) 3.0 standard that makes shared‑memory clusters resilient to CPU (node) failures. By adding lightweight replication and hardware logging to the coherence protocol, ReCXL can recover a consistent application state after a node crashes, with only about a 30 % performance penalty compared with a non‑fault‑tolerant system.
Key Contributions
- Resilient Coherence Protocol – Augments each write transaction with a small set of replica nodes that store a copy of the update in a dedicated hardware Logging Unit (LU).
- Hardware Logging Unit Design – Defines a minimal, low‑latency log buffer that can be flushed to main memory periodically, providing durable metadata for recovery.
- Recovery Procedure – Shows how, after a node failure, the remaining nodes use the logs to reconstruct the directory and memory state, restoring the system to a consistent point in time.
- Specification Extension – Proposes concrete changes to the CXL spec (message formats, error‑handling semantics) that enable the above mechanisms without breaking existing CXL 3.0 functionality.
- Performance Evaluation – Demonstrates that the added fault‑tolerance incurs ~30 % slowdown on typical HPC/AI workloads, far lower than software‑only checkpoint/restart approaches.
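To make the spec-extension idea concrete, here is a hypothetical wire layout for the "replication payload" message. The field names and widths are illustrative assumptions, not the formats actually proposed in the paper; only the 64-byte cache-line size is standard CXL.

```python
import struct

# Hypothetical layout: opcode (1 B), replica count (1 B), sequence number (2 B),
# cache-line address (8 B), cache-line payload (64 B). Little-endian, packed.
REPL_HDR = struct.Struct("<BBHQ64s")

def encode_replication(opcode, replica_count, seq, addr, line):
    """Pack one replication-payload message for transmission to a replica's LU."""
    assert len(line) == 64, "one CXL cache line is 64 bytes"
    return REPL_HDR.pack(opcode, replica_count, seq, addr, line)

def decode_replication(buf):
    """Unpack a replication-payload message on the replica side."""
    opcode, replica_count, seq, addr, line = REPL_HDR.unpack(buf)
    return {"opcode": opcode, "replicas": replica_count,
            "seq": seq, "addr": addr, "line": line}
```

A round trip (`decode_replication(encode_replication(...))`) recovers the original fields, which is the property any real format extension would need alongside backward compatibility with existing CXL 3.0 message decoding.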
Methodology
- Protocol Augmentation – For every cache‑line write, the originating node sends the normal coherence message plus a “replication payload” to a configurable subset of peer nodes (the Replicas).
- Logging Unit (LU) – Each replica stores the incoming payload in a small, fast on‑chip log buffer. The LU is designed to survive a node crash (e.g., powered by a separate power domain).
- Periodic Flush – A background daemon on each node triggers the LU to write its accumulated logs to non‑volatile memory (or persistent DRAM) at regular intervals, making the logged updates durable; updates still in the LU between flushes are protected by its separate power domain.
- Failure Detection & Recovery – Upon detecting a node failure (via CXL error signals), the surviving nodes read the persisted logs, replay the updates to rebuild the directory state, and resume execution from the last consistent point.
- Evaluation Setup – The authors implemented ReCXL in a cycle‑accurate CXL simulator and ran a suite of memory‑intensive benchmarks (STREAM, Graph500, deep‑learning training kernels). They measured throughput, latency, and recovery time under injected node failures.
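The replicate–log–flush–replay flow above can be sketched as a small software model. This is a minimal illustration under assumed semantics (the `Cluster`, `Node`, and `LogEntry` names are invented here, and the real mechanism operates on coherence messages in hardware, not Python dictionaries):

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    """One replicated write: sequence number, cache-line address, new data."""
    seq: int
    addr: int
    data: bytes

@dataclass
class Node:
    node_id: int
    memory: dict = field(default_factory=dict)     # addr -> bytes
    log: list = field(default_factory=list)        # stand-in for the on-chip LU
    persisted: list = field(default_factory=list)  # logs flushed to durable memory

    def flush_log(self):
        """Periodic flush: move accumulated LU entries to durable storage."""
        self.persisted.extend(self.log)
        self.log.clear()

class Cluster:
    def __init__(self, n_nodes, replicas_per_write=2):
        self.nodes = [Node(i) for i in range(n_nodes)]
        self.replicas_per_write = replicas_per_write
        self.seq = 0

    def write(self, origin, addr, data):
        """Apply the write locally, then send a replication payload to peer LUs."""
        self.seq += 1
        self.nodes[origin].memory[addr] = data
        peers = [n for n in self.nodes if n.node_id != origin]
        for replica in peers[:self.replicas_per_write]:  # static replica set
            replica.log.append(LogEntry(self.seq, addr, data))

    def recover(self, failed_id):
        """Replay surviving logs to rebuild the failed node's written state,
        keeping only the latest write per address (highest sequence number)."""
        latest = {}
        for n in self.nodes:
            if n.node_id == failed_id:
                continue
            for e in n.persisted + n.log:  # LU assumed to survive the crash
                if e.addr not in latest or e.seq > latest[e.addr].seq:
                    latest[e.addr] = e
        return {e.addr: e.data for e in latest.values()}
```

For example, after `Cluster(4).write(0, 0x100, b"A")`, a crash of node 0 is recoverable because nodes 1 and 2 hold the logged update, and `recover(0)` replays it.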
Results & Findings
| Metric | Baseline (no fault‑tolerance) | ReCXL (with fault‑tolerance) |
|---|---|---|
| Average throughput (GB/s) | 112 | 78 (≈30 % slowdown) |
| Latency per write (ns) | 45 | 58 |
| Recovery time after node crash | N/A (requires full restart) | 1.2 s on average (log replay) |
| Memory overhead for logs | — | 3 % of total DRAM capacity |
- Performance Impact – The extra replication traffic is limited to a small replica set (typically 2‑3 nodes), keeping bandwidth overhead modest.
- Fast Recovery – Because logs are already persisted, the system can resume within seconds, far quicker than traditional checkpoint/restart (which can take minutes).
- Scalability – Experiments with up to 64 nodes show that the fault‑tolerance cost grows only linearly with node count, indicating the approach remains practical as cluster size increases.
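The headline overheads follow directly from the table and can be checked with a few lines of arithmetic:

```python
# Figures taken from the evaluation table above.
baseline_bw, recxl_bw = 112.0, 78.0    # average throughput, GB/s
baseline_lat, recxl_lat = 45.0, 58.0   # latency per write, ns

bw_slowdown = 1.0 - recxl_bw / baseline_bw      # fraction of throughput lost
lat_overhead = recxl_lat / baseline_lat - 1.0   # added per-write latency

print(f"throughput slowdown: {bw_slowdown:.1%}")     # prints "throughput slowdown: 30.4%"
print(f"write-latency overhead: {lat_overhead:.1%}") # prints "write-latency overhead: 28.9%"
```

Both overheads land near the ~30 % figure quoted in the paper's summary.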
Practical Implications
- Higher Availability for Distributed AI/ML – Training jobs that run for days can survive a single node failure without a full restart, reducing wasted compute time and cloud costs.
- Simplified System Software – Operating systems and runtime libraries can rely on hardware‑assisted resilience, lowering the need for heavyweight checkpoint libraries.
- Edge & Fog Deployments – In environments where power loss or CPU crashes are common (e.g., autonomous vehicles, IoT gateways), ReCXL’s hardware logging offers a lightweight way to keep shared state consistent.
- Future CXL‑Based Accelerators – Designers of GPUs, FPGAs, or custom AI ASICs that connect via CXL can adopt the proposed spec extensions to provide built‑in fault tolerance, making heterogeneous clusters more robust.
Limitations & Future Work
- Replica Selection Overhead – The current design uses a static replica set; dynamic selection based on workload or network topology could further reduce latency.
- Log Buffer Size – The LU is sized for typical workloads; extreme write‑intensive applications may require larger buffers or more frequent flushes, impacting performance.
- Power‑Domain Assumptions – The resilience relies on the LU surviving a node power loss; hardware implementations must guarantee this, which may increase silicon cost.
- Broader Failure Modes – The paper focuses on CPU/node crashes; handling network partitions, memory controller failures, or simultaneous multi‑node failures remains open.
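The log-buffer sizing concern can be made concrete with a back-of-envelope bound: every replicated write arriving between two flushes must fit in the LU. This formula and the example numbers are illustrative assumptions, not figures from the paper:

```python
def min_lu_bytes(write_rate_gbps, flush_interval_ms):
    """Smallest LU that can absorb all replicated write traffic arriving
    between two consecutive flushes (illustrative lower bound)."""
    return write_rate_gbps * 1e9 * (flush_interval_ms / 1e3)

# e.g. 78 GB/s of replicated write traffic with a 1 ms flush interval
# needs on the order of 78 MB of log buffer
required = min_lu_bytes(78, 1)
```

The bound makes the trade-off explicit: write-intensive workloads force either a larger (costlier) LU or a shorter flush interval, and shorter intervals steal memory bandwidth from the application.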
Future research directions include adaptive replication strategies, integration with existing checkpoint/restart frameworks for multi‑failure scenarios, and prototyping the design on real CXL‑enabled hardware platforms.
Authors
- Antonis Psistakis
- Burak Ocalan
- Chloe Alverti
- Fabien Chaix
- Ramnatthan Alagappan
- Josep Torrellas
Paper Information
- arXiv ID: 2602.08271v1
- Categories: cs.DC
- Published: February 9, 2026