[Paper] LEFT-RS: A Lock-Free Fault-Tolerant Resource Sharing Protocol for Multicore Real-Time Systems

Published: December 25, 2025 at 09:52 AM EST
Source: arXiv

Overview

The paper introduces LEFT‑RS, a lock‑free, fault‑tolerant protocol that lets multiple real‑time tasks on multicore embedded systems share resources without the traditional blocking caused by locks. By allowing tasks to read shared data in parallel and recover quickly from transient faults, LEFT‑RS dramatically improves both timing predictability and overall system schedulability.

Key Contributions

  • Lock‑free resource sharing: Eliminates conventional mutexes, enabling concurrent reads of global resources while still guaranteeing exclusive writes.
  • Integrated fault tolerance: Detects transient faults inside critical sections and lets fault‑free tasks finish early, reducing the cascade of errors across tasks.
  • Bounded timing analysis: Provides a worst‑case response‑time (WCRT) model that preserves hard real‑time guarantees despite the lock‑free design.
  • Scalable parallel recovery: Uses lightweight parallel replica execution to recover from faults without the heavy coordination overhead of prior approaches.
  • Empirical validation: Shows up to 84.5 % improvement in schedulability on average compared with state‑of‑the‑art locking and fault‑tolerant schemes.
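
The parallel-recovery idea can be pictured with a generic redundancy pattern: run several replicas of the same critical section at once and accept the result that a strict majority agrees on. This is only a sketch of that general technique, not the paper's mechanism; `run_replicas` and the majority-vote rule are assumptions introduced here for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional, Sequence


def run_replicas(replicas: Sequence[Callable[[], bytes]]) -> Optional[bytes]:
    """Run all replicas concurrently and return the strict-majority result.

    Hypothetical helper: returns None when no single result is produced by
    more than half of the replicas. Expects at least one replica.
    """
    # One worker per replica, so no replica waits for another to finish.
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        results = list(pool.map(lambda replica: replica(), replicas))
    # Vote on the raw results; a faulty replica is simply outvoted.
    value, votes = Counter(results).most_common(1)[0]
    return value if votes > len(replicas) // 2 else None
```

Because the replicas run side by side, recovery costs roughly one execution plus the vote, whereas sequential re-execution pays for each replica in turn. On a real multicore target each replica would run on its own core; Python threads are used here only to show the structure.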

Methodology

  1. Parallel Critical Sections – Instead of a single task holding a lock, LEFT‑RS lets every task enter its critical section simultaneously. Reads are performed on a shared snapshot of the resource, while writes are staged locally.
  2. Fault Detection & Early Exit – Each task runs a lightweight checksum on its local copy. If a fault is detected, the task aborts its critical section, discarding its changes. Fault‑free tasks that have already validated their work can commit early, freeing the resource for others.
  3. Commit Protocol – A lightweight, lock‑free commit phase uses atomic compare‑and‑swap (CAS) operations to merge validated writes into the global state. Because only one task can successfully CAS at a time, mutual exclusion is achieved without a traditional lock.
  4. Timing Analysis – The authors extend classic response‑time analysis (RTA) to account for:
    • Parallel execution of critical sections,
    • Potential aborts due to faults,
    • The bounded overhead of the CAS‑based commit.
     Together these yield a closed‑form WCRT bound that can be plugged into existing schedulability tests.
  5. Evaluation Platform – Experiments were run on a set of synthetic task sets and a realistic automotive ECU benchmark, comparing LEFT‑RS against:
    • Traditional lock‑based protocols (e.g., MPCP, FMLP),
    • Existing fault‑tolerant schemes that rely on sequential replicas.
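
Steps 1 to 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `Snapshot`, `AtomicCell`, `run_critical_section`, and the `inject_fault` flag are hypothetical names, CRC32 stands in for whatever checksum the paper uses, and because Python has no native CAS, `AtomicCell` emulates the single atomic instruction with a private lock.

```python
import threading
import zlib
from typing import Callable, NamedTuple, Optional


class Snapshot(NamedTuple):
    version: int    # bumped on every successful commit
    payload: bytes  # immutable contents of the shared resource


class AtomicCell:
    """Stand-in for a hardware CAS word (hypothetical helper).

    The private lock only emulates the atomicity of one CAS instruction;
    it is never held while a task runs its critical section.
    """

    def __init__(self, value: Snapshot) -> None:
        self._value = value
        self._guard = threading.Lock()

    def load(self) -> Snapshot:
        return self._value

    def compare_and_swap(self, expected: Snapshot, new: Snapshot) -> bool:
        with self._guard:
            if self._value is expected:  # nobody committed since we read
                self._value = new
                return True
            return False


def run_critical_section(
    cell: AtomicCell,
    update: Callable[[bytes], bytes],
    inject_fault: bool = False,
) -> Optional[Snapshot]:
    """One attempt: read the snapshot, stage a write, validate, CAS-commit.

    Returns the committed snapshot, or None if the attempt aborted on a
    detected fault or lost the CAS race. `update` must return non-empty bytes.
    """
    snap = cell.load()             # step 1: read the shared snapshot
    staged = update(snap.payload)  # step 1: stage the write locally
    digest = zlib.crc32(staged)    # step 2: checksum of the local copy
    if inject_fault:               # simulate a transient single-bit flip
        staged = bytes([staged[0] ^ 0x01]) + staged[1:]
    if zlib.crc32(staged) != digest:
        return None                # step 2: fault detected, discard changes
    new = Snapshot(snap.version + 1, staged)
    return new if cell.compare_and_swap(snap, new) else None  # step 3
```

A task whose CAS fails has lost the race to another committer. In this sketch it would simply call `run_critical_section` again to retry from a fresh snapshot.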

Results & Findings

| Metric | LEFT‑RS | Best Prior Lock‑Based | Prior Fault‑Tolerant (Replica) |
|---|---|---|---|
| Schedulability gain | ↑ 84.5 % (avg.) | baseline | ↑ 38 % |
| Average CPU utilization | ↓ 12 % (less blocking) | higher due to lock wait | similar to LEFT‑RS but with higher overhead |
| Fault recovery latency | ≤ 1.2 × single‑task exec time | N/A (no recovery) | ↑ 2.5 × single‑task exec time |
| Commit overhead | 1–2 CAS ops per critical section | lock acquire/release | multiple synchronization points |

Key takeaways

  • Lock‑free access cuts the worst‑case blocking time dramatically, which directly translates into higher task‑set acceptance.
  • Early‑exit on fault prevents a single corrupted task from stalling all others, a common problem in traditional lock‑based designs.
  • The CAS‑based commit adds negligible overhead (just a couple of atomic instructions), making the approach practical on low‑power microcontrollers.
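
To see why a smaller blocking term raises task‑set acceptance, it helps to run the classic fixed‑priority response‑time recurrence, R_i = C_i + B_i + Σ_{j ∈ hp(i)} ⌈R_i / T_j⌉ · C_j, with two different blocking bounds. The sketch below uses that textbook recurrence, not the paper's extended analysis, and the task parameters are made‑up values chosen only for illustration.

```python
from typing import List, Optional, Tuple

# (C, T, B): worst-case execution time, period (= deadline), blocking bound.
Task = Tuple[int, int, int]


def response_time(tasks: List[Task], i: int) -> Optional[int]:
    """Classic fixed-priority RTA for task i (index 0 = highest priority).

    Returns the worst-case response time, or None if it exceeds the deadline.
    """
    c, t, b = tasks[i]
    r = c + b
    while True:
        # Interference from every higher-priority task released within r.
        interference = sum(-(-r // tj) * cj for cj, tj, _ in tasks[:i])
        nxt = c + b + interference
        if nxt > t:
            return None  # deadline miss
        if nxt == r:
            return r     # fixed point reached
        r = nxt


def schedulable(tasks: List[Task]) -> bool:
    return all(response_time(tasks, i) is not None for i in range(len(tasks)))


# Hypothetical task sets that differ only in the blocking bound of task 2.
large_blocking = [(1, 4, 0), (2, 6, 0), (3, 12, 3)]
small_blocking = [(1, 4, 0), (2, 6, 0), (3, 12, 1)]
```

With these hypothetical numbers, a blocking bound of 3 pushes the lowest‑priority task past its deadline of 12, while a bound of 1 gives a response time of 11 and the whole set passes.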

Practical Implications

  • Automotive & Aerospace – Safety‑critical ECUs can now run tighter control loops on multicore silicon without sacrificing determinism, even when transient electromagnetic interference is expected.
  • Industrial IoT – Edge devices that share sensor buffers or actuators can maintain high throughput while still meeting hard deadlines, reducing the need for over‑provisioned cores.
  • OS & Runtime Designers – LEFT‑RS can be integrated as a library or kernel extension, offering a drop‑in replacement for mutexes in real‑time POSIX‑like APIs (e.g., pthread_mutex).
  • Developer Tooling – The WCRT analysis is compatible with existing schedulability analysis tools (e.g., Cheddar), allowing engineers to evaluate the impact of switching to LEFT‑RS without rewriting models.

In short, LEFT‑RS gives developers a way to keep cores doing useful work instead of waiting on locks, while still guaranteeing that critical sections complete on time, even in the presence of transient faults.

Limitations & Future Work

  • Fault Model – The protocol assumes transient faults that can be detected via checksums; permanent hardware failures still require higher‑level redundancy.
  • Resource Types – LEFT‑RS focuses on read‑mostly shared data with occasional writes; heavily write‑contended resources may still suffer from commit contention.
  • Hardware Support – The analysis presumes atomic CAS is available and fast; on some ultra‑low‑power cores without native CAS, a software fallback could increase overhead.
  • Scalability Beyond 8 Cores – Experiments were capped at 8 cores; the authors plan to explore hierarchical commit schemes for many‑core systems.

Future research directions include extending the protocol to mixed‑criticality systems, integrating hardware error‑detecting codes for more robust fault detection, and evaluating LEFT‑RS on heterogeneous platforms (e.g., CPU‑GPU combos) where resource sharing spans different execution units.

Authors

  • Nan Chen
  • Xiaotian Dai
  • Tong Cheng
  • Alan Burns
  • Iain Bate
  • Shuai Zhao

Paper Information

  • arXiv ID: 2512.21701v1
  • Categories: cs.OS, cs.DC
  • Published: December 25, 2025