[Paper] Basic Lock Algorithms in Lightweight Thread Environments

Published: December 9, 2025 at 08:03 AM EST
4 min read
Source: arXiv - 2512.08563v1

Overview

The paper Basic Lock Algorithms in Lightweight Thread Environments examines how two classic mutex designs, TTAS (Test‑and‑Test‑and‑Set) and MCS (Mellor‑Crummey‑Scott), behave when they are used by lightweight threads (coroutines, async calls) instead of traditional OS threads. Because lightweight threads only switch context at explicit points (yield or sleep), the authors show that naïvely porting OS‑thread locks can cause deadlocks and severe performance regressions. They propose lightweight‑thread‑aware variants and a hybrid "cohort" lock that works robustly across different coroutine libraries.

Key Contributions

  • Problem identification: Demonstrates that conventional OS‑thread mutexes are unsafe for coroutine‑based concurrency because coroutine runtimes lack the implicit pre‑emptive scheduling those mutexes rely on.
  • Lock adaptations: Provides concrete modifications to TTAS and MCS locks that incorporate coroutine‑specific context‑switch primitives (yield & sleep).
  • Hybrid cohort lock: Introduces a lock that combines several MCS queues with a shared TTAS front‑end, delivering stable performance across diverse lightweight‑thread runtimes.
  • Empirical evaluation: Benchmarks the three lock families (TTAS, MCS, cohort) on multiple C++ coroutine libraries, highlighting scenarios where each excels or fails.
  • Guidelines for developers: Offers practical recommendations on selecting or implementing mutexes when targeting coroutine‑heavy codebases.

Methodology

  1. Modeling lightweight threads: The authors treat coroutines as user‑level schedulable units that only switch when the program explicitly calls yield() (cooperative) or when the runtime forces a sleep() (pre‑emptive).
  2. Lock redesign:
    • TTAS‑LT inserts a yield() after a failed spin so that other coroutines get a chance to run and release the lock (see the first sketch after this list).
    • MCS‑LT augments each queue node with a sleep() call while a coroutine waits its turn, preventing busy‑waiting that would otherwise block the scheduler (see the second sketch after this list).
  3. Cohort lock construction: Multiple per‑core MCS queues feed into a global TTAS gate; the gate protects the transition between queues, while each queue handles local contention.
  4. Experimental setup: The authors evaluate the three lock variants on four popular C++ coroutine libraries (Boost.Asio, libco, cppcoro, and standard C++20 coroutines). They vary contention levels, critical‑section lengths, and numbers of logical cores, measuring latency, throughput, and CPU utilization.
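
Below is a minimal C++ sketch of the TTAS‑LT idea under stated assumptions: the class name, the yield_fn hook, and the use of std::this_thread::yield as a stand‑in for a coroutine runtime's yield() primitive are ours for illustration, not the paper's actual code.

```cpp
#include <atomic>
#include <thread>

// Sketch of a TTAS lock with an explicit context-switch point after a
// failed spin, in the spirit of the paper's TTAS-LT. A real runtime would
// install its coroutine-yield primitive as the hook.
class ttas_lt_lock {
public:
    using yield_fn = void (*)();

    explicit ttas_lt_lock(yield_fn y = &default_yield) : yield_(y) {}

    void lock() {
        for (;;) {
            // "Test" phase: spin on plain loads to avoid cache-line traffic,
            // yielding so the lock holder can actually run and unlock.
            while (locked_.load(std::memory_order_relaxed))
                yield_();
            // "Test-and-set" phase: attempt the actual acquisition.
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;
            yield_();  // lost the race; yield before spinning again
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    static void default_yield() { std::this_thread::yield(); }  // placeholder hook

    std::atomic<bool> locked_{false};
    yield_fn yield_;
};
```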
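
And a sketch of MCS‑LT in the same spirit: each waiter owns a queue node and hands control back to the scheduler between checks instead of busy‑waiting. Again, the names (mcs_lt_lock, qnode, wait_fn) and the yield stand‑in are our assumptions.

```cpp
#include <atomic>
#include <thread>

// MCS queue lock whose waiters call an explicit wait hook rather than
// busy-waiting; a real coroutine runtime would park the coroutine here.
class mcs_lt_lock {
public:
    struct qnode {
        std::atomic<qnode*> next{nullptr};
        std::atomic<bool>   locked{false};
    };

    using wait_fn = void (*)();

    explicit mcs_lt_lock(wait_fn w = &default_wait) : wait_(w) {}

    void lock(qnode& me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Join the queue; the predecessor (if any) will hand the lock to us.
        qnode* pred = tail_.exchange(&me, std::memory_order_acq_rel);
        if (pred != nullptr) {
            pred->next.store(&me, std::memory_order_release);
            // The paper's sleep() point: return control to the scheduler
            // between checks instead of spinning hot.
            while (me.locked.load(std::memory_order_acquire))
                wait_();
        }
    }

    void unlock(qnode& me) {
        qnode* succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No known successor: try to empty the queue.
            qnode* expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr,
                                              std::memory_order_acq_rel))
                return;
            // A successor is linking itself in; wait for the pointer.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr)
                wait_();
        }
        succ->locked.store(false, std::memory_order_release);  // pass the lock
    }

private:
    static void default_wait() { std::this_thread::yield(); }  // placeholder hook

    std::atomic<qnode*> tail_{nullptr};
    wait_fn wait_;
};
```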

Results & Findings

| Lock variant | Low contention | High contention (many coroutines) | Critical‑section length |
|---|---|---|---|
| TTAS‑LT | Excellent latency (few yields) | Severe slowdown due to excessive yielding | Works well when CS is short |
| MCS‑LT | Competitive, but higher per‑lock overhead | Scales better; avoids busy‑waiting deadlocks | Handles longer CS gracefully |
| Cohort | Slightly higher latency than TTAS‑LT | Consistently high throughput; near‑optimal across all settings | Balanced performance for both short and long CS |

Key take‑aways

  • Pure TTAS fails when many coroutines block on a lock: they keep yielding without making progress, a form of livelock.
  • Pure MCS eliminates livelock but incurs extra memory traffic and context‑switch overhead, which hurts low‑contention cases.
  • The cohort lock achieves a sweet spot: it limits the number of coroutines that enter the expensive MCS queue while still protecting against deadlock, delivering stable performance across libraries and workloads (see the sketch below).
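
A structural sketch of the hybrid design, following the methodology description (per‑cohort MCS queues feeding a shared TTAS gate) and reusing the ttas_lt_lock and mcs_lt_lock sketches above. It deliberately omits the intra‑cohort hand‑off optimization a production cohort lock would add, and the class and parameter names are ours.

```cpp
#include <cstddef>
#include <vector>

// Hybrid "cohort" lock sketch: coroutines first queue on the local MCS lock
// for their cohort (e.g. their core's group); only each cohort's winner
// contends on the shared TTAS gate. Assumes the ttas_lt_lock and
// mcs_lt_lock classes from the earlier sketches.
class cohort_lt_lock {
public:
    explicit cohort_lt_lock(std::size_t num_cohorts) : local_(num_cohorts) {}

    void lock(std::size_t cohort, mcs_lt_lock::qnode& me) {
        local_[cohort].lock(me);  // resolve local contention in the MCS queue
        gate_.lock();             // the local winner takes the global gate
    }

    void unlock(std::size_t cohort, mcs_lt_lock::qnode& me) {
        gate_.unlock();
        local_[cohort].unlock(me);  // hand the local queue to the next waiter
    }

private:
    ttas_lt_lock gate_;               // shared TTAS front-end
    std::vector<mcs_lt_lock> local_;  // one MCS queue per cohort
};
```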

Practical Implications

  • Library developers: When exposing synchronization primitives in a coroutine‑friendly API, prefer the cohort lock or provide both TTAS‑LT and MCS‑LT as options, letting users pick based on expected contention.
  • Systems engineers: Existing lock‑heavy codebases can be ported to async frameworks with minimal changes by swapping std::mutex for the provided lt::mutex (lightweight‑thread mutex); a sketch of this swap follows the list.
  • Performance tuning: For latency‑critical services (e.g., high‑frequency trading gateways) that use short critical sections, a tuned TTAS‑LT may still be optimal; otherwise, the cohort lock is a safe default.
  • Cross‑library portability: Because the cohort lock abstracts away the underlying yielding/sleeping mechanism, the same binary can run on Boost.Asio, libco, or native C++20 coroutines without recompilation.
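
A minimal sketch of that porting path, assuming lt::mutex exposes the same BasicLockable interface (lock()/unlock()) as std::mutex. The paper names lt::mutex, but its exact API is our assumption; here it is aliased to the ttas_lt_lock sketch above purely as a stand‑in.

```cpp
#include <mutex>

// Assumed: lt::mutex is BasicLockable, so existing RAII guards keep working
// after the type swap. The alias below is a placeholder, not the paper's
// actual library.
namespace lt { using mutex = ttas_lt_lock; }

lt::mutex m;        // was: std::mutex m;
long counter = 0;

void increment() {
    std::lock_guard<lt::mutex> guard(m);  // std::lock_guard needs only lock()/unlock()
    ++counter;
}
```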

Limitations & Future Work

  • Memory overhead: MCS‑based variants allocate a per‑coroutine queue node, which can be non‑trivial in memory‑constrained embedded environments.
  • Scheduler dependence: The effectiveness of yield() and sleep() hooks assumes a reasonably fair underlying coroutine scheduler; pathological schedulers could still cause starvation.
  • Broader primitives: The study focuses solely on mutexes; extending the analysis to read‑write locks, condition variables, or lock‑free structures remains open.
  • Language support: While the experiments target C++, the authors note that adapting the cohort lock to other coroutine‑enabled languages (Rust, Go, JavaScript) warrants further investigation.

Bottom line: As lightweight threads become mainstream in high‑performance servers and low‑latency applications, rethinking classic lock algorithms is essential. This paper provides both the theory and ready‑to‑use implementations that let developers harness the speed of coroutines without falling into deadlocks or unpredictable slowdowns.

Authors

  • Taras Skazhenik
  • Nikolai Korobenikov
  • Andrei Churbanov
  • Anton Malakhov
  • Vitaly Aksenov

Paper Information

  • arXiv ID: 2512.08563v1
  • Categories: cs.DC
  • Published: December 9, 2025
  • PDF: https://arxiv.org/pdf/2512.08563v1
