[Paper] Sharded Elimination and Combining for Highly-Efficient Concurrent Stacks
Source: arXiv - 2601.04523v1
Overview
The paper introduces Sharded Elimination and Combining, a novel blocking, linearizable stack that blends sharding with a clever elimination‑and‑combining scheme. By reducing contention on the shared data structure, the authors achieve up to 2× speed‑up over the best‑known concurrent stacks, especially when many threads contend for the stack simultaneously.
Key Contributions
- Sharded Stack Layout – the stack is partitioned into independent “shards” that can be accessed in parallel, lowering hotspot contention.
- Elimination Mechanism – push and pop operations that meet each other directly bypass the central stack, canceling each other out without touching shared memory.
- Combining Technique – a lightweight coordinator batches pending operations from multiple threads, applying them to a shard in a single atomic step.
- Hybrid Design – the elimination and combining components are seamlessly integrated, allowing the system to adapt dynamically to workload characteristics.
- Empirical Evaluation – extensive benchmarks on multi‑core machines (up to 128 threads) show consistent 1.5‑2× performance gains over state‑of‑the‑art concurrent stacks (e.g., Treiber, Elimination‑Backoff, Flat‑Combining).
Methodology
- Sharding – The stack is split into k sub‑stacks (shards). Each thread first hashes to a shard, so most operations stay local.
- Elimination Array – Threads publish their intent (push or pop) in a fixed‑size elimination array. If a complementary operation is found, the two threads exchange values directly and both return, avoiding any memory write to a shard (a minimal sketch of one elimination slot appears below).
- Combining Scheduler – When elimination fails, a thread may become a combiner for its shard. It collects pending requests from other threads (via per‑thread request slots) and performs a batch update using a single fetch_and_increment on the shard's top pointer (see the combining sketch below).
- Fallback Path – If both elimination and combining are unavailable (e.g., the elimination array is full or there are no pending requests), the thread falls back to a traditional lock‑based push/pop on its shard.
- Correctness – The authors prove linearizability by defining a linearization point either at the successful elimination exchange or at the atomic update performed by the combiner.
The design is deliberately simple: it relies only on atomic fetch_and_increment and compare_and_swap primitives that are universally available on modern CPUs.
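To make the elimination step concrete, here is a minimal C++ sketch of a single slot of an elimination array. The state machine, the `try_eliminate_push`/`try_eliminate_pop` names, and the timeout handling are illustrative assumptions rather than the paper's code; in the actual design each thread would first hash to a shard and consult that shard's elimination slots before touching the shard itself.

```cpp
// Minimal sketch of one elimination slot (assumed layout, not the paper's code).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <optional>
#include <thread>

class EliminationSlot {
    // EMPTY: free slot; SETUP: a pusher is installing its value;
    // WAITING: value published; CLAIMED: a popper is taking it; DONE: taken.
    enum State : uint32_t { EMPTY, SETUP, WAITING, CLAIMED, DONE };
    std::atomic<State> state_{EMPTY};
    int value_ = 0;  // written only while the slot is privately owned

public:
    // A push parks its value and waits briefly for a matching pop.
    // Returns true if a pop took the value, so the shard is never touched.
    bool try_eliminate_push(int v, std::chrono::microseconds timeout) {
        State s = EMPTY;
        if (!state_.compare_exchange_strong(s, SETUP)) return false;  // slot busy
        value_ = v;                                                    // private write
        state_.store(WAITING, std::memory_order_release);              // publish value
        auto deadline = std::chrono::steady_clock::now() + timeout;
        while (std::chrono::steady_clock::now() < deadline) {
            if (state_.load(std::memory_order_acquire) == DONE) {
                state_.store(EMPTY, std::memory_order_release);
                return true;                                           // eliminated
            }
            std::this_thread::yield();
        }
        s = WAITING;
        if (state_.compare_exchange_strong(s, EMPTY)) return false;    // withdrew in time
        // A pop is taking (or has taken) the value; wait for it to finish.
        while (state_.load(std::memory_order_acquire) != DONE)
            std::this_thread::yield();
        state_.store(EMPTY, std::memory_order_release);
        return true;
    }

    // A pop that finds a waiting push takes its value directly.
    std::optional<int> try_eliminate_pop() {
        State s = WAITING;
        if (!state_.compare_exchange_strong(s, CLAIMED, std::memory_order_acquire))
            return std::nullopt;                                        // nothing to take
        int v = value_;                                                 // safe: slot is claimed
        state_.store(DONE, std::memory_order_release);
        return v;
    }
};
```

A push that returns true has exchanged its value with a concurrent pop without writing to any shard; on timeout it withdraws and would proceed to the combining or fallback path.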
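The combining path can be sketched along similar lines. Below, a combiner drains per‑thread request slots and applies a batch of pushes to an array‑backed shard with one fetch_add on the shard's top index. The array‑backed shard, the fixed request table, and the restriction to pushes are simplifying assumptions for illustration; the paper's combiner also serves pops and uses its own announcement protocol.

```cpp
// Minimal sketch of the combining path for one shard (assumed data layout).
#include <array>
#include <atomic>
#include <mutex>
#include <utility>
#include <vector>

constexpr int kMaxThreads = 128;   // illustrative bound on participating threads
constexpr int kShardCap   = 4096;  // illustrative shard capacity

struct Shard {
    std::array<int, kShardCap> items{};
    std::atomic<int> top{0};        // index of the next free slot
    std::mutex lock;                // combiner exclusion / lock-based fallback
};

// One pending push request per thread: the value plus a "pending" flag.
struct Request {
    std::atomic<bool> pending{false};
    int value = 0;
};

std::array<Request, kMaxThreads> requests;

// The combiner collects all announced pushes and reserves space for the whole
// batch with a single fetch_add on the shard's top pointer.
void combine_pushes(Shard& shard) {
    std::lock_guard<std::mutex> guard(shard.lock);   // one combiner per shard
    std::vector<std::pair<int, int>> batch;          // (thread id, value)
    for (int t = 0; t < kMaxThreads; ++t)
        if (requests[t].pending.load(std::memory_order_acquire))
            batch.emplace_back(t, requests[t].value);
    if (batch.empty()) return;

    int base = shard.top.fetch_add(static_cast<int>(batch.size()),
                                   std::memory_order_acq_rel);
    for (std::size_t i = 0; i < batch.size(); ++i) {
        shard.items[base + i] = batch[i].second;
        // Signal the owning thread that its push has been applied.
        requests[batch[i].first].pending.store(false, std::memory_order_release);
    }
}
```

In this sketch the shard's mutex doubles as the fallback lock: a thread whose elimination attempt times out either becomes the combiner by acquiring it, or announces its request and waits for its pending flag to clear.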
Results & Findings
| Workload | Threads | Speed‑up vs. Best Prior Stack |
|---|---|---|
| Uniform push/pop | 32 | 1.8× |
| Push‑heavy (90% pushes) | 64 | 1.6× |
| Pop‑heavy (90% pops) | 128 | 2.0× |
| High contention (single hot key) | 64 | 1.9× |
- Scalability: Performance scales almost linearly up to the number of physical cores; beyond that point, the sharded design with combining still outperforms monolithic stacks.
- Contention Reduction: The elimination path handles up to ~40 % of operations without touching any shard, dramatically cutting cache‑line traffic.
- Overhead: The extra bookkeeping (elimination array, request slots) adds < 5 ns per operation in the worst case, negligible compared to the gains.
Practical Implications
- High‑Throughput Servers – Applications that use work‑stealing queues or task stacks (e.g., web servers, async runtimes) can replace their lock‑based stacks with this design to double throughput under load.
- Parallel Algorithms – Depth‑first search, backtracking, or any algorithm that relies on a shared stack can benefit from reduced synchronization bottlenecks, especially on many‑core machines.
- Language Runtime Implementations – VM developers (e.g., for Java, Go, Rust) can adopt the sharded elimination‑combining pattern for internal data structures such as coroutine stacks or garbage‑collector worklists.
- Ease of Integration – The algorithm uses only standard atomic primitives; it can be implemented in C/C++, Rust, or Java’s java.util.concurrent.atomic package without special hardware support (a short illustration of these primitives follows).
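For reference, here are the two primitives in question as they appear in standard C++ `<atomic>`; Rust's `std::sync::atomic` and Java's `java.util.concurrent.atomic` expose equivalents. The function names are illustrative.

```cpp
#include <atomic>

std::atomic<long> top{0};  // e.g., a shard's top pointer

// fetch_and_increment: atomically reserve the next slot index.
long bump_top() {
    return top.fetch_add(1, std::memory_order_acq_rel);
}

// compare_and_swap: install a new top only if it still holds the value we read.
bool publish_top(long expected, long desired) {
    return top.compare_exchange_strong(expected, desired,
                                       std::memory_order_acq_rel);
}
```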
Limitations & Future Work
- Memory Overhead – Maintaining multiple shards, an elimination array, and per‑thread request slots increases the memory footprint compared to a single‑linked list stack.
- Static Sharding – The number of shards is fixed at initialization; dynamic workloads that change contention patterns might benefit from adaptive shard resizing.
- Blocking Nature – The implementation is blocking (uses locks for the fallback path), which may be unsuitable for real‑time or lock‑free‑only environments.
- Future Directions – The authors suggest exploring lock‑free combining, adaptive elimination array sizing, and applying the sharding‑elimination concept to other LIFO‑like structures (e.g., priority queues).
Authors
- Ajay Singh
- Nikos Metaxakis
- Panagiota Fatourou
Paper Information
- arXiv ID: 2601.04523v1
- Categories: cs.DC, cs.PL
- Published: January 8, 2026