[Paper] Sharded Elimination and Combining for Highly-Efficient Concurrent Stacks

Published: January 7, 2026 at 09:42 PM EST
4 min read
Source: arXiv - 2601.04523v1

Overview

The paper introduces Sharded Elimination and Combining, a novel blocking, linearizable stack that blends sharding with a clever elimination‑and‑combining scheme. By reducing contention on the shared data structure, the authors achieve up to 2× speed‑up over the best‑known concurrent stacks, especially when many threads contend for the stack simultaneously.

Key Contributions

  • Sharded Stack Layout – the stack is partitioned into independent “shards” that can be accessed in parallel, lowering hotspot contention (a rough layout sketch follows this list).
  • Elimination Mechanism – push and pop operations that meet each other directly bypass the central stack, canceling each other out without touching shared memory.
  • Combining Technique – a lightweight coordinator batches pending operations from multiple threads, applying them to a shard in a single atomic step.
  • Hybrid Design – the elimination and combining components are seamlessly integrated, allowing the system to adapt dynamically to workload characteristics.
  • Empirical Evaluation – extensive benchmarks on multi‑core machines (up to 128 threads) show consistent 1.5‑2× performance gains over state‑of‑the‑art concurrent stacks (e.g., Treiber, Elimination‑Backoff, Flat‑Combining).
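
To make the layout concrete, here is a minimal C++ sketch of the data structures these contributions imply: k independent shards, a fixed-size elimination array, and per-thread request slots for combining. All names, sizes, and field choices are illustrative assumptions, not the paper’s actual code.

```cpp
// Hypothetical data layout for a sharded elimination/combining stack.
// Sizes, names, and the array-backed shard storage are assumptions.
#include <atomic>
#include <array>
#include <cstdint>
#include <mutex>

constexpr int kShards     = 8;    // number of independent sub-stacks (k)
constexpr int kElimSlots  = 16;   // fixed-size elimination array
constexpr int kMaxThreads = 128;  // one request slot per thread

struct ElimSlot {                 // where a push and a pop can meet
    std::atomic<int>     state{0};   // e.g., EMPTY / WAITING / TAKEN
    std::atomic<int64_t> value{0};
};

struct Request {                  // published by a thread, applied by a combiner
    std::atomic<int>     op{0};      // 0 = none, 1 = push, 2 = pop
    std::atomic<int64_t> value{0};
    std::atomic<bool>    done{false};
};

struct Shard {
    std::mutex           lock;       // lock-based fallback path
    std::atomic<size_t>  top{0};     // updated with fetch_add / CAS
    std::array<int64_t, 4096> items{};  // array-backed storage (assumption)
};

struct ShardedStack {             // heap-allocate: this object is large
    std::array<Shard, kShards>       shards;
    std::array<ElimSlot, kElimSlots> elim;
    std::array<Request, kMaxThreads> requests;
};
```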

Methodology

  1. Sharding – The stack is split into k sub‑stacks (shards). Each thread first hashes to a shard, so most operations stay local.
  2. Elimination Array – Threads publish their intent (push or pop) in a fixed‑size elimination array. If a complementary operation is found, the two threads exchange values directly and both return, avoiding any memory write to a shard (see the handshake sketch after this list).
  3. Combining Scheduler – When elimination fails, a thread may become a combiner for its shard. It collects pending requests from other threads (via per‑thread request slots) and performs a batch update using a single fetch_and_increment on the shard’s top pointer (a combining sketch follows the note on atomic primitives below).
  4. Fallback Path – If both elimination and combining are unavailable (e.g., array full, no pending requests), the thread falls back to a traditional lock‑based push/pop on its shard.
  5. Correctness – The authors prove linearizability by defining a linearization point either at the successful elimination exchange or at the atomic update performed by the combiner.
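
As a concrete illustration of step 2, here is a minimal C++ sketch of one possible elimination handshake: a push publishes its value in a randomly chosen slot, and a pop that finds a waiting push claims the value and acknowledges. The state machine, slot count, and timeout are assumptions for illustration; the paper’s actual protocol may differ.

```cpp
// Minimal sketch of an elimination handshake: a matched push/pop pair
// exchanges a value through a slot and never touches any shard.
// State names, slot count, and the 5-microsecond timeout are assumptions.
#include <atomic>
#include <array>
#include <chrono>
#include <optional>
#include <random>

enum : int { EMPTY, BUSY, WAITING, TAKEN, ACKED };

struct ElimSlot {
    std::atomic<int>  state{EMPTY};
    std::atomic<long> value{0};
};

constexpr int kElimSlots = 16;
std::array<ElimSlot, kElimSlots> elim_array;

static ElimSlot& random_slot() {
    thread_local std::mt19937 rng{std::random_device{}()};
    return elim_array[rng() % kElimSlots];
}

// Push side: publish a value and wait briefly for a matching pop.
bool try_eliminate_push(long v) {
    ElimSlot& s = random_slot();
    int expected = EMPTY;
    if (!s.state.compare_exchange_strong(expected, BUSY)) return false;
    s.value.store(v);
    s.state.store(WAITING);                       // offer is now visible

    auto deadline = std::chrono::steady_clock::now() + std::chrono::microseconds(5);
    while (std::chrono::steady_clock::now() < deadline) {
        if (s.state.load() == ACKED) { s.state.store(EMPTY); return true; }
    }
    expected = WAITING;
    if (s.state.compare_exchange_strong(expected, EMPTY)) return false;  // withdrew
    while (s.state.load() != ACKED) { /* a pop is mid-exchange; wait briefly */ }
    s.state.store(EMPTY);
    return true;
}

// Pop side: claim a waiting push, take its value, and acknowledge.
std::optional<long> try_eliminate_pop() {
    ElimSlot& s = random_slot();
    int expected = WAITING;
    if (!s.state.compare_exchange_strong(expected, TAKEN)) return std::nullopt;
    long v = s.value.load();
    s.state.store(ACKED);
    return v;
}
```

The key property the sketch illustrates is that a matched push/pop pair completes entirely inside the slot, which is what keeps eliminated operations off the shard and out of the hot cache lines.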

The design is deliberately simple: it relies only on atomic fetch_and_increment and compare_and_swap primitives that are universally available on modern CPUs.
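
Step 3 can be illustrated with a sketch of a push-only combiner: the thread that wins the per-shard combiner role applies every pending push in one pass, reserving all the needed slots on the shard’s top counter with a single fetch_add. The array-backed shard, request-slot layout, and try-lock election are assumptions, and mixed push/pop batches and overflow checks are omitted for brevity.

```cpp
// Minimal sketch of a push-only combining pass over per-thread request slots.
// The shard layout and the try-lock combiner election are assumptions.
#include <array>
#include <atomic>
#include <cstddef>
#include <mutex>

constexpr int kMaxThreads = 128;

struct Request {
    std::atomic<bool> pending{false};  // set by the requesting thread
    std::atomic<long> value{0};        // value to push
    std::atomic<bool> done{false};     // set by the combiner once applied
};

struct Shard {
    std::mutex combiner_lock;                    // one combiner per shard
    std::atomic<size_t> top{0};                  // next free slot in `items`
    std::array<long, 4096> items{};              // array-backed storage
    std::array<Request, kMaxThreads> requests;   // per-thread request slots
};

// The thread that wins the combiner role applies all pending pushes,
// reserving the whole range with a single fetch_add on the hot counter.
bool try_combine_pushes(Shard& sh) {
    std::unique_lock<std::mutex> lk(sh.combiner_lock, std::try_to_lock);
    if (!lk.owns_lock()) return false;           // someone else is combining

    // 1. Snapshot which threads currently have a pending push.
    int batch[kMaxThreads];
    int n = 0;
    for (int t = 0; t < kMaxThreads; ++t)
        if (sh.requests[t].pending.load()) batch[n++] = t;
    if (n == 0) return false;

    // 2. Reserve n slots at once: the only update to the shared counter.
    size_t base = sh.top.fetch_add(n);

    // 3. Fill the reserved slots and signal each waiting requester.
    for (int i = 0; i < n; ++i) {
        Request& r = sh.requests[batch[i]];
        sh.items[base + i] = r.value.load();
        r.pending.store(false);
        r.done.store(true);                      // requester spins on `done`
    }
    return true;
}
```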

Results & Findings

| Workload | Threads | Speed‑up vs. Best Prior Stack |
|---|---|---|
| Uniform push/pop | 32 | 1.8× |
| Push‑heavy (90% pushes) | 64 | 1.6× |
| Pop‑heavy (90% pops) | 128 | 2.0× |
| High contention (single hot key) | 64 | 1.9× |
  • Scalability: Performance scales almost linearly up to the number of physical cores; beyond that, the sharding + combining still outperforms monolithic designs.
  • Contention Reduction: The elimination path handles up to ~40 % of operations without touching any shard, dramatically cutting cache‑line traffic.
  • Overhead: The extra bookkeeping (elimination array, request slots) adds < 5 ns per operation in the worst case, negligible compared to the gains.

Practical Implications

  • High‑Throughput Servers – Applications that use work‑stealing queues or task stacks (e.g., web servers, async runtimes) can replace their lock‑based stacks with this design and, per the paper’s benchmarks, see up to 2× higher throughput under load.
  • Parallel Algorithms – Depth‑first search, backtracking, or any algorithm that relies on a shared stack can benefit from reduced synchronization bottlenecks, especially on many‑core machines.
  • Language Runtime Implementations – VM developers (e.g., for Java, Go, Rust) can adopt the sharded elimination‑combining pattern for internal data structures such as coroutine stacks or garbage‑collector worklists.
  • Ease of Integration – The algorithm uses only standard atomic primitives; it can be implemented in C/C++, Rust, or Java’s java.util.concurrent.atomic package without special hardware support (a rough integration sketch follows).
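
As a rough picture of what integration looks like, the sketch below wires the three paths together behind an ordinary push() call: elimination first, then combining, then the per-shard lock fallback. The helper names mirror the earlier sketches and are stubbed out here; none of this is the paper’s actual API.

```cpp
// Hypothetical integration sketch: elimination, then combining, then the
// lock-based fallback. Helper names and shard hashing are assumptions.
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Stubs standing in for the fast paths sketched in the Methodology section;
// a real implementation would attempt those first.
bool try_eliminate_push(long) { return false; }
bool try_combine_push(long)   { return false; }

struct Shard {
    std::mutex lock;
    std::vector<long> items;   // lock-protected fallback storage
};

constexpr int kShards = 8;
Shard shards[kShards];

// Each thread hashes to a "home" shard so most traffic stays local.
static Shard& home_shard() {
    size_t h = std::hash<std::thread::id>{}(std::this_thread::get_id());
    return shards[h % kShards];
}

void push(long v) {
    if (try_eliminate_push(v)) return;  // matched with a concurrent pop
    if (try_combine_push(v))   return;  // applied in a combiner's batch
    Shard& sh = home_shard();           // traditional lock-based fallback
    std::lock_guard<std::mutex> g(sh.lock);
    sh.items.push_back(v);
}
```

Callers see the same push/pop interface as a lock-based stack, which is what makes the swap low-risk.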

Limitations & Future Work

  • Memory Overhead – Maintaining multiple shards, an elimination array, and per‑thread request slots increases the memory footprint compared to a single‑linked list stack.
  • Static Sharding – The number of shards is fixed at initialization; dynamic workloads that change contention patterns might benefit from adaptive shard resizing.
  • Blocking Nature – The implementation is blocking (uses locks for the fallback path), which may be unsuitable for real‑time or lock‑free‑only environments.
  • Future Directions – The authors suggest exploring lock‑free combining, adaptive elimination array sizing, and applying the sharding‑elimination concept to other LIFO‑like structures (e.g., priority queues).

Authors

  • Ajay Singh
  • Nikos Metaxakis
  • Panagiota Fatourou

Paper Information

  • arXiv ID: 2601.04523v1
  • Categories: cs.DC, cs.PL
  • Published: January 8, 2026