[Paper] Sharded Elimination and Combining for Highly-Efficient Concurrent Stacks
Source: arXiv - 2601.04523v1
Overview
The paper introduces Sharded Elimination and Combining, a novel blocking, linearizable stack that blends sharding with a clever elimination‑and‑combining scheme. By reducing contention on the shared data structure, the authors achieve up to 2× speed‑up over the best‑known concurrent stacks, especially when many threads contend for the stack simultaneously.
Key Contributions
- Sharded Stack Layout – the stack is partitioned into independent “shards” that can be accessed in parallel, lowering hotspot contention.
- Elimination Mechanism – push and pop operations that meet each other directly bypass the central stack, canceling each other out without touching shared memory.
- Combining Technique – a lightweight coordinator batches pending operations from multiple threads, applying them to a shard in a single atomic step.
- Hybrid Design – the elimination and combining components are seamlessly integrated, allowing the system to adapt dynamically to workload characteristics.
- Empirical Evaluation – extensive benchmarks on multi‑core machines (up to 128 threads) show consistent 1.5‑2× performance gains over state‑of‑the‑art concurrent stacks (e.g., Treiber, Elimination‑Backoff, Flat‑Combining).
Methodology
- Sharding – The stack is split into k sub‑stacks (shards). Each thread first hashes to a shard, so most operations stay local.
- Elimination Array – Threads publish their intent (push or pop) in a fixed‑size elimination array. If a complementary operation is found, the two threads exchange values directly and both return, avoiding any memory write to a shard (a minimal sketch of one elimination slot appears below).
- Combining Scheduler – When elimination fails, a thread may become a combiner for its shard. It collects pending requests from other threads (via per‑thread request slots) and performs a batch update using a single fetch_and_increment on the shard's top pointer (see the combining sketch below).
- Fallback Path – If both elimination and combining are unavailable (e.g., the elimination array is full or there are no pending requests), the thread falls back to a traditional lock‑based push/pop on its shard.
- Correctness – The authors prove linearizability by defining a linearization point either at the successful elimination exchange or at the atomic update performed by the combiner.
The design is deliberately simple: it relies only on atomic fetch_and_increment and compare_and_swap primitives that are universally available on modern CPUs.
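To make the elimination step concrete, here is a minimal C++ sketch of a single slot of an elimination array. The state machine, the `try_eliminate_push`/`try_eliminate_pop` names, and the timeout handling are illustrative assumptions rather than the paper's code; in the actual design each thread would first hash to a shard and consult that shard's elimination slots before touching the shard itself.

```cpp
// Minimal sketch of one elimination slot (assumed layout, not the paper's code).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <optional>
#include <thread>

class EliminationSlot {
    // EMPTY: free slot; SETUP: a pusher is installing its value;
    // WAITING: value published; CLAIMED: a popper is taking it; DONE: taken.
    enum State : uint32_t { EMPTY, SETUP, WAITING, CLAIMED, DONE };
    std::atomic<State> state_{EMPTY};
    int value_ = 0;  // written only while the slot is privately owned

public:
    // A push parks its value and waits briefly for a matching pop.
    // Returns true if a pop took the value, so the shard is never touched.
    bool try_eliminate_push(int v, std::chrono::microseconds timeout) {
        State s = EMPTY;
        if (!state_.compare_exchange_strong(s, SETUP)) return false;  // slot busy
        value_ = v;                                                    // private write
        state_.store(WAITING, std::memory_order_release);              // publish value
        auto deadline = std::chrono::steady_clock::now() + timeout;
        while (std::chrono::steady_clock::now() < deadline) {
            if (state_.load(std::memory_order_acquire) == DONE) {
                state_.store(EMPTY, std::memory_order_release);
                return true;                                           // eliminated
            }
            std::this_thread::yield();
        }
        s = WAITING;
        if (state_.compare_exchange_strong(s, EMPTY)) return false;    // withdrew in time
        // A pop is taking (or has taken) the value; wait for it to finish.
        while (state_.load(std::memory_order_acquire) != DONE)
            std::this_thread::yield();
        state_.store(EMPTY, std::memory_order_release);
        return true;
    }

    // A pop that finds a waiting push takes its value directly.
    std::optional<int> try_eliminate_pop() {
        State s = WAITING;
        if (!state_.compare_exchange_strong(s, CLAIMED, std::memory_order_acquire))
            return std::nullopt;                                        // nothing to take
        int v = value_;                                                 // safe: slot is claimed
        state_.store(DONE, std::memory_order_release);
        return v;
    }
};
```

A push that returns true has exchanged its value with a concurrent pop without writing to any shard; on timeout it withdraws and would proceed to the combining or fallback path.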
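The combining path can be sketched along similar lines. Below, a combiner drains per‑thread request slots and applies a batch of pushes to an array‑backed shard with one fetch_add on the shard's top index. The array‑backed shard, the fixed request table, and the restriction to pushes are simplifying assumptions for illustration; the paper's combiner also serves pops and uses its own announcement protocol.

```cpp
// Minimal sketch of the combining path for one shard (assumed data layout).
#include <array>
#include <atomic>
#include <mutex>
#include <utility>
#include <vector>

constexpr int kMaxThreads = 128;   // illustrative bound on participating threads
constexpr int kShardCap   = 4096;  // illustrative shard capacity

struct Shard {
    std::array<int, kShardCap> items{};
    std::atomic<int> top{0};        // index of the next free slot
    std::mutex lock;                // combiner exclusion / lock-based fallback
};

// One pending push request per thread: the value plus a "pending" flag.
struct Request {
    std::atomic<bool> pending{false};
    int value = 0;
};

std::array<Request, kMaxThreads> requests;

// The combiner collects all announced pushes and reserves space for the whole
// batch with a single fetch_add on the shard's top pointer.
void combine_pushes(Shard& shard) {
    std::lock_guard<std::mutex> guard(shard.lock);   // one combiner per shard
    std::vector<std::pair<int, int>> batch;          // (thread id, value)
    for (int t = 0; t < kMaxThreads; ++t)
        if (requests[t].pending.load(std::memory_order_acquire))
            batch.emplace_back(t, requests[t].value);
    if (batch.empty()) return;

    int base = shard.top.fetch_add(static_cast<int>(batch.size()),
                                   std::memory_order_acq_rel);
    for (std::size_t i = 0; i < batch.size(); ++i) {
        shard.items[base + i] = batch[i].second;
        // Signal the owning thread that its push has been applied.
        requests[batch[i].first].pending.store(false, std::memory_order_release);
    }
}
```

In this sketch the shard's mutex doubles as the fallback lock: a thread whose elimination attempt times out either becomes the combiner by acquiring it, or announces its request and waits for its pending flag to clear.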
Results & Findings
| Workload | Threads | Speed‑up vs. Best Prior Stack |
|---|---|---|
| Uniform push/pop | 32 | 1.8× |
| Push‑heavy (90% pushes) | 64 | 1.6× |
| Pop‑heavy (90% pops) | 128 | 2.0× |
| High contention (single hot key) | 64 | 1.9× |
- Scalability: Performance scales almost linearly up to the number of physical cores; beyond that point, the sharded design with combining still outperforms monolithic stacks.
- Contention Reduction: The elimination path handles up to ~40 % of operations without touching any shard, dramatically cutting cache‑line traffic.
- Overhead: The extra bookkeeping (elimination array, request slots) adds < 5 ns per operation in the worst case, negligible compared to the gains.
Practical Implications
- High‑Throughput Servers – Applications that use work‑stealing queues or task stacks (e.g., web servers, async runtimes) can replace their lock‑based stacks with this design to double throughput under load.
- Parallel Algorithms – Depth‑first search, backtracking, or any algorithm that relies on a shared stack can benefit from reduced synchronization bottlenecks, especially on many‑core machines.
- Language Runtime Implementations – VM developers (e.g., for Java, Go, Rust) can adopt the sharded elimination‑combining pattern for internal data structures such as coroutine stacks or garbage‑collector worklists.
- Ease of Integration – The algorithm uses only standard atomic primitives; it can be implemented in C/C++, Rust, or Java’s java.util.concurrent.atomic package without special hardware support (a short illustration of these primitives follows).
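For reference, here are the two primitives in question as they appear in standard C++ `<atomic>`; Rust's `std::sync::atomic` and Java's `java.util.concurrent.atomic` expose equivalents. The function names are illustrative.

```cpp
#include <atomic>

std::atomic<long> top{0};  // e.g., a shard's top pointer

// fetch_and_increment: atomically reserve the next slot index.
long bump_top() {
    return top.fetch_add(1, std::memory_order_acq_rel);
}

// compare_and_swap: install a new top only if it still holds the value we read.
bool publish_top(long expected, long desired) {
    return top.compare_exchange_strong(expected, desired,
                                       std::memory_order_acq_rel);
}
```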
Limitations & Future Work
- Memory Overhead – Maintaining multiple shards, an elimination array, and per‑thread request slots increases the memory footprint compared to a single‑linked list stack.
- Static Sharding – The number of shards is fixed at initialization; dynamic workloads that change contention patterns might benefit from adaptive shard resizing.
- Blocking Nature – The implementation is blocking (uses locks for the fallback path), which may be unsuitable for real‑time or lock‑free‑only environments.
- Future Directions – The authors suggest exploring lock‑free combining, adaptive elimination array sizing, and applying the sharding‑elimination concept to other LIFO‑like structures (e.g., priority queues).
Authors
- Ajay Singh
- Nikos Metaxakis
- Panagiota Fatourou
Paper Information
- arXiv ID: 2601.04523v1
- Categories: cs.DC, cs.PL
- Published: January 8, 2026