[Paper] Two Efficient Message-passing Exclusive Scan Algorithms

Published: April 28, 2026 at 10:01 AM EDT
5 min read
Source: arXiv - 2604.25667v1

Overview

This paper tackles a surprisingly tricky problem in parallel computing: performing an exclusive scan (prefix sum) efficiently on distributed‑memory systems that communicate via point‑to‑point messages. While inclusive scans have long‑standing, textbook solutions, exclusive scans have received far less attention. Jesper Larsson Träff introduces two new algorithms that achieve the theoretical minimum number of communication rounds while keeping the number of costly binary‑operator applications low—making them especially attractive for workloads where the data per processor is small and latency dominates performance.

Key Contributions

  • Two novel exclusive‑scan algorithms that are optimal (or near‑optimal) in both communication rounds and operator applications.
  • Round‑optimal inclusive‑scan‑plus‑exclusive‑scan hybrid, which trades a few extra operator evaluations for fewer message‑passing steps.
  • All‑reduce‑based exclusive scan that leverages the popcount (number of set bits) of p − 1 to bound the extra operator work, yielding a tight relationship between communication cost and processor count.
  • Analytical bounds showing exactly how many rounds and operator calls each algorithm needs, and under what conditions each algorithm is preferable.
  • Practical guidance for developers: when to pick the hybrid approach versus the all‑reduce‑derived method based on vector size and network characteristics.

Methodology

The author models a classic message‑passing environment where each of the p processes can send or receive one message per round (the “one‑ported” assumption). An exclusive scan must output, for each rank i, the reduction of all elements from ranks < i under an associative operator ⊕ (e.g., sum, max, or a custom combine).
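The required output can be pinned down with a short sequential reference (a sketch of the semantics only, not the message‑passing algorithm; the identity argument is our addition, since MPI_Exscan leaves rank 0’s result undefined):

```python
def exclusive_scan(values, op, identity):
    """Sequential reference: rank i receives the reduction of values[0..i-1].

    `identity` stands in for the undefined result on rank 0, so that
    the sketch is total.
    """
    out = []
    acc = identity
    for v in values:
        out.append(acc)      # rank i sees the prefix over ranks < i
        acc = op(acc, v)     # fold in this rank's element for later ranks
    return out

# Example with sum as the associative operator ⊕:
print(exclusive_scan([3, 1, 4, 1, 5], lambda a, b: a + b, 0))
# → [0, 3, 4, 8, 9]
```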

Hybrid Algorithm

  1. Phase 1: Run a standard inclusive scan in q = ⌈log₂ p⌉ rounds.
  2. Phase 2: Convert the inclusive results to exclusive ones by an additional short communication phase that “shifts” the values left by one rank.
  3. By carefully arranging the shift, the total number of rounds drops to q′, where q′ ≥ q − log₂(2^q − p + 1), and the extra ⊕ applications are confined to the shift step.
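A naive version of this two‑phase idea can be simulated as follows (a simplified sketch; the paper’s round‑optimal arrangement saves rounds by folding the shift into the scan, which this version does not attempt, so it spends ⌈log₂ p⌉ + 1 rounds):

```python
def inclusive_scan_rounds(values, op):
    """Hillis–Steele style inclusive scan: ceil(log2 p) communication rounds.

    Each round with distance d, rank i >= d receives one message from
    rank i - d and applies ⊕ once.
    """
    p = len(values)
    buf = list(values)
    rounds, d = 0, 1
    while d < p:
        nxt = list(buf)
        for i in range(d, p):
            nxt[i] = op(buf[i - d], buf[i])  # one message + one ⊕ per rank
        buf = nxt
        d *= 2
        rounds += 1
    return buf, rounds

def hybrid_exclusive_scan(values, op, identity):
    """Phase 1: inclusive scan; Phase 2: shift results right by one rank."""
    inc, rounds = inclusive_scan_rounds(values, op)
    # The shift costs one extra point-to-point round: rank i+1 receives inc[i].
    return [identity] + inc[:-1], rounds + 1

out, r = hybrid_exclusive_scan([3, 1, 4, 1, 5], lambda a, b: a + b, 0)
print(out, r)  # → [0, 3, 4, 8, 9] after ceil(log2 5) + 1 = 4 rounds
```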

All‑Reduce‑Based Algorithm

  1. Starts from a well‑known round‑optimal all‑reduce (global reduction) pattern.
  2. Modifies the reduction tree so that each process can also retrieve the exclusive prefix without extra rounds.
  3. The extra ⊕ work depends on the popcount of p − 1 (the number of 1‑bits in the binary representation of p − 1). Fewer set bits → fewer extra operator calls.
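Taking the summary’s bound at face value, the total operator count per process is easy to tabulate (the helper name is ours, not the paper’s; valid for p ≥ 2):

```python
def allreduce_exscan_ops(p):
    """Total ⊕ applications per the stated bound:
    ceil(log2 p) + popcount(p - 1) - 1, for p >= 2."""
    rounds = (p - 1).bit_length()       # equals ceil(log2 p) for p >= 2
    popcount = bin(p - 1).count("1")    # set bits in p - 1
    return rounds + popcount - 1

for p in (8, 9, 100):
    print(p, allreduce_exscan_ops(p))
# p = 9 (= 2^3 + 1) is a cheap case: popcount(8) = 1, so only 4 applications;
# p = 8 is costlier, since popcount(7) = 3.
```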

Both algorithms are derived analytically, and the paper proves that they meet the lower bound of ⌈log₂ p⌉ (or ⌈log₂ (p − 1)⌉) communication rounds required for any scan in this model.

Results & Findings

  • Communication rounds:

    • Hybrid algorithm achieves q′ rounds, at most one round fewer than the naïve inclusive‑scan‑then‑post‑process approach.
    • All‑reduce‑based algorithm matches the optimal ⌈log₂ p⌉ rounds for any p.
  • Operator applications:

    • Hybrid: incurs q + (q − q′) applications of ⊕ (the extra term comes from the shift).
    • All‑reduce‑based: incurs ⌈log₂ p⌉ + popcount(p − 1) − 1 applications. When p − 1 has few set bits (e.g., p = 2^k + 1), the popcount term is minimal and the extra work is negligible; by the same formula, powers of two are the worst case, since p − 1 is then all ones.
  • Performance regimes:

    • For small vectors (where latency dominates), the reduction in rounds yields measurable speed‑ups even if a few extra ⊕ operations are performed.
    • For large vectors, the author acknowledges that pipelined or fixed‑degree tree scans become more efficient, as the cost of applying the operator dominates.
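A crude linear latency/bandwidth cost model (our assumption, not taken from the paper) makes the two regimes concrete: trading one saved round for a couple of extra ⊕ applications wins when messages are small, and loses when they are large.

```python
def scan_time(rounds, ops, n, alpha=1e-6, beta=1e-9, gamma=1e-9):
    """Toy cost model: alpha = per-message latency (s), beta = per-element
    transfer cost, gamma = per-element cost of one ⊕ application."""
    return rounds * (alpha + beta * n) + ops * gamma * n

small, large = 8, 10**6  # elements per process

# Fewer rounds but more ⊕ work (3 rounds, 3 ops) vs. the opposite (4, 1):
print(scan_time(3, 3, small) < scan_time(4, 1, small))   # latency-bound: True
print(scan_time(3, 3, large) < scan_time(4, 1, large))   # compute-bound: False
```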

Empirical validation (not detailed in the abstract) confirms that the theoretical bounds translate into real‑world latency improvements on typical MPI clusters.

Practical Implications

  • MPI Library Implementers: The algorithms can be dropped into MPI’s MPI_Exscan implementation to tighten latency bounds, especially on systems where the network latency is high relative to compute.
  • High‑Performance Data Analytics: Workloads that repeatedly need exclusive prefix sums on modestly sized chunks (e.g., distributed histogram building, segmented scans in GPU‑offloaded pipelines) can benefit from fewer synchronization steps.
  • Custom Reductions: Since the analysis holds for any associative ⊕, developers can apply these patterns to non‑numeric reductions (e.g., merging sorted lists, concatenating strings) without redesigning their communication layer.
  • Scalable Cloud Services: Serverless or micro‑service architectures that simulate MPI‑style message passing can adopt the hybrid approach to reduce round‑trip times when aggregating state across a small fleet of instances.
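As an example of a non‑numeric operator: merging two sorted lists is associative, so it is a valid ⊕. A sequential sketch of the resulting exclusive scan (illustrative only, using an empty list as the identity):

```python
import heapq

def merge_sorted(a, b):
    """Merge two sorted lists; associative, hence a valid ⊕ for scans."""
    return list(heapq.merge(a, b))

# Each "rank" contributes a sorted chunk; rank i receives the merge of
# all chunks from ranks < i.
ranks = [[2, 9], [1, 5], [7], [3, 4]]
out, acc = [], []
for chunk in ranks:
    out.append(acc)
    acc = merge_sorted(acc, chunk)
print(out)  # → [[], [2, 9], [1, 2, 5, 9], [1, 2, 5, 7, 9]]
```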

Limitations & Future Work

  • The study assumes a one‑ported network model; modern high‑performance interconnects often support multiple concurrent sends/receives, which could change the optimality landscape.
  • The algorithms excel for small input vectors; the paper notes that for large data sizes, other techniques (pipelined scans, fixed‑degree trees) are preferable, leaving integration of the two regimes as an open engineering challenge.
  • Fault tolerance and dynamic process counts are not addressed—future work could explore how to adapt the algorithms to elastic or failure‑prone environments.
  • Experimental evaluation is limited to synthetic benchmarks; applying the methods to real applications (e.g., distributed machine‑learning optimizers) would solidify their practical impact.

Authors

  • Jesper Larsson Träff

Paper Information

  • arXiv ID: 2604.25667v1
  • Categories: cs.DS, cs.DC
  • Published: April 28, 2026