[Paper] Rectangular Matrix Multiplication in the Low-Bandwidth Model

Published: (June 3, 2026 at 05:22 AM EDT)
5 min read
Source: arXiv

Source: arXiv - 2606.04652v1

Overview

This paper tackles a classic problem—matrix multiplication—but from the perspective of a low‑bandwidth distributed system where each of the (n) nodes can only exchange (O(\log n)) bits per round. While most prior work looked at square matrices, the authors focus on rectangular products (\langle a,b,c\rangle) (an (a\times b) matrix times a (b\times c) matrix) and ask: how many communication rounds are needed when the matrices are split across many tiny‑bandwidth machines? Their results reveal two distinct regimes and give tight round‑complexity bounds for the most natural aspect ratios.

Key Contributions

  • Formal model for rectangular matrix multiplication in the low‑bandwidth (CONGEST‑like) distributed setting.
  • Tight Θ(d √n) bound for the (\langle n,d,n\rangle) case when (d \le \sqrt{n}), matching upper and lower proofs.
  • Improved upper bound (\tilde O(d^{2/3} n^{2/3})) for (\langle n,d,n\rangle) when (d \ge \sqrt{n}), showing a phase‑transition in complexity.
  • Generalized upper bound (\tilde O(d^{4/3})) for the (\langle d,n,d\rangle) case, extending the known (\tilde O(n^{4/3})) result for square matrices.
  • Lower‑bound techniques that adapt communication‑complexity arguments to the rectangular setting, establishing the optimality of the (Θ(d √n)) regime.

Methodology

  1. Problem decomposition – The authors split the global matrix product into smaller sub‑problems that fit naturally on the (n) nodes, respecting the input distribution (each node initially holds a (\frac{1}{n}) fraction of each matrix).
  2. Communication‑aware algorithm design – For each regime they design a protocol that carefully schedules local multiplications and bandwidth‑limited broadcasts. The key idea is to balance the amount of data each node must send (which is limited to (O(\log n)) bits per round) against the amount of work it can do locally.
  3. Use of fast rectangular matrix multiplication algorithms – The paper leverages known algebraic techniques (e.g., Strassen‑type decompositions) to reduce the number of scalar multiplications, which directly translates into fewer communication rounds in the low‑bandwidth model.
  4. Lower‑bound construction – By reducing from known hard communication problems (like set‑disjointness) and applying information‑theoretic arguments, they prove that any algorithm must spend at least (\Omega(d\sqrt{n})) rounds when (d \le \sqrt{n}).

All steps are presented with enough intuition (e.g., “think of each node as a tiny router that can only forward a few bits per tick”) to be followed by developers familiar with distributed systems but not necessarily with deep algebraic complexity theory.

Results & Findings

InstanceParameter rangeRound complexity (upper)Round complexity (lower)Gap
(\langle d,n,d\rangle)any (d \le n)(\tilde O(d^{4/3}))(\Omega(d)) (trivial)Within polylog factor of optimal
(\langle n,d,n\rangle)(d \le \sqrt{n})(\Theta(d\sqrt{n}))(\Theta(d\sqrt{n}))Tight
(\langle n,d,n\rangle)(d \ge \sqrt{n})(O(d^{2/3} n^{2/3}))(\Omega(n)) (trivial)Open gap, but shows a phase transition

Interpretation

  • When the inner dimension (d) is small (≤ √n), the bottleneck is the need to shuttle (d) rows/columns across the network, leading to a linear dependence on (d) and a √n factor from the number of nodes that must be involved.
  • Once (d) grows beyond √n, the problem behaves more like square multiplication; the authors exploit faster rectangular multiplication algorithms to shave the exponent down to (2/3) on both (d) and (n).
  • The (\langle d,n,d\rangle) case mirrors the classic square‑matrix bound ((\tilde O(n^{4/3}))) but scales with the smaller side (d), making it attractive for “tall‑skinny” or “short‑wide” data matrices.

Practical Implications

  • Distributed data‑analytics pipelines (e.g., large‑scale recommendation systems) often need to multiply a user‑feature matrix (size (n \times d)) by a feature‑item matrix (size (d \times n)). The (\langle n,d,n\rangle) results give concrete expectations for how many communication rounds are required on a cluster where each node has limited network bandwidth (e.g., edge‑computing or IoT clusters).
  • Sparse‑aware frameworks can now decide whether to reshape a rectangular multiplication into a series of smaller square multiplications (using the (\langle d,n,d\rangle) bound) or to stick with the original layout, based on the aspect ratio.
  • The phase‑transition insight helps system designers choose optimal block sizes for matrix partitioning: for (d) below √n, a block‑size proportional to (d) is optimal; above √n, a more balanced block (≈(n^{2/3} d^{1/3})) yields fewer rounds.
  • In low‑bandwidth environments (e.g., satellite constellations, sensor networks, or privacy‑preserving federated learning where each client can only send tiny messages), these bounds give a realistic performance ceiling and can guide protocol design (e.g., when to aggregate locally vs. when to broadcast).

Limitations & Future Work

  • The lower bounds for the (d \ge \sqrt{n}) regime are still trivial ((\Omega(n))), leaving a gap between the (O(d^{2/3} n^{2/3})) upper bound and the known lower bound. Tightening this gap is an open challenge.
  • The analysis assumes uniform random distribution of input entries across nodes; real‑world data often exhibits skew, which could affect both communication patterns and round complexity.
  • The work focuses on semiring multiplication (e.g., Boolean, min‑plus). Extending the results to floating‑point or modular arithmetic—common in machine‑learning workloads—may require additional techniques.
  • Finally, the model restricts each round to (O(\log n)) bits per edge. Investigating how the results change under slightly higher bandwidth (e.g., polylog or constant‑size messages) could bridge the gap between this theoretical model and practical network stacks.

Authors

  • Chetan Gupta
  • Jukka Suomela
  • Hossein Vahidi

Paper Information

  • arXiv ID: 2606.04652v1
  • Categories: cs.DC
  • Published: June 3, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »