[Paper] Rectangular Matrix Multiplication in the Low-Bandwidth Model

Published: 1 day ago (June 3, 2026 at 05:22 AM EDT)

5 min read

Source: arXiv

Source: arXiv - 2606.04652v1

Overview

This paper tackles a classic problem—matrix multiplication—but from the perspective of a low‑bandwidth distributed system where each of the (n) nodes can only exchange (O(\log n)) bits per round. While most prior work looked at square matrices, the authors focus on rectangular products (\langle a,b,c\rangle) (an (a\times b) matrix times a (b\times c) matrix) and ask: how many communication rounds are needed when the matrices are split across many tiny‑bandwidth machines? Their results reveal two distinct regimes and give tight round‑complexity bounds for the most natural aspect ratios.

Key Contributions

Formal model for rectangular matrix multiplication in the low‑bandwidth (CONGEST‑like) distributed setting.
Tight Θ(d √n) bound for the (\langle n,d,n\rangle) case when (d \le \sqrt{n}), matching upper and lower proofs.
Improved upper bound (\tilde O(d^{2/3} n^{2/3})) for (\langle n,d,n\rangle) when (d \ge \sqrt{n}), showing a phase‑transition in complexity.
Generalized upper bound (\tilde O(d^{4/3})) for the (\langle d,n,d\rangle) case, extending the known (\tilde O(n^{4/3})) result for square matrices.
Lower‑bound techniques that adapt communication‑complexity arguments to the rectangular setting, establishing the optimality of the (Θ(d √n)) regime.

Methodology

Problem decomposition – The authors split the global matrix product into smaller sub‑problems that fit naturally on the (n) nodes, respecting the input distribution (each node initially holds a (\frac{1}{n}) fraction of each matrix).
Communication‑aware algorithm design – For each regime they design a protocol that carefully schedules local multiplications and bandwidth‑limited broadcasts. The key idea is to balance the amount of data each node must send (which is limited to (O(\log n)) bits per round) against the amount of work it can do locally.
Use of fast rectangular matrix multiplication algorithms – The paper leverages known algebraic techniques (e.g., Strassen‑type decompositions) to reduce the number of scalar multiplications, which directly translates into fewer communication rounds in the low‑bandwidth model.
Lower‑bound construction – By reducing from known hard communication problems (like set‑disjointness) and applying information‑theoretic arguments, they prove that any algorithm must spend at least (\Omega(d\sqrt{n})) rounds when (d \le \sqrt{n}).

All steps are presented with enough intuition (e.g., “think of each node as a tiny router that can only forward a few bits per tick”) to be followed by developers familiar with distributed systems but not necessarily with deep algebraic complexity theory.

Results & Findings

Instance	Parameter range	Round complexity (upper)	Round complexity (lower)	Gap
(\langle d,n,d\rangle)	any (d \le n)	(\tilde O(d^{4/3}))	(\Omega(d)) (trivial)	Within polylog factor of optimal
(\langle n,d,n\rangle)	(d \le \sqrt{n})	(\Theta(d\sqrt{n}))	(\Theta(d\sqrt{n}))	Tight
(\langle n,d,n\rangle)	(d \ge \sqrt{n})	(O(d^{2/3} n^{2/3}))	(\Omega(n)) (trivial)	Open gap, but shows a phase transition

Interpretation

When the inner dimension (d) is small (≤ √n), the bottleneck is the need to shuttle (d) rows/columns across the network, leading to a linear dependence on (d) and a √n factor from the number of nodes that must be involved.
Once (d) grows beyond √n, the problem behaves more like square multiplication; the authors exploit faster rectangular multiplication algorithms to shave the exponent down to (2/3) on both (d) and (n).
The (\langle d,n,d\rangle) case mirrors the classic square‑matrix bound ((\tilde O(n^{4/3}))) but scales with the smaller side (d), making it attractive for “tall‑skinny” or “short‑wide” data matrices.

Practical Implications

Distributed data‑analytics pipelines (e.g., large‑scale recommendation systems) often need to multiply a user‑feature matrix (size (n \times d)) by a feature‑item matrix (size (d \times n)). The (\langle n,d,n\rangle) results give concrete expectations for how many communication rounds are required on a cluster where each node has limited network bandwidth (e.g., edge‑computing or IoT clusters).
Sparse‑aware frameworks can now decide whether to reshape a rectangular multiplication into a series of smaller square multiplications (using the (\langle d,n,d\rangle) bound) or to stick with the original layout, based on the aspect ratio.
The phase‑transition insight helps system designers choose optimal block sizes for matrix partitioning: for (d) below √n, a block‑size proportional to (d) is optimal; above √n, a more balanced block (≈(n^{2/3} d^{1/3})) yields fewer rounds.
In low‑bandwidth environments (e.g., satellite constellations, sensor networks, or privacy‑preserving federated learning where each client can only send tiny messages), these bounds give a realistic performance ceiling and can guide protocol design (e.g., when to aggregate locally vs. when to broadcast).

Limitations & Future Work

The lower bounds for the (d \ge \sqrt{n}) regime are still trivial ((\Omega(n))), leaving a gap between the (O(d^{2/3} n^{2/3})) upper bound and the known lower bound. Tightening this gap is an open challenge.
The analysis assumes uniform random distribution of input entries across nodes; real‑world data often exhibits skew, which could affect both communication patterns and round complexity.
The work focuses on semiring multiplication (e.g., Boolean, min‑plus). Extending the results to floating‑point or modular arithmetic—common in machine‑learning workloads—may require additional techniques.
Finally, the model restricts each round to (O(\log n)) bits per edge. Investigating how the results change under slightly higher bandwidth (e.g., polylog or constant‑size messages) could bridge the gap between this theoretical model and practical network stacks.

Authors

Chetan Gupta
Jukka Suomela
Hossein Vahidi

Paper Information

arXiv ID: 2606.04652v1
Categories: cs.DC
Published: June 3, 2026
PDF: Download PDF

[Paper] Rectangular Matrix Multiplication in the Low-Bandwidth Model

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs

[Paper] The local complexity of certifying parity

[Paper] The Usefulness Gap in Proof-of-Useful-Work: An Empirical Study of Pearl's cuPOW Protocol

[Paper] Clownfish: Scaling DAG-based BFT Consensus via Sparse Edges