[Paper] Mesh-Attention: A New Communication-Efficient Distributed Attention with Improved Data Locality
Source: arXiv - 2512.20968v1
Overview
Scaling the attention mechanism of large language models (LLMs) across many GPUs is a bottleneck for ever‑larger context windows. The new Mesh‑Attention algorithm re‑thinks how work is split among GPUs, turning the classic one‑dimensional “ring” layout into a two‑dimensional tile‑based schedule. The result is dramatically less network traffic and higher throughput, making multi‑hundred‑GPU deployments of LLMs more practical.
Key Contributions
- 2‑D Tile Scheduling: Introduces a matrix‑based model that assigns a rectangular tile of attention blocks to each GPU, reducing the communication‑to‑computation ratio (CommCom).
- Generalization of Ring‑Attention: Shows that the existing Ring‑Attention is just a special case of the broader tile‑based framework, enabling flexible trade‑offs between latency and bandwidth.
- Greedy Tile‑Search Algorithm: Provides an efficient, provably‑correct scheduler that finds near‑optimal tile shapes under realistic inter‑GPU communication constraints.
- Theoretical Communication Analysis: Proves that Mesh‑Attention's per‑GPU communication scales as $O(\sqrt{P})$ with $P$ GPUs, versus the linear $O(P)$ scaling of Ring‑Attention (illustrated by the toy model after this list).
- Empirical Speedup & Bandwidth Savings: Demonstrates up to 3.4× speedup and an 85% reduction in data moved on a 256‑GPU cluster, with average gains of 2.9× speedup and 79% traffic reduction.
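The communication‑complexity claim is easy to make concrete with a toy per‑GPU cost model. The numbers below are back‑of‑envelope illustrations, not the paper's measurements: a ring forwards K/V blocks through all $P-1$ peers, while a near‑square tile grid only exchanges blocks along a tile's row and column.

```python
import math

def grid_shape(p: int) -> tuple:
    """Near-square factorization of p into (grid height, grid width)."""
    h = math.isqrt(p)
    while p % h:
        h -= 1
    return h, p // h

def ring_comm(p: int) -> int:
    """Ring-Attention: each GPU receives K/V blocks from all p - 1 peers."""
    return p - 1

def mesh_comm(p: int) -> int:
    """Mesh-Attention: exchanges stay within a tile's row and column."""
    h, w = grid_shape(p)
    return (h - 1) + (w - 1)

for p in (64, 128, 256):
    r, m = ring_comm(p), mesh_comm(p)
    print(f"P={p:3d}  ring={r:3d} blocks  mesh={m:2d} blocks  "
          f"reduction={1 - m / r:.0%}")
```

Even this crude model reproduces the qualitative picture: ring traffic grows linearly in $P$, mesh traffic grows with $\sqrt{P}$, so the gap widens as clusters scale.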
Methodology
- Matrix‑Based Decomposition: The attention score matrix (queries × keys) is split into a grid of blocks. Instead of giving each GPU a full row or column strip, Mesh‑Attention hands each GPU a tile, a contiguous sub‑matrix covering a set of query rows and key/value columns (see the assignment sketch after this list).
- Tile Shape Tuning: By adjusting the tile’s height and width, developers can control how much data must be exchanged. Wider tiles reduce the number of column‑wise gathers; taller tiles reduce row‑wise broadcasts.
- Greedy Scheduler: The authors devise a lightweight greedy algorithm that walks the grid, allocating tiles while respecting GPU memory limits and ensuring that any required all‑to‑all communication stays within the physical network topology (e.g., an NVLink mesh); a simplified stand‑in for this search is sketched after this list.
- Implementation Details: The algorithm is built on top of NCCL's collective primitives, reusing existing Ring‑Attention kernels where possible, but adding a "mesh‑reduction" step that aggregates partial results across the two dimensions of each tile.
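A minimal sketch of the 2‑D tile assignment described above (an assumed layout; the paper's scheduler additionally weighs memory and topology). Note how Ring‑Attention falls out as the degenerate 1‑D case `grid_h = P, grid_w = 1`, where each GPU owns a full row strip and must eventually see every K/V block:

```python
from typing import Dict, List

def assign_tiles(num_q_blocks: int, num_kv_blocks: int,
                 grid_h: int, grid_w: int) -> List[Dict]:
    """Partition the (query x key) block grid into grid_h * grid_w
    contiguous tiles, one per GPU, and return each GPU's block ranges."""
    assert num_q_blocks % grid_h == 0 and num_kv_blocks % grid_w == 0
    tile_q = num_q_blocks // grid_h
    tile_kv = num_kv_blocks // grid_w
    return [
        {"gpu": r * grid_w + c,
         "q_blocks": range(r * tile_q, (r + 1) * tile_q),
         "kv_blocks": range(c * tile_kv, (c + 1) * tile_kv)}
        for r in range(grid_h) for c in range(grid_w)
    ]

# assign_tiles(16, 16, 4, 4)[5] -> GPU 5 owns q_blocks 4..7, kv_blocks 4..7
```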
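And a brute‑force stand‑in for the greedy tile‑shape search: score every factorization of $P$ by an assumed communication‑to‑computation ratio and keep the best shape that fits per‑GPU memory. The cost model here (each GPU gathers its tile's input blocks once) is my simplification; the paper's actual scheduler is greedy and also checks network‑topology constraints.

```python
def search_tile_shape(P: int, q_blocks: int, kv_blocks: int,
                      block_bytes: int, mem_limit: int):
    """Return the (grid_h, grid_w) shape minimizing an assumed
    communication-to-computation ratio under a per-GPU memory cap."""
    best, best_ratio = None, float("inf")
    for h in (d for d in range(1, P + 1) if P % d == 0):
        w = P // h
        if q_blocks % h or kv_blocks % w:
            continue  # tile must evenly cover the block grid
        tile_q, tile_kv = q_blocks // h, kv_blocks // w
        comm = tile_q + tile_kv       # input blocks each GPU gathers
        compute = tile_q * tile_kv    # attention blocks it computes
        mem = comm * block_bytes      # Q and K/V blocks held at once
        if mem <= mem_limit and comm / compute < best_ratio:
            best, best_ratio = (h, w), comm / compute
    return best

# search_tile_shape(64, 64, 64, 1 << 20, 1 << 27) -> (8, 8): balanced,
# near-square tiles minimize comm per unit of compute, as the theory predicts.
```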
Results & Findings
| GPUs | Ring‑Attention throughput (baseline) | Mesh‑Attention throughput | Speedup | Communication volume reduction |
|---|---|---|---|---|
| 64 | 1.0× | 2.2× | 2.2× | 71% |
| 128 | 1.0× | 2.8× | 2.8× | 77% |
| 256 | 1.0× | 3.4× | 3.4× | 85% |
- Scalability: As the GPU count grows, the communication overhead of Ring‑Attention becomes dominant, while Mesh‑Attention’s overhead grows sub‑linearly, keeping the system compute‑bound.
- Memory Footprint: Tile‑based partitioning respects per‑GPU memory limits, enabling context state exceeding 1 TB in aggregate to run on the same hardware that previously required model‑parallelism tricks.
- Robustness: The greedy scheduler consistently finds tile shapes within 5% of the theoretical optimum across a variety of network topologies (ring, torus, fully‑connected).
Practical Implications
- Faster Inference for Long Contexts: Applications such as code assistants, document summarizers, or retrieval‑augmented generation can now process longer inputs without hitting a network bottleneck.
- Cost‑Effective Scaling: Reducing traffic by up to 85% translates directly into lower cloud‑network charges and less pressure on the interconnect fabric, extending the usable life of existing GPU clusters.
- Simplified Deployment: Because Mesh‑Attention builds on standard NCCL collectives, it can be dropped into existing PyTorch/DeepSpeed pipelines with minimal code changes: just replace the attention primitive (a hypothetical usage sketch follows this list).
- Enabling New Research: Researchers can experiment with context windows an order of magnitude larger, opening doors to better reasoning over documents, multi‑turn dialogues, and whole‑program analysis.
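To give a sense of what the "drop‑in replacement" claim could look like in practice, here is a hypothetical call signature. The paper does not publish a public API, so `mesh_attention`, its `tile_shape` parameter, and the single‑device fallback body below are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def mesh_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   tile_shape: tuple = None) -> torch.Tensor:
    """Hypothetical drop-in signature. On a single device this simply
    falls back to standard attention; a real distributed implementation
    would shard q/k/v across the tile grid and aggregate partial softmax
    results with NCCL collectives along each tile's row and column."""
    return F.scaled_dot_product_attention(q, k, v)

# Swapping the primitive inside an existing block is then a one-line change:
#   out = F.scaled_dot_product_attention(q, k, v)       # before
#   out = mesh_attention(q, k, v, tile_shape=(16, 16))  # after (hypothetical)
```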
Limitations & Future Work
- Topology Sensitivity: The current greedy scheduler assumes a relatively uniform mesh or torus interconnect; performance may degrade on highly irregular or hierarchical networks (e.g., multi‑node clusters with mixed Ethernet/NVLink).
- Static Tile Shapes: Tile dimensions are chosen once per training/inference run; dynamic workloads with varying sequence lengths could benefit from adaptive tiling.
- Extension to Sparse/Flash Attention: The paper focuses on dense attention; integrating Mesh‑Attention with emerging sparse or kernel‑fused attention kernels remains an open challenge.
- Hardware Heterogeneity: Future work could explore how to balance tiles across GPUs with differing memory or compute capabilities (e.g., mixing A100 and H100).
Mesh‑Attention shows that re‑thinking data locality at the algorithmic level can unlock substantial performance gains for LLMs at scale—an insight that developers and infrastructure teams can start leveraging right away.
Authors
- Sirui Chen
- Jingji Chen
- Siqi Zhu
- Ziheng Jiang
- Yanghua Peng
- Xuehai Qian
Paper Information
- arXiv ID: 2512.20968v1
- Categories: cs.DC, cs.AI
- Published: December 24, 2025