[Paper] Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training
Source: arXiv - 2602.20656v1
Overview
The paper introduces Lagom, a system that intelligently overlaps communication and computation during distributed training of large language models (LLMs). By co‑tuning communication parameters, Lagom automatically finds a sweet spot where neither the network nor the GPUs become a bottleneck, delivering measurable speed‑ups on both high‑ and low‑bandwidth GPU clusters.
Key Contributions
- Unified Cost Model – A single analytical model that captures both compute and communication costs, enabling direct comparison of different parallelization strategies.
- Priority‑Based Search Algorithm – Reduces the combinatorial explosion of possible parameter settings from exponential to linear time, making runtime tuning practical.
- Co‑tuning of Communication Parameters – Simultaneously adjusts message size, aggregation depth, and scheduling priority to keep GPUs busy while the network works in the background.
- Broad Evaluation – Demonstrates consistent speed‑ups (1.03‑1.33×) across a variety of LLMs (GPT‑2, BERT, T5) and parallelism schemes (data, tensor, pipeline) on both 100 Gbps and 25 Gbps GPU clusters.
- Open‑Source Prototype – The implementation builds on NCCL/AutoCCL and can be dropped into existing PyTorch/Docker pipelines with minimal code changes.
Methodology
- Profiling Phase – Lagom first runs a short micro‑benchmark on the target cluster to measure raw compute throughput (FLOPs per second) and network characteristics (latency, bandwidth, contention).
- Cost Modeling – Using these measurements, the system constructs a cost equation:

  \[ \text{Total Time} = \frac{\text{Compute Work}}{\text{Compute Rate}} + \frac{\text{Comm Volume}}{\text{Effective Bandwidth}} + \text{Overlap Penalty} \]

  The "overlap penalty" quantifies how much communication stalls the compute pipeline.
- Parameter Space Definition – Lagom defines a set of tunable knobs:
  - Chunk size (how many tensors are packed together)
  - Aggregation depth (how many reduction steps are pipelined)
  - Priority levels (which tensors get sent first)
- Priority‑Based Search – Instead of enumerating all combinations, Lagom ranks knobs by their marginal impact on the cost model and greedily explores the most promising ones. The search stops once the marginal gain falls below a threshold, guaranteeing linear runtime.
- Runtime Adaptation – During training, Lagom monitors actual overlap efficiency and can re‑invoke the search if the workload or network conditions drift.
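The cost-modeling and priority-based-search steps above can be sketched in code. This is an illustrative reconstruction, not the paper's implementation: `ClusterProfile`, `priority_search`, the knob value ranges, and all numbers in the toy cost function are invented for demonstration; only the shape of the cost equation and the greedy, marginal-gain stopping rule come from the summary above.

```python
from dataclasses import dataclass

@dataclass
class ClusterProfile:
    compute_rate: float   # FLOP/s, measured in the profiling phase
    bandwidth: float      # effective bytes/s on the interconnect

def total_time(work_flops, comm_bytes, profile, overlap_penalty):
    """Unified cost model: compute term + communication term + overlap penalty."""
    return (work_flops / profile.compute_rate
            + comm_bytes / profile.bandwidth
            + overlap_penalty)

def priority_search(knobs, cost_fn, rel_threshold=0.01):
    """Greedy, linear-time sweep over the tuning knobs.

    Knobs are ranked once by the marginal impact of moving off their default
    value; each knob is then tuned in turn, stopping as soon as the marginal
    gain falls below `rel_threshold` of the current best time.
    """
    config = {name: vals[0] for name, vals in knobs.items()}
    best = cost_fn(config)
    # Rank knobs by how much their most aggressive setting helps in isolation.
    impact = {n: best - cost_fn({**config, n: knobs[n][-1]}) for n in knobs}
    for name in sorted(knobs, key=lambda n: -impact[n]):
        for value in knobs[name][1:]:
            cand = cost_fn({**config, name: value})
            if best - cand <= rel_threshold * best:
                break  # marginal gain exhausted; move to the next knob
            config[name], best = value, cand
    return config, best

# Toy workload on a hypothetical node with a 100 Gbps link: larger chunks
# amortize per-message latency but raise the overlap penalty.
profile = ClusterProfile(compute_rate=300e12, bandwidth=12.5e9)

def toy_cost(cfg):
    comm_bytes = 4e9
    latency = 1e-4 * comm_bytes / cfg["chunk_size"]       # per-message startup cost
    stall = 1e-12 * cfg["chunk_size"] / cfg["agg_depth"]  # stall from oversized chunks
    return total_time(1e15, comm_bytes, profile, latency + stall)

knobs = {"chunk_size": [2**20, 2**22, 2**24, 2**26], "agg_depth": [1, 2, 4, 8]}
config, t = priority_search(knobs, toy_cost)
```

Because each knob is visited once in ranked order, the search cost grows linearly with the number of knob values rather than exponentially with their cross product.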
Results & Findings
| Cluster | Model / Parallelism | Baseline (NCCL) | Baseline (AutoCCL) | Lagom |
|---|---|---|---|---|
| 100 Gbps (8×A100) | GPT‑2, tensor parallel 8 | 1.00× | 1.08× | 1.33× |
| 25 Gbps (4×V100) | BERT, pipeline 4 | 1.00× | 1.03× | 1.27× |
| Mixed (data + tensor) | T5, 16‑GPU | 1.00× | 1.07× | 1.20× |
- Communication‑bound regimes (large tensor‑parallel degrees) saw the biggest gains because Lagom could pack more tensors into fewer network calls.
- Computation‑bound regimes still benefited (≈3‑7 % speed‑up) by reducing idle GPU time caused by occasional network stalls.
- The linear‑time search added < 2 % overhead to total training time, confirming its practicality.
Practical Implications
- Faster Model Iteration – Teams can shave days off multi‑week LLM pre‑training runs without buying new hardware.
- Cost Savings on Cloud – Better overlap translates to lower GPU‑hour consumption, especially on spot‑instance fleets where network quality varies.
- Simplified Ops – Lagom’s auto‑tuning removes the need for manual “hand‑tuning” of NCCL parameters, a pain point for DevOps engineers scaling out to 64‑GPU pods.
- Portability – Because Lagom works on top of standard NCCL/AutoCCL, it can be integrated into existing PyTorch `torch.distributed` scripts with a single `lagom.init()` call.
- Edge Cases – In low‑bandwidth on‑prem clusters (e.g., 10 Gbps Ethernet), Lagom's ability to prioritize critical gradients can keep training stable where naive scaling would otherwise diverge.
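A rough sketch of what that drop-in integration might look like, assuming a hypothetical `lagom` package exposing the `init()` hook mentioned above. The real API surface is not described in this summary, so the `torch.distributed` and `lagom` calls are shown as comments around an otherwise standard bootstrap:

```python
import os

def init_distributed() -> str:
    """Typical torch.distributed bootstrap with the one-line Lagom hook added."""
    backend = "nccl" if os.environ.get("CUDA_VISIBLE_DEVICES") else "gloo"
    # torch.distributed.init_process_group(backend)  # existing script, unchanged
    # lagom.init()  # the single added call; hypothetical API per the summary above
    return backend

backend = init_distributed()
```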
Limitations & Future Work
- Model‑Specific Tuning – The cost model assumes relatively static compute/communication ratios; highly dynamic workloads (e.g., adaptive sparsity) may need more frequent re‑tuning.
- Hardware Diversity – Evaluations focused on NVIDIA GPUs and NCCL; extending to AMD HIP or TPU interconnects will require additional profiling hooks.
- Scalability Beyond 64 GPUs – While the linear search scales well, the paper does not report results on > 128‑GPU clusters where network topology (e.g., fat‑tree vs. dragonfly) could introduce new bottlenecks.
- Integration with Scheduler – Future work could expose Lagom’s cost model to cluster schedulers for joint job placement and communication‑aware resource allocation.
Authors
- Guanbin Xu
- ZhenGuo Xu
- Yuzhe Li
- Youhui Bai
- Ping Gong
- Chaoyi Ruan
- Cheng Li
Paper Information
- arXiv ID: 2602.20656v1
- Categories: cs.DC
- Published: February 24, 2026