[Paper] Lagom: Unleashing the Power of Communication and Computation Overlapping for Distributed LLM Training
Source: arXiv - 2602.20656v1
Overview
The paper introduces Lagom, a system that intelligently overlaps communication and computation during distributed training of large language models (LLMs). By co‑tuning communication parameters, Lagom automatically finds a sweet spot where neither the network nor the GPUs become a bottleneck, delivering measurable speed‑ups on both high‑ and low‑bandwidth GPU clusters.
Key Contributions
- Unified Cost Model – A single analytical model that captures both compute and communication costs, enabling direct comparison of different parallelization strategies.
- Priority‑Based Search Algorithm – Reduces the combinatorial explosion of possible parameter settings from exponential to linear time, making runtime tuning practical.
- Co‑tuning of Communication Parameters – Simultaneously adjusts message size, aggregation depth, and scheduling priority to keep GPUs busy while the network works in the background.
- Broad Evaluation – Demonstrates consistent speed‑ups (1.03‑1.33×) across a variety of LLMs (GPT‑2, BERT, T5) and parallelism schemes (data, tensor, pipeline) on both 100 Gbps and 25 Gbps GPU clusters.
- Open‑Source Prototype – The implementation builds on NCCL/AutoCCL and can be dropped into existing PyTorch/Docker pipelines with minimal code changes.
Methodology
- Profiling Phase – Lagom first runs a short micro‑benchmark on the target cluster to measure raw compute throughput (FLOPs per second) and network characteristics (latency, bandwidth, contention).
- Cost Modeling – Using these measurements, the system constructs a cost equation:

  \[ \text{Total Time} = \frac{\text{Compute Work}}{\text{Compute Rate}} + \frac{\text{Comm Volume}}{\text{Effective Bandwidth}} + \text{Overlap Penalty} \]

  The "overlap penalty" quantifies how much communication stalls the compute pipeline.
- Parameter Space Definition – Lagom defines a set of tunable knobs:
  - Chunk size (how many tensors are packed together)
  - Aggregation depth (how many reduction steps are pipelined)
  - Priority levels (which tensors get sent first)
- Priority‑Based Search – Instead of enumerating all combinations, Lagom ranks knobs by their marginal impact on the cost model and greedily explores the most promising ones. The search stops once the marginal gain falls below a threshold, guaranteeing linear runtime.
- Runtime Adaptation – During training, Lagom monitors actual overlap efficiency and can re‑invoke the search if the workload or network conditions drift.
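The cost-modeling and priority-based-search steps above can be sketched in code. This is an illustrative reconstruction, not the paper's implementation: `ClusterProfile`, `priority_search`, the knob value ranges, and all numbers in the toy cost function are invented for demonstration; only the shape of the cost equation and the greedy, marginal-gain stopping rule come from the summary above.

```python
from dataclasses import dataclass

@dataclass
class ClusterProfile:
    compute_rate: float   # FLOP/s, measured in the profiling phase
    bandwidth: float      # effective bytes/s on the interconnect

def total_time(work_flops, comm_bytes, profile, overlap_penalty):
    """Unified cost model: compute term + communication term + overlap penalty."""
    return (work_flops / profile.compute_rate
            + comm_bytes / profile.bandwidth
            + overlap_penalty)

def priority_search(knobs, cost_fn, rel_threshold=0.01):
    """Greedy, linear-time sweep over the tuning knobs.

    Knobs are ranked once by the marginal impact of moving off their default
    value; each knob is then tuned in turn, stopping as soon as the marginal
    gain falls below `rel_threshold` of the current best time.
    """
    config = {name: vals[0] for name, vals in knobs.items()}
    best = cost_fn(config)
    # Rank knobs by how much their most aggressive setting helps in isolation.
    impact = {n: best - cost_fn({**config, n: knobs[n][-1]}) for n in knobs}
    for name in sorted(knobs, key=lambda n: -impact[n]):
        for value in knobs[name][1:]:
            cand = cost_fn({**config, name: value})
            if best - cand <= rel_threshold * best:
                break  # marginal gain exhausted; move to the next knob
            config[name], best = value, cand
    return config, best

# Toy workload on a hypothetical node with a 100 Gbps link: larger chunks
# amortize per-message latency but raise the overlap penalty.
profile = ClusterProfile(compute_rate=300e12, bandwidth=12.5e9)

def toy_cost(cfg):
    comm_bytes = 4e9
    latency = 1e-4 * comm_bytes / cfg["chunk_size"]       # per-message startup cost
    stall = 1e-12 * cfg["chunk_size"] / cfg["agg_depth"]  # stall from oversized chunks
    return total_time(1e15, comm_bytes, profile, latency + stall)

knobs = {"chunk_size": [2**20, 2**22, 2**24, 2**26], "agg_depth": [1, 2, 4, 8]}
config, t = priority_search(knobs, toy_cost)
```

Because each knob is visited once in ranked order, the search cost grows linearly with the number of knob values rather than exponentially with their cross product.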
Results & Findings
| Cluster | Model / Parallelism | Baseline (NCCL) | Baseline (AutoCCL) | Lagom |
|---|---|---|---|---|
| 100 Gbps (8×A100) | GPT‑2, tensor parallel 8 | 1.00× | 1.08× | 1.33× |
| 25 Gbps (4×V100) | BERT, pipeline 4 | 1.00× | 1.03× | 1.27× |
| Mixed (data + tensor) | T5, 16‑GPU | 1.00× | 1.07× | 1.20× |
- Communication‑bound regimes (large tensor‑parallel degrees) saw the biggest gains because Lagom could pack more tensors into fewer network calls.
- Computation‑bound regimes still benefited (≈3‑7 % speed‑up) by reducing idle GPU time caused by occasional network stalls.
- The linear‑time search added < 2 % overhead to total training time, confirming its practicality.
Practical Implications
- Faster Model Iteration – Teams can shave days off multi‑week LLM pre‑training runs without buying new hardware.
- Cost Savings on Cloud – Better overlap translates to lower GPU‑hour consumption, especially on spot‑instance fleets where network quality varies.
- Simplified Ops – Lagom’s auto‑tuning removes the need for manual “hand‑tuning” of NCCL parameters, a pain point for DevOps engineers scaling out to 64‑GPU pods.
- Portability – Because Lagom works on top of standard NCCL/AutoCCL, it can be integrated into existing PyTorch `torch.distributed` scripts with a single `lagom.init()` call.
- Edge Cases – In low‑bandwidth on‑prem clusters (e.g., 10 Gbps Ethernet), Lagom's ability to prioritize critical gradients can keep training stable where naive scaling would otherwise diverge.
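A rough sketch of what that drop-in integration might look like, assuming a hypothetical `lagom` package exposing the `init()` hook mentioned above. The real API surface is not described in this summary, so the `torch.distributed` and `lagom` calls are shown as comments around an otherwise standard bootstrap:

```python
import os

def init_distributed() -> str:
    """Typical torch.distributed bootstrap with the one-line Lagom hook added."""
    backend = "nccl" if os.environ.get("CUDA_VISIBLE_DEVICES") else "gloo"
    # torch.distributed.init_process_group(backend)  # existing script, unchanged
    # lagom.init()  # the single added call; hypothetical API per the summary above
    return backend

backend = init_distributed()
```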
Limitations & Future Work
- Model‑Specific Tuning – The cost model assumes relatively static compute/communication ratios; highly dynamic workloads (e.g., adaptive sparsity) may need more frequent re‑tuning.
- Hardware Diversity – Evaluations focused on NVIDIA GPUs and NCCL; extending to AMD HIP or TPU interconnects will require additional profiling hooks.
- Scalability Beyond 64 GPUs – While the linear search scales well, the paper does not report results on > 128‑GPU clusters where network topology (e.g., fat‑tree vs. dragonfly) could introduce new bottlenecks.
- Integration with Scheduler – Future work could expose Lagom’s cost model to cluster schedulers for joint job placement and communication‑aware resource allocation.
Authors
- Guanbin Xu
- ZhenGuo Xu
- Yuzhe Li
- Youhui Bai
- Ping Gong
- Chaoyi Ruan
- Cheng Li
Paper Information
- arXiv ID: 2602.20656v1
- Categories: cs.DC
- Published: February 24, 2026