[Paper] TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
Source: arXiv - 2604.24088v1
Overview
Training today’s massive language models often relies on tensor parallelism (TP), which splits large weight matrices across many GPUs. While TP scales compute, it also forces frequent exchanges of huge intermediate tensors, creating a communication bottleneck. The paper introduces TACO, a lightweight FP8‑based compression framework that slashes the volume of TP traffic without sacrificing model quality, delivering up to 1.87× faster end‑to‑end training on GPT‑style and Qwen models.
Key Contributions
- FP8‑centric compression pipeline that combines a data‑driven reshaping step with an Adaptive Scale‑Hadamard Transform (ASHT) for high‑fidelity quantization of intermediate tensors.
- Dual‑Scale Quantization (DSQ) mechanism that preserves numerical stability across the entire training run, preventing overflow/underflow that typically plagues low‑precision schemes.
- Highly fused compression operator that merges reshaping, scaling, and quantization into a single GPU kernel, dramatically reducing memory traffic and kernel‑launch overhead.
- Seamless integration with existing data‑parallel (DP) and pipeline‑parallel (PP) runtimes, yielding a 3‑D parallel training stack (DP × PP × TP) that can be dropped into popular frameworks (e.g., Megatron‑LM, DeepSpeed).
- Extensive empirical validation on GPT‑2/3‑scale models and the Qwen family, showing near‑lossless perplexity/accuracy while boosting throughput by up to 1.87×.
Methodology
- Reshaping & Distribution Awareness – Before compression, each intermediate tensor is rearranged based on its empirical value distribution (learned from a short calibration run). This “data‑driven reshaping” concentrates the bulk of the signal into a smaller sub‑space, making subsequent quantization more effective.
- Adaptive Scale‑Hadamard Transform (ASHT) – A lightweight orthogonal transform (Hadamard) is applied with per‑tensor scaling factors that adapt to the dynamic range observed during training. The transform decorrelates the data, further tightening the distribution around zero.
- FP8 Quantization + Dual‑Scale Quantization – The transformed tensor is quantized to 8‑bit floating‑point (FP8). DSQ keeps two scaling factors (one for the forward pass, one for the backward pass) so that gradients and activations retain sufficient precision even when the same compressed representation is reused.
- Fused Compression Kernel – All steps (reshape → ASHT → scaling → FP8 cast) are implemented in a single CUDA kernel, eliminating intermediate buffers and allowing the kernel to run concurrently with NCCL communication.
- 3‑D Parallel Integration – TACO’s compression/decompression hooks replace the default all‑reduce/all‑gather calls in the TP layer of existing 3‑D parallel trainers, leaving DP and PP logic untouched.
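The summary describes the reshaping step only at a high level, so the following is a minimal sketch of one plausible realization, not the paper’s actual algorithm: a short calibration pass ranks channels by mean magnitude, and a permutation then groups channels of similar energy so each quantization block later sees a tighter range. `calibrate_permutation`, the block size, and the energy statistic are all illustrative assumptions.

```python
import numpy as np

def calibrate_permutation(samples):
    """Illustrative stand-in for the data-driven reshaping: rank channels
    by mean |activation| over a short calibration run, so channels of
    similar magnitude land in the same quantization block (assumption)."""
    energy = np.mean([np.abs(s).mean(axis=0) for s in samples], axis=0)
    return np.argsort(energy)              # permutation: low -> high energy

def reshape_for_compression(x, perm, block=128):
    """Apply the calibrated channel permutation, then view the tensor as
    (num_blocks, block) so each block can get its own scale later."""
    x = x[:, perm]                         # group similar-magnitude channels
    return x.reshape(-1, block)            # assumes x.size % block == 0

# Hypothetical usage with synthetic "calibration" activations.
rng = np.random.default_rng(0)
calib = [rng.standard_normal((256, 512)) * rng.uniform(0.1, 4.0, size=512)
         for _ in range(8)]
perm = calibrate_permutation(calib)
blocks = reshape_for_compression(calib[0], perm)
print(blocks.shape)                        # (1024, 128)
```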
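ASHT is likewise only named here, so this sketch combines the standard fast Walsh‑Hadamard transform (the usual O(n log n) way to apply a Hadamard rotation) with an assumed running‑amax scale update. The orthonormal normalization and the momentum rule are assumptions, not confirmed details of ASHT.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a
    power of two). Orthonormal scaling makes the transform its own inverse."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def adaptive_scale(x, running_amax, momentum=0.95):
    """Assumed 'adaptive' part: track a running max of |x| so the scale
    follows the dynamic range observed as training progresses."""
    running_amax = momentum * running_amax + (1 - momentum) * np.abs(x).max()
    return x / running_amax, running_amax

blocks = np.random.default_rng(1).standard_normal((1024, 128))
rotated = fwht(blocks)                     # decorrelate, tighten around zero
scaled, amax = adaptive_scale(rotated, running_amax=1.0)
print(np.allclose(fwht(rotated), blocks))  # True: involution check
```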
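FP8 details (E4M3 vs. E5M2, scale granularity) are not spelled out in the summary, so this sketch simulates E4M3 quantization in software and keeps two independent scales, one sized from forward activations and one from backward gradients, which is how we read the dual‑scale idea. The constant 448 is the E4M3 max‑normal value; everything else is an assumption.

```python
import numpy as np

E4M3_MAX = 448.0        # largest normal value representable in FP8 E4M3

def quantize_fp8(x, scale):
    """Simulated FP8 E4M3 cast: scale into the representable range, then
    round mantissas to 3 bits. A rough software stand-in; the real system
    would use the hardware cast on FP8-capable tensor cores."""
    y = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(np.abs(y), 1e-12)))
    step = 2.0 ** (exp - 3)                # 3 mantissa bits per binade
    return np.round(y / step) * step

def dequantize_fp8(q, scale):
    return q * scale

class DualScaleQuantizer:
    """Assumed DSQ reading: separate scales for forward activations and
    backward gradients, each set from that tensor's own amax."""
    def __init__(self):
        self.fwd_scale = 1.0
        self.bwd_scale = 1.0

    def forward(self, act):
        self.fwd_scale = np.abs(act).max() / E4M3_MAX
        return quantize_fp8(act, self.fwd_scale)

    def backward(self, grad):
        self.bwd_scale = np.abs(grad).max() / E4M3_MAX
        return quantize_fp8(grad, self.bwd_scale)

dsq = DualScaleQuantizer()
act = np.random.default_rng(2).standard_normal((1024, 128))
q = dsq.forward(act)
err = np.abs(dequantize_fp8(q, dsq.fwd_scale) - act).max()
print(f"max abs error: {err:.4f}")         # small relative to the data range
```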
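The fused kernel itself is CUDA; the sketch below (reusing `fwht`, `quantize_fp8`, `dequantize_fp8`, and `E4M3_MAX` from the sketches above) only mirrors its data flow in Python, composing reshape → Hadamard → scale → FP8 cast into one `compress` call and wrapping it around a generic TP all‑reduce the way TACO’s hooks replace the default collective. `allreduce_fn`, the shared‑scale assumption, and the decompress‑then‑use semantics are all illustrative, not the paper’s confirmed design.

```python
import numpy as np

def compress(x, block=128):
    """reshape -> Hadamard -> scale -> FP8 cast as one logical step. In the
    paper this is a single fused CUDA kernel, which avoids materializing
    the intermediate buffers a step-by-step implementation would write."""
    blocks = x.reshape(-1, block)
    rotated = fwht(blocks)
    scale = np.abs(rotated).max() / E4M3_MAX
    return quantize_fp8(rotated, scale), scale

def decompress(q, scale, shape):
    rotated = dequantize_fp8(q, scale)
    return fwht(rotated).reshape(shape)    # orthonormal Hadamard: own inverse

def compressed_allreduce(x, allreduce_fn):
    """Drop-in stand-in for a TP all-reduce: the payload crosses the wire
    in FP8 instead of 16-bit. NB: reducing quantized payloads assumes all
    ranks share one scale; a real system would synchronize scales
    (an assumption here). `allreduce_fn` stands in for the NCCL call."""
    q, scale = compress(x)
    q_sum = allreduce_fn(q)                # 8-bit traffic instead of 16-bit
    return decompress(q_sum, scale, x.shape)

# Single-process demo with an identity "collective".
x = np.random.default_rng(3).standard_normal((256, 512))
out = compressed_allreduce(x, allreduce_fn=lambda t: t)
print(np.abs(out - x).max())               # quantization error only
```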
Results & Findings
| Model | #GPUs | Baseline TP Throughput | TACO Throughput | Speed‑up | Final Accuracy (PPL / BLEU) |
|---|---|---|---|---|---|
| GPT‑2‑1.5B | 64 | 1.02 TFLOP/s | 1.84 TFLOP/s | 1.80× | ≈ unchanged |
| GPT‑3‑6.7B | 128 | 0.58 TFLOP/s | 1.09 TFLOP/s | 1.87× | ≈ unchanged |
| Qwen‑7B | 256 | 0.42 TFLOP/s | 0.73 TFLOP/s | 1.74× | ≈ unchanged |
- Communication volume dropped by ~45 % on average after FP8 compression (consistent with halving 16‑bit payloads, minus the overhead of per‑tensor scale metadata).
- Kernel launch overhead reduced by ~30 % thanks to the fused operator.
- Training stability remained comparable to full‑precision TP; loss curves overlapped almost perfectly.
- The approach works across both the GPT and Qwen model families (both decoder‑only transformers), demonstrating applicability beyond a single model line.
Practical Implications
- Cost Savings: By cutting inter‑GPU traffic, cloud users can train larger models on the same hardware budget or finish training cycles faster, reducing GPU‑hour expenses.
- Scalability: TACO makes it feasible to push TP beyond the usual 64‑GPU ceiling without hitting a network bottleneck, opening the door to training substantially larger LLMs on commodity clusters.
- Framework Adoption: Because TACO is implemented as a drop‑in replacement for the TP communication primitives, developers using Megatron‑LM, DeepSpeed, or FairScale can enable it with minimal code changes.
- Edge‑to‑Cloud Continuity: The same FP8 quantization pipeline could be repurposed for inference‑time tensor compression (e.g., model‑parallel inference on multi‑node edge clusters), roughly halving communication payloads relative to 16‑bit and potentially cutting latency in bandwidth‑bound settings.
- Hardware Alignment: FP8 is natively supported on NVIDIA Hopper and recent AMD GPUs, so TACO can exploit tensor‑core acceleration for the quantization/dequantization steps, further boosting performance.
Limitations & Future Work
- Hardware Dependency: The current implementation assumes GPUs with fast FP8 support; older hardware would fall back to emulated FP8, diminishing gains.
- Calibration Overhead: The data‑driven reshaping requires a short calibration phase at the start of training; while modest, it adds a step that may need automation for fully dynamic workloads.
- Extending Beyond TP: The paper focuses on TP tensors; applying the same compression ideas to DP gradients or PP activations remains an open question.
- Robustness to Extreme Scaling: Experiments stop at 256 GPUs; future work should verify stability and speed‑up when scaling to thousands of nodes, where network topology effects become more pronounced.
Overall, TACO offers a pragmatic, high‑impact solution for the communication bottleneck that has long hampered tensor‑parallel LLM training, and it paves the way for more cost‑effective, large‑scale model development.
Authors
- Man Liu
- Xingchen Liu
- Xingjian Tian
- Bing Lu
- Shengkay Lyu
- Shengquan Yin
- Wenjing Huang
- Zheng Wei
- Hairui Zhao
- Guangming Tan
- Dingwen Tao
Paper Information
- arXiv ID: 2604.24088v1
- Categories: cs.DC, cs.AI
- Published: April 27, 2026