[Paper] TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
Source: arXiv - 2604.24088v1
Overview
Training today’s massive language models often relies on tensor parallelism (TP), which splits large weight matrices across many GPUs. While TP scales compute, it also forces frequent exchanges of huge intermediate tensors, creating a communication bottleneck. The paper introduces TACO, a lightweight FP8‑based compression framework that slashes the volume of TP traffic without sacrificing model quality, delivering up to 1.87× faster end‑to‑end training on GPT‑style and Qwen models.
Key Contributions
- FP8‑centric compression pipeline that combines a data‑driven reshaping step with an Adaptive Scale‑Hadamard Transform (ASHT) for high‑fidelity quantization of intermediate tensors.
- Dual‑Scale Quantization (DSQ) mechanism that preserves numerical stability across the entire training run, preventing overflow/underflow that typically plagues low‑precision schemes.
- Highly fused compression operator that merges reshaping, scaling, and quantization into a single GPU kernel, dramatically reducing memory traffic and kernel‑launch overhead.
- Seamless integration with existing data‑parallel (DP) and pipeline‑parallel (PP) runtimes, yielding a 3‑D parallel training stack (DP × PP × TP) that can be dropped into popular frameworks (e.g., Megatron‑LM, DeepSpeed).
- Extensive empirical validation on GPT‑2/3‑scale models and the Qwen family, showing near‑lossless perplexity/accuracy while boosting throughput by up to 1.87×.
Methodology
- Reshaping & Distribution Awareness – Before compression, each intermediate tensor is rearranged based on its empirical value distribution (learned from a short calibration run). This “data‑driven reshaping” concentrates the bulk of the signal into a smaller sub‑space, making subsequent quantization more effective.
- Adaptive Scale‑Hadamard Transform (ASHT) – A lightweight orthogonal transform (Hadamard) is applied with per‑tensor scaling factors that adapt to the dynamic range observed during training. The transform decorrelates the data, further tightening the distribution around zero.
- FP8 Quantization + Dual‑Scale Quantization – The transformed tensor is quantized to 8‑bit floating‑point (FP8). DSQ keeps two scaling factors (one for the forward pass, one for the backward pass) so that gradients and activations retain sufficient precision even when the same compressed representation is reused.
- Fused Compression Kernel – All steps (reshape → ASHT → scaling → FP8 cast) are implemented in a single CUDA kernel, eliminating intermediate buffers and allowing the kernel to run concurrently with NCCL communication.
- 3‑D Parallel Integration – TACO’s compression/decompression hooks replace the default all‑reduce/all‑gather calls in the TP layer of existing 3‑D parallel trainers, leaving DP and PP logic untouched.
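The summary describes the reshaping step only at a high level, so the following is a minimal sketch of one plausible realization, not the paper’s actual algorithm: a short calibration pass ranks channels by mean magnitude, and a permutation then groups channels of similar energy so each quantization block later sees a tighter range. `calibrate_permutation`, the block size, and the energy statistic are all illustrative assumptions.

```python
import numpy as np

def calibrate_permutation(samples):
    """Illustrative stand-in for the data-driven reshaping: rank channels
    by mean |activation| over a short calibration run, so channels of
    similar magnitude land in the same quantization block (assumption)."""
    energy = np.mean([np.abs(s).mean(axis=0) for s in samples], axis=0)
    return np.argsort(energy)              # permutation: low -> high energy

def reshape_for_compression(x, perm, block=128):
    """Apply the calibrated channel permutation, then view the tensor as
    (num_blocks, block) so each block can get its own scale later."""
    x = x[:, perm]                         # group similar-magnitude channels
    return x.reshape(-1, block)            # assumes x.size % block == 0

# Hypothetical usage with synthetic "calibration" activations.
rng = np.random.default_rng(0)
calib = [rng.standard_normal((256, 512)) * rng.uniform(0.1, 4.0, size=512)
         for _ in range(8)]
perm = calibrate_permutation(calib)
blocks = reshape_for_compression(calib[0], perm)
print(blocks.shape)                        # (1024, 128)
```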
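ASHT is likewise only named here, so this sketch combines the standard fast Walsh‑Hadamard transform (the usual O(n log n) way to apply a Hadamard rotation) with an assumed running‑amax scale update. The orthonormal normalization and the momentum rule are assumptions, not confirmed details of ASHT.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a
    power of two). Orthonormal scaling makes the transform its own inverse."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def adaptive_scale(x, running_amax, momentum=0.95):
    """Assumed 'adaptive' part: track a running max of |x| so the scale
    follows the dynamic range observed as training progresses."""
    running_amax = momentum * running_amax + (1 - momentum) * np.abs(x).max()
    return x / running_amax, running_amax

blocks = np.random.default_rng(1).standard_normal((1024, 128))
rotated = fwht(blocks)                     # decorrelate, tighten around zero
scaled, amax = adaptive_scale(rotated, running_amax=1.0)
print(np.allclose(fwht(rotated), blocks))  # True: involution check
```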
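FP8 details (E4M3 vs. E5M2, scale granularity) are not spelled out in the summary, so this sketch simulates E4M3 quantization in software and keeps two independent scales, one sized from forward activations and one from backward gradients, which is how we read the dual‑scale idea. The constant 448 is the E4M3 max‑normal value; everything else is an assumption.

```python
import numpy as np

E4M3_MAX = 448.0        # largest normal value representable in FP8 E4M3

def quantize_fp8(x, scale):
    """Simulated FP8 E4M3 cast: scale into the representable range, then
    round mantissas to 3 bits. A rough software stand-in; the real system
    would use the hardware cast on FP8-capable tensor cores."""
    y = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(np.abs(y), 1e-12)))
    step = 2.0 ** (exp - 3)                # 3 mantissa bits per binade
    return np.round(y / step) * step

def dequantize_fp8(q, scale):
    return q * scale

class DualScaleQuantizer:
    """Assumed DSQ reading: separate scales for forward activations and
    backward gradients, each set from that tensor's own amax."""
    def __init__(self):
        self.fwd_scale = 1.0
        self.bwd_scale = 1.0

    def forward(self, act):
        self.fwd_scale = np.abs(act).max() / E4M3_MAX
        return quantize_fp8(act, self.fwd_scale)

    def backward(self, grad):
        self.bwd_scale = np.abs(grad).max() / E4M3_MAX
        return quantize_fp8(grad, self.bwd_scale)

dsq = DualScaleQuantizer()
act = np.random.default_rng(2).standard_normal((1024, 128))
q = dsq.forward(act)
err = np.abs(dequantize_fp8(q, dsq.fwd_scale) - act).max()
print(f"max abs error: {err:.4f}")         # small relative to the data range
```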
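The fused kernel itself is CUDA; the sketch below (reusing `fwht`, `quantize_fp8`, `dequantize_fp8`, and `E4M3_MAX` from the sketches above) only mirrors its data flow in Python, composing reshape → Hadamard → scale → FP8 cast into one `compress` call and wrapping it around a generic TP all‑reduce the way TACO’s hooks replace the default collective. `allreduce_fn`, the shared‑scale assumption, and the decompress‑then‑use semantics are all illustrative, not the paper’s confirmed design.

```python
import numpy as np

def compress(x, block=128):
    """reshape -> Hadamard -> scale -> FP8 cast as one logical step. In the
    paper this is a single fused CUDA kernel, which avoids materializing
    the intermediate buffers a step-by-step implementation would write."""
    blocks = x.reshape(-1, block)
    rotated = fwht(blocks)
    scale = np.abs(rotated).max() / E4M3_MAX
    return quantize_fp8(rotated, scale), scale

def decompress(q, scale, shape):
    rotated = dequantize_fp8(q, scale)
    return fwht(rotated).reshape(shape)    # orthonormal Hadamard: own inverse

def compressed_allreduce(x, allreduce_fn):
    """Drop-in stand-in for a TP all-reduce: the payload crosses the wire
    in FP8 instead of 16-bit. NB: reducing quantized payloads assumes all
    ranks share one scale; a real system would synchronize scales
    (an assumption here). `allreduce_fn` stands in for the NCCL call."""
    q, scale = compress(x)
    q_sum = allreduce_fn(q)                # 8-bit traffic instead of 16-bit
    return decompress(q_sum, scale, x.shape)

# Single-process demo with an identity "collective".
x = np.random.default_rng(3).standard_normal((256, 512))
out = compressed_allreduce(x, allreduce_fn=lambda t: t)
print(np.abs(out - x).max())               # quantization error only
```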
Results & Findings
| Model | #GPUs | Baseline TP Throughput | TACO Throughput | Speed‑up | Final Accuracy (PPL / BLEU) |
|---|---|---|---|---|---|
| GPT‑2‑1.5B | 64 | 1.02 TFLOP/s | 1.84 TFLOP/s | 1.80× | ≈ unchanged |
| GPT‑3‑6.7B | 128 | 0.58 TFLOP/s | 1.09 TFLOP/s | 1.87× | ≈ unchanged |
| Qwen‑7B | 256 | 0.42 TFLOP/s | 0.73 TFLOP/s | 1.74× | ≈ unchanged |
- Communication volume dropped by ~45 % on average after FP8 compression (consistent with halving 16‑bit payloads, minus the overhead of per‑tensor scale metadata).
- Kernel launch overhead reduced by ~30 % thanks to the fused operator.
- Training stability remained comparable to full‑precision TP; loss curves overlapped almost perfectly.
- The approach works across both the GPT and Qwen model families (both decoder‑only transformers), demonstrating applicability beyond a single model line.
Practical Implications
- Cost Savings: By cutting inter‑GPU traffic, cloud users can train larger models on the same hardware budget or finish training cycles faster, reducing GPU‑hour expenses.
- Scalability: TACO makes it feasible to push TP beyond the usual 64‑GPU ceiling without hitting a network bottleneck, opening the door to training substantially larger LLMs on commodity clusters.
- Framework Adoption: Because TACO is implemented as a drop‑in replacement for the TP communication primitives, developers using Megatron‑LM, DeepSpeed, or FairScale can enable it with minimal code changes.
- Edge‑to‑Cloud Continuity: The same FP8 quantization pipeline could be repurposed for inference‑time tensor compression (e.g., model‑parallel inference on multi‑node edge clusters), roughly halving communication payloads relative to 16‑bit and potentially cutting latency in bandwidth‑bound settings.
- Hardware Alignment: FP8 is natively supported on NVIDIA Hopper and recent AMD GPUs, so TACO can exploit tensor‑core acceleration for the quantization/dequantization steps, further boosting performance.
Limitations & Future Work
- Hardware Dependency: The current implementation assumes GPUs with fast FP8 support; older hardware would fall back to emulated FP8, diminishing gains.
- Calibration Overhead: The data‑driven reshaping requires a short calibration phase at the start of training; while modest, it adds a step that may need automation for fully dynamic workloads.
- Extending Beyond TP: The paper focuses on TP tensors; applying the same compression ideas to DP gradients or PP activations remains an open question.
- Robustness to Extreme Scaling: Experiments stop at 256 GPUs; future work should verify stability and speed‑up when scaling to thousands of nodes, where network topology effects become more pronounced.
Overall, TACO offers a pragmatic, high‑impact solution for the communication bottleneck that has long hampered tensor‑parallel LLM training, and it paves the way for more cost‑effective, large‑scale model development.
Authors
- Man Liu
- Xingchen Liu
- Xingjian Tian
- Bing Lu
- Shengkay Lyu
- Shengquan Yin
- Wenjing Huang
- Zheng Wei
- Hairui Zhao
- Guangming Tan
- Dingwen Tao
Paper Information
- arXiv ID: 2604.24088v1
- Categories: cs.DC, cs.AI
- Published: April 27, 2026