[Paper] HetCCL: Accelerating LLM Training with Heterogeneous GPUs

Published: January 30, 2026

Source: arXiv - 2601.22585v1

Overview

The paper introduces HetCCL, a new collective‑communication library that lets large‑language‑model (LLM) training run efficiently on GPU clusters built from a mix of NVIDIA and AMD cards. By bridging the gap between vendor‑specific communication stacks (NCCL and RCCL) without touching drivers, HetCCL makes heterogeneous GPU farms practical, cutting both time‑to‑train and hardware costs.

Key Contributions

  • Unified communication layer that transparently combines NVIDIA’s NCCL and AMD’s RCCL, enabling RDMA‑based data exchange across different GPU vendors.
  • Two novel cross‑vendor mechanisms: (1) a backend‑agnostic routing shim that forwards collective calls to the appropriate vendor library, and (2) an RDMA‑accelerated transport that bypasses the host‑CPU bottleneck while preserving vendor‑level optimizations.
  • Zero‑code‑change integration: existing PyTorch/TensorFlow training scripts run unchanged on heterogeneous clusters.
  • Performance parity with native NCCL/RCCL in homogeneous settings and up to 1.3× speed‑up in mixed‑vendor configurations.
  • Open‑source implementation and a lightweight API that can be dropped into any standard deep‑learning framework.
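The routing shim behind the zero‑code‑change claim can be pictured as a small dispatch layer: detect the vendor of each device, then forward every collective call to the matching native library. The sketch below is purely illustrative; the names (`detect_vendor`, `VENDOR_BACKENDS`, `all_reduce`) and the stand‑in "libraries" are assumptions for exposition, not HetCCL's actual API.

```python
# Hypothetical sketch of a backend-agnostic routing shim.
# detect_vendor / VENDOR_BACKENDS are illustrative names, not HetCCL's API.

def detect_vendor(device_name: str) -> str:
    """Guess the GPU vendor from a device-name string."""
    name = device_name.lower()
    if "nvidia" in name or "a100" in name:
        return "nvidia"
    if "amd" in name or "mi250" in name:
        return "amd"
    raise ValueError(f"unknown device: {device_name}")

# In the real system each vendor maps to its native collective library
# (NCCL or RCCL); here the "libraries" are stand-in functions.
VENDOR_BACKENDS = {
    "nvidia": lambda values: sum(values),  # stands in for an NCCL all-reduce
    "amd":    lambda values: sum(values),  # stands in for an RCCL all-reduce
}

def all_reduce(device_name: str, values):
    """Dispatch an all-reduce to the backend matching the device's vendor."""
    backend = VENDOR_BACKENDS[detect_vendor(device_name)]
    return backend(values)

print(all_reduce("NVIDIA A100", [1, 2, 3]))  # routed to the NCCL stand-in: 6
```

Because the dispatch happens beneath the framework's collective API, training scripts that already call a generic all‑reduce never see the vendor split, which is what allows unchanged PyTorch/TensorFlow code to run.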

Methodology

  1. Backend Abstraction – HetCCL defines a thin abstraction layer that detects the GPU vendor at runtime and routes collective operations (e.g., all‑reduce, broadcast) to the matching vendor library.
  2. RDMA Transport Engine – Instead of relying on PCIe‑host memory copies, HetCCL leverages InfiniBand/RoCE RDMA to move tensors directly between GPU memories across nodes, regardless of vendor.
  3. Hybrid Scheduling – For a given collective, HetCCL partitions the participating GPUs into homogeneous sub‑groups (NVIDIA‑only, AMD‑only) that use their native libraries, then stitches the sub‑results together via the RDMA engine.
  4. Evaluation Setup – The authors built a 16‑node cluster (8 × NVIDIA A100, 8 × AMD MI250) connected via 200 Gb/s InfiniBand. They benchmarked standard LLM training kernels (BERT‑large, GPT‑2‑XL) and measured end‑to‑end training throughput, latency of collective ops, and scaling efficiency.
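The hybrid scheduling in step 3 amounts to a hierarchical all‑reduce: each homogeneous sub‑group reduces with its native library, the per‑vendor partial results are exchanged across sub‑groups (the RDMA hop), and the global result is broadcast back. The simulation below illustrates that three‑phase structure with plain Python sums; the function name and data layout are assumptions, not the paper's implementation.

```python
# Illustrative simulation of hybrid scheduling: native reduce inside each
# homogeneous sub-group, cross-vendor exchange of partials (RDMA in HetCCL),
# then a broadcast of the global result. Names here are illustrative.

def hierarchical_all_reduce(groups):
    """groups: dict mapping vendor -> list of per-GPU values."""
    # Phase 1: each homogeneous sub-group reduces with its "native" library.
    partials = {vendor: sum(values) for vendor, values in groups.items()}
    # Phase 2: stitch the partial results together across vendors.
    total = sum(partials.values())
    # Phase 3: broadcast the global sum back to every participating GPU.
    return {vendor: [total] * len(values) for vendor, values in groups.items()}

cluster = {"nvidia": [1, 2, 3, 4], "amd": [5, 6, 7, 8]}
result = hierarchical_all_reduce(cluster)
print(result["nvidia"][0])  # 36: every GPU ends up holding the global sum
```

The payoff of this structure is that the cross‑vendor exchange moves only one partial result per sub‑group rather than per‑GPU tensors, so the slower inter‑vendor path carries the minimum possible traffic.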

Results & Findings

| Scenario | Baseline (NCCL/RCCL) | HetCCL | Speed‑up vs. Baseline |
|---|---|---|---|
| Homogeneous NVIDIA (8 A100) | 1.00× (NCCL) | 0.99× | −1 % |
| Homogeneous AMD (8 MI250) | 1.00× (RCCL) | 1.01× | +1 % |
| Mixed (4 A100 + 4 MI250) | NCCL‑only or RCCL‑only (inefficient) | HetCCL | 1.22× (overall) |
| End‑to‑end GPT‑2‑XL training (tokens/s) | 12.4 K | 15.3 K | +23 % |
| All‑reduce latency (256 MiB) | 1.8 ms (NCCL) / 2.0 ms (RCCL) | 1.9 ms | ≈ baseline |
  • Parity in homogeneous clusters shows HetCCL adds negligible overhead.
  • Cross‑vendor scaling is the differentiator: HetCCL avoids the “slow path” of forcing all GPUs to use a single vendor’s library, which would otherwise stall the faster devices.
  • Training cost reduction: By allowing organizations to mix older AMD cards with newer NVIDIA GPUs, total hardware spend can be lowered by up to 30 % while keeping training time competitive.

Practical Implications

  • Cost‑effective GPU farms – Companies can extend existing AMD GPU investments rather than buying an all‑NVIDIA refresh, accelerating ROI on prior capital expenditures.
  • Simplified DevOps – No need to rewrite training scripts or maintain separate clusters; HetCCL’s drop‑in API works with the same PyTorch/TensorFlow codebases.
  • Cloud‑provider flexibility – Multi‑tenant cloud services that expose both NVIDIA and AMD instances can now offer “heterogeneous” VM families without sacrificing performance, opening up new pricing tiers.
  • Future‑proofing – As newer vendors (e.g., Intel Xe‑HP) enter the market, the same abstraction pattern can be extended, protecting investments against vendor lock‑in.
  • Research acceleration – Academic labs with limited budgets can assemble mixed‑GPU clusters to train LLMs that would otherwise be out of reach, fostering more rapid experimentation.

Limitations & Future Work

  • RDMA dependency – HetCCL’s performance gains rely on high‑speed RDMA fabrics; on Ethernet‑only clusters the benefits diminish.
  • Vendor library updates – Compatibility must be re‑validated whenever NCCL or RCCL releases a major version, requiring ongoing maintenance.
  • Scalability beyond 16 nodes – The paper evaluates up to 16 nodes; larger‑scale tests (hundreds of GPUs) are left for future exploration.
  • Support for emerging interconnects – Extending the transport engine to leverage NVIDIA’s NVLink‑2 or AMD’s Infinity Fabric across nodes is an open research direction.

Overall, HetCCL demonstrates that heterogeneous GPU clusters are not just a theoretical possibility but a practical, high‑performance solution for today’s LLM training workloads.

Authors

  • Heehoon Kim
  • Jaehwan Lee
  • Taejeoung Kim
  • Jongwon Park
  • Jinpyo Kim
  • Pyongwon Suh
  • Ryan H. Choi
  • Sangwoo Lee
  • Jaejin Lee

Paper Information

  • arXiv ID: 2601.22585v1
  • Categories: cs.DC, cs.LG
  • Published: January 30, 2026