[Paper] HetCCL: Accelerating LLM Training with Heterogeneous GPUs

Published: January 30, 2026

Source: arXiv - 2601.22585v1

Overview

The paper introduces HetCCL, a new collective‑communication library that lets large‑language‑model (LLM) training run efficiently on GPU clusters built from a mix of NVIDIA and AMD cards. By bridging the gap between vendor‑specific communication stacks (NCCL and RCCL) without touching drivers, HetCCL makes heterogeneous GPU farms practical, cutting both time‑to‑train and hardware costs.

Key Contributions

  • Unified communication layer that transparently combines NVIDIA’s NCCL and AMD’s RCCL, enabling RDMA‑based data exchange across different GPU vendors.
  • Two novel cross‑vendor mechanisms: (1) a backend‑agnostic routing shim that forwards collective calls to the appropriate vendor library, and (2) an RDMA‑accelerated transport that bypasses the host‑CPU bottleneck while preserving vendor‑level optimizations.
  • Zero‑code‑change integration: existing PyTorch/TensorFlow training scripts run unchanged on heterogeneous clusters.
  • Performance parity with native NCCL/RCCL in homogeneous settings and up to 1.3× speed‑up in mixed‑vendor configurations.
  • Open‑source implementation and a lightweight API that can be dropped into any standard deep‑learning framework.
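The routing shim behind the zero‑code‑change claim can be pictured as a small dispatch layer: detect the vendor of each device, then forward every collective call to the matching native library. The sketch below is purely illustrative; the names (`detect_vendor`, `VENDOR_BACKENDS`, `all_reduce`) and the stand‑in "libraries" are assumptions for exposition, not HetCCL's actual API.

```python
# Hypothetical sketch of a backend-agnostic routing shim.
# detect_vendor / VENDOR_BACKENDS are illustrative names, not HetCCL's API.

def detect_vendor(device_name: str) -> str:
    """Guess the GPU vendor from a device-name string."""
    name = device_name.lower()
    if "nvidia" in name or "a100" in name:
        return "nvidia"
    if "amd" in name or "mi250" in name:
        return "amd"
    raise ValueError(f"unknown device: {device_name}")

# In the real system each vendor maps to its native collective library
# (NCCL or RCCL); here the "libraries" are stand-in functions.
VENDOR_BACKENDS = {
    "nvidia": lambda values: sum(values),  # stands in for an NCCL all-reduce
    "amd":    lambda values: sum(values),  # stands in for an RCCL all-reduce
}

def all_reduce(device_name: str, values):
    """Dispatch an all-reduce to the backend matching the device's vendor."""
    backend = VENDOR_BACKENDS[detect_vendor(device_name)]
    return backend(values)

print(all_reduce("NVIDIA A100", [1, 2, 3]))  # routed to the NCCL stand-in: 6
```

Because the dispatch happens beneath the framework's collective API, training scripts that already call a generic all‑reduce never see the vendor split, which is what allows unchanged PyTorch/TensorFlow code to run.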

Methodology

  1. Backend Abstraction – HetCCL defines a thin abstraction layer that detects the GPU vendor at runtime and routes collective operations (e.g., all‑reduce, broadcast) to the matching vendor library.
  2. RDMA Transport Engine – Instead of relying on PCIe‑host memory copies, HetCCL leverages InfiniBand/RoCE RDMA to move tensors directly between GPU memories across nodes, regardless of vendor.
  3. Hybrid Scheduling – For a given collective, HetCCL partitions the participating GPUs into homogeneous sub‑groups (NVIDIA‑only, AMD‑only) that use their native libraries, then stitches the sub‑results together via the RDMA engine.
  4. Evaluation Setup – The authors built a 16‑node cluster (8 × NVIDIA A100, 8 × AMD MI250) connected via 200 Gb/s InfiniBand. They benchmarked standard LLM training kernels (BERT‑large, GPT‑2‑XL) and measured end‑to‑end training throughput, latency of collective ops, and scaling efficiency.
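The hybrid scheduling in step 3 amounts to a hierarchical all‑reduce: each homogeneous sub‑group reduces with its native library, the per‑vendor partial results are exchanged across sub‑groups (the RDMA hop), and the global result is broadcast back. The simulation below illustrates that three‑phase structure with plain Python sums; the function name and data layout are assumptions, not the paper's implementation.

```python
# Illustrative simulation of hybrid scheduling: native reduce inside each
# homogeneous sub-group, cross-vendor exchange of partials (RDMA in HetCCL),
# then a broadcast of the global result. Names here are illustrative.

def hierarchical_all_reduce(groups):
    """groups: dict mapping vendor -> list of per-GPU values."""
    # Phase 1: each homogeneous sub-group reduces with its "native" library.
    partials = {vendor: sum(values) for vendor, values in groups.items()}
    # Phase 2: stitch the partial results together across vendors.
    total = sum(partials.values())
    # Phase 3: broadcast the global sum back to every participating GPU.
    return {vendor: [total] * len(values) for vendor, values in groups.items()}

cluster = {"nvidia": [1, 2, 3, 4], "amd": [5, 6, 7, 8]}
result = hierarchical_all_reduce(cluster)
print(result["nvidia"][0])  # 36: every GPU ends up holding the global sum
```

The payoff of this structure is that the cross‑vendor exchange moves only one partial result per sub‑group rather than per‑GPU tensors, so the slower inter‑vendor path carries the minimum possible traffic.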

Results & Findings

| Scenario | Baseline (NCCL/RCCL) | HetCCL | Speed‑up vs. Baseline |
|---|---|---|---|
| Homogeneous NVIDIA (8 A100) | 1.00× (NCCL) | 0.99× | −1 % |
| Homogeneous AMD (8 MI250) | 1.00× (RCCL) | 1.01× | +1 % |
| Mixed (4 A100 + 4 MI250) | NCCL‑only or RCCL‑only (inefficient) | HetCCL | 1.22× (overall) |
| End‑to‑end GPT‑2‑XL training (tokens/s) | 12.4 K | 15.3 K | +23 % |
| All‑reduce latency (256 MiB) | 1.8 ms (NCCL) / 2.0 ms (RCCL) | 1.9 ms | ≈ baseline |
  • Parity in homogeneous clusters shows HetCCL adds negligible overhead.
  • Cross‑vendor scaling is the differentiator: HetCCL avoids the “slow path” of forcing all GPUs to use a single vendor’s library, which would otherwise stall the faster devices.
  • Training cost reduction: By allowing organizations to mix older AMD cards with newer NVIDIA GPUs, total hardware spend can be lowered by up to 30 % while keeping training time competitive.

Practical Implications

  • Cost‑effective GPU farms – Companies can extend existing AMD GPU investments rather than buying an all‑NVIDIA refresh, accelerating ROI on prior capital expenditures.
  • Simplified DevOps – No need to rewrite training scripts or maintain separate clusters; HetCCL’s drop‑in API works with the same PyTorch/TensorFlow codebases.
  • Cloud‑provider flexibility – Multi‑tenant cloud services that expose both NVIDIA and AMD instances can now offer “heterogeneous” VM families without sacrificing performance, opening up new pricing tiers.
  • Future‑proofing – As newer vendors (e.g., Intel Xe‑HP) enter the market, the same abstraction pattern can be extended, protecting investments against vendor lock‑in.
  • Research acceleration – Academic labs with limited budgets can assemble mixed‑GPU clusters to train LLMs that would otherwise be out of reach, fostering more rapid experimentation.

Limitations & Future Work

  • RDMA dependency – HetCCL’s performance gains rely on high‑speed RDMA fabrics; on Ethernet‑only clusters the benefits diminish.
  • Vendor library updates – Compatibility must be re‑validated whenever NCCL or RCCL releases a major version, requiring ongoing maintenance.
  • Scalability beyond 16 nodes – The paper evaluates up to 16 nodes; larger‑scale tests (hundreds of GPUs) are left for future exploration.
  • Support for emerging interconnects – Extending the transport engine to leverage NVIDIA’s NVLink‑2 or AMD’s Infinity Fabric across nodes is an open research direction.

Overall, HetCCL demonstrates that heterogeneous GPU clusters are not just a theoretical possibility but a practical, high‑performance solution for today’s LLM training workloads.

Authors

  • Heehoon Kim
  • Jaehwan Lee
  • Taejeoung Kim
  • Jongwon Park
  • Jinpyo Kim
  • Pyongwon Suh
  • Ryan H. Choi
  • Sangwoo Lee
  • Jaejin Lee

Paper Information

  • arXiv ID: 2601.22585v1
  • Categories: cs.DC, cs.LG
  • Published: January 30, 2026