[Paper] Joint Training on AMD and NVIDIA GPUs
Source: arXiv - 2602.18007v1
Overview
The paper tackles a growing pain point for AI teams: training massive language models on clusters that mix AMD and NVIDIA GPUs. By designing communication pathways that let these different GPUs talk to each other efficiently, the authors show that mixed‑vendor setups can approach the performance of an all‑NVIDIA farm—opening the door to more flexible, cost‑effective hardware choices.
Key Contributions
- Compatibility‑oriented “CPU‑Forwarding” communication – a hybrid stack that selects the optimal communication backend per parallel group and leverages multiple NICs for parallel data movement.
- Device‑Direct Communication (DDC) scheme – a novel cross‑vendor peer‑to‑peer (P2P) mechanism that bypasses host memory, offloading the transfer work to the GPUs themselves.
- Implementation in popular LLM training pipelines (LLaMA‑8B, Qwen2‑7B) demonstrating near‑native NVIDIA throughput (up to 98 %).
- Comprehensive evaluation of training stability and correctness across heterogeneous hardware, confirming that the new communication paths do not introduce numerical drift.
Methodology
- System Model – The authors assume a distributed data‑parallel training job spread over several nodes, each node potentially housing a mix of AMD and NVIDIA GPUs.
- CPU‑Forwarding Layer –
  - Each parallel group (e.g., a tensor‑parallel slice) picks the most suitable communication library (ROCm‑aware MPI, NCCL, or standard TCP) based on the vendor mix.
  - Multi‑NIC aggregation splits traffic across the available network interfaces, reducing bottlenecks.
- Device‑Direct Communication (DDC) –
  - Introduces a lightweight GPU‑side P2P engine that can directly read/write remote GPU memory across the PCIe/Infinity Fabric/NVLink boundary.
  - A CPU‑offloading scheduler decides when to invoke DDC versus falling back to CPU‑forwarding, based on message size and topology.
- Integration with Existing Frameworks – The authors plug the new stack into PyTorch's `torch.distributed` backend, keeping the user‑level API unchanged.
- Benchmarking – Experiments run on 8‑GPU nodes (4 AMD + 4 NVIDIA) training LLaMA‑8B and Qwen2‑7B, comparing three setups: (a) all‑NVIDIA, (b) CPU‑forwarding only, (c) full DDC.
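The per-group selection logic above can be sketched as two small decision functions: one that picks a communication library for each parallel group from the options the paper names, and one that chooses between DDC and CPU-forwarding for a given transfer. All names (`GroupTopology`, `choose_backend`, `choose_path`) and the size threshold are illustrative assumptions, not the paper's actual API or policy, which the summary does not spell out.

```python
# Hypothetical sketch of per-group backend selection and the size/topology-based
# DDC vs. CPU-forwarding decision. Names and threshold are illustrative only.
from dataclasses import dataclass


@dataclass
class GroupTopology:
    has_amd: bool        # group contains AMD GPUs
    has_nvidia: bool     # group contains NVIDIA GPUs
    p2p_supported: bool  # PCIe peer-to-peer is enabled between the devices


def choose_backend(topo: GroupTopology) -> str:
    """Pick one communication library per parallel group (per the vendor mix)."""
    if topo.has_amd and topo.has_nvidia:
        return "tcp"       # vendor-neutral path for mixed groups
    if topo.has_nvidia:
        return "nccl"      # homogeneous NVIDIA group
    return "rocm_mpi"      # ROCm-aware MPI for AMD-only groups


DDC_THRESHOLD_BYTES = 1 << 20  # assumed cutoff; the paper's value is not stated


def choose_path(topo: GroupTopology, msg_bytes: int) -> str:
    """Decide between direct GPU P2P (DDC) and staging through host memory."""
    if topo.p2p_supported and msg_bytes >= DDC_THRESHOLD_BYTES:
        return "ddc"             # GPU-side engine writes remote memory directly
    return "cpu_forwarding"      # fall back to host-staged transfer
```

The key design point the paper describes is that both decisions happen below the framework API, so the training script never sees which path a given message took.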
Results & Findings
| Setup | Throughput (tokens/s) | % of All‑NVIDIA Baseline | Training Stability |
|---|---|---|---|
| All‑NVIDIA (homogeneous) | 1.00× (reference) | 100 % | No issues |
| CPU‑Forwarding only | 0.71× | 71 % | Stable |
| Device‑Direct Communication | 0.98× | 98 % | Stable, identical loss curves |
- Latency reduction: DDC cuts cross‑vendor transfer latency by ~45 % compared with staging through host memory.
- Scalability: Adding more NICs linearly improves bandwidth up to the point where PCIe becomes the limiting factor.
- Correctness: Numerical results (loss, perplexity) match those of a pure NVIDIA run within floating‑point tolerance, confirming no hidden precision loss.
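The multi-NIC scaling observation above amounts to a simple min() model: aggregate bandwidth grows linearly with NIC count until the host PCIe link saturates. The sketch below illustrates that shape with made-up numbers; the NIC and PCIe figures are not measurements from the paper.

```python
# Toy model of the multi-NIC scaling finding: linear in NIC count,
# capped by the host PCIe link. Numbers below are illustrative only.
def effective_bandwidth_gbps(num_nics: int, nic_gbps: float, pcie_gbps: float) -> float:
    """Aggregate cross-node bandwidth: linear NIC scaling, PCIe-capped."""
    return min(num_nics * nic_gbps, pcie_gbps)


# e.g. 200 Gbps NICs behind a ~512 Gbps PCIe 5.0 x16 host link:
# 1 NIC -> 200 Gbps, 2 NICs -> 400 Gbps, 3 NICs -> PCIe-limited at 512 Gbps
```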
Practical Implications
- Cost‑flexible cluster design: Companies can mix cheaper AMD GPUs with existing NVIDIA inventory without sacrificing most of the performance, extending the life of legacy hardware.
- Vendor‑agnostic cloud offerings: Cloud providers can expose “heterogeneous GPU pools” and let users spin up mixed‑vendor training jobs, improving resource utilization.
- Simplified DevOps: Because the solution lives under the standard `torch.distributed` API, developers do not need to rewrite model code; they only need to configure the backend.
- Future‑proofing for emerging GPUs: The architecture is extensible to other vendors (e.g., Intel Xe GPUs) by adding appropriate backend selectors, making it a reusable building block for next‑gen AI clusters.
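The "configure the backend, not the model" point can be made concrete with a minimal sketch. The backend name `"hetero"` is a hypothetical stand-in for the paper's stack (the summary does not give its registered name); `torch.distributed` does support third-party backends alongside the built-in `nccl`.

```python
# Minimal sketch of how a backend swap stays invisible to model code.
# "hetero" is a hypothetical name for the paper's hybrid stack, not a real backend.
def backend_for_cluster(mixed_vendor: bool) -> str:
    """Select the torch.distributed backend name for the launcher."""
    return "hetero" if mixed_vendor else "nccl"


# In the launcher, the backend string is the only change:
#   import torch.distributed as dist
#   dist.init_process_group(backend=backend_for_cluster(mixed_vendor=True))
#   ... the existing DDP training loop then runs unmodified ...
```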
Limitations & Future Work
- Hardware dependency: DDC currently relies on PCIe peer‑to‑peer support that is not universally enabled on all server motherboards; some configurations may fall back to the slower CPU‑forwarding path.
- Network topology constraints: The reported 98 % throughput assumes a high‑bandwidth, low‑latency interconnect (e.g., 200 Gbps Ethernet). In more modest networks the gains shrink.
- Scalability beyond 8 GPUs per node: The paper focuses on a single mixed‑vendor node; extending the approach to multi‑node heterogeneous clusters will require additional coordination logic.
- Future directions: The authors plan to (a) integrate RDMA‑based DDC for even lower latency, (b) automate backend selection with a learning‑based scheduler, and (c) evaluate the approach on emerging transformer‑style models that use pipeline parallelism.
Authors
- Jon Hu
- Thomas Jia
- Jing Zhu
- Zhendong Yu
Paper Information
- arXiv ID: 2602.18007v1
- Categories: cs.DC
- Published: February 20, 2026