[Paper] Joint Training on AMD and NVIDIA GPUs
Source: arXiv - 2602.18007v1
Overview
The paper tackles a growing pain point for AI teams: training massive language models on clusters that mix AMD and NVIDIA GPUs. By designing communication pathways that let these different GPUs talk to each other efficiently, the authors show that mixed‑vendor setups can approach the performance of an all‑NVIDIA farm—opening the door to more flexible, cost‑effective hardware choices.
Key Contributions
- Compatibility‑oriented “CPU‑Forwarding” communication – a hybrid stack that selects the optimal communication backend per parallel group and leverages multiple NICs for parallel data movement.
- Device‑Direct Communication (DDC) scheme – a novel cross‑vendor peer‑to‑peer (P2P) mechanism that bypasses host memory, offloading the transfer work to the GPUs themselves.
- Implementation in popular LLM training pipelines (LLaMA‑8B, Qwen2‑7B) demonstrating near‑native NVIDIA throughput (up to 98 %).
- Comprehensive evaluation of training stability and correctness across heterogeneous hardware, confirming that the new communication paths do not introduce numerical drift.
Methodology
- System Model – The authors assume a distributed data‑parallel training job spread over several nodes, each node potentially housing a mix of AMD and NVIDIA GPUs.
- CPU‑Forwarding Layer –
  - Each parallel group (e.g., a tensor‑parallel slice) picks the most suitable communication library (ROCm‑aware MPI, NCCL, or standard TCP) based on the vendor mix.
  - Multi‑NIC aggregation splits traffic across the available network interfaces, reducing bottlenecks.
- Device‑Direct Communication (DDC) –
  - Introduces a lightweight GPU‑side P2P engine that can directly read/write remote GPU memory across the PCIe/Infinity Fabric/NVLink boundary.
  - A CPU‑offloading scheduler decides when to invoke DDC versus falling back to CPU‑forwarding, based on message size and topology.
- Integration with Existing Frameworks – The authors plug the new stack into PyTorch's `torch.distributed` backend, keeping the user‑level API unchanged.
- Benchmarking – Experiments run on 8‑GPU nodes (4 AMD + 4 NVIDIA) training LLaMA‑8B and Qwen2‑7B, comparing three setups: (a) all‑NVIDIA, (b) CPU‑forwarding only, (c) full DDC.
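The per-group selection logic above can be sketched as two small decision functions: one that picks a communication library for each parallel group from the options the paper names, and one that chooses between DDC and CPU-forwarding for a given transfer. All names (`GroupTopology`, `choose_backend`, `choose_path`) and the size threshold are illustrative assumptions, not the paper's actual API or policy, which the summary does not spell out.

```python
# Hypothetical sketch of per-group backend selection and the size/topology-based
# DDC vs. CPU-forwarding decision. Names and threshold are illustrative only.
from dataclasses import dataclass


@dataclass
class GroupTopology:
    has_amd: bool        # group contains AMD GPUs
    has_nvidia: bool     # group contains NVIDIA GPUs
    p2p_supported: bool  # PCIe peer-to-peer is enabled between the devices


def choose_backend(topo: GroupTopology) -> str:
    """Pick one communication library per parallel group (per the vendor mix)."""
    if topo.has_amd and topo.has_nvidia:
        return "tcp"       # vendor-neutral path for mixed groups
    if topo.has_nvidia:
        return "nccl"      # homogeneous NVIDIA group
    return "rocm_mpi"      # ROCm-aware MPI for AMD-only groups


DDC_THRESHOLD_BYTES = 1 << 20  # assumed cutoff; the paper's value is not stated


def choose_path(topo: GroupTopology, msg_bytes: int) -> str:
    """Decide between direct GPU P2P (DDC) and staging through host memory."""
    if topo.p2p_supported and msg_bytes >= DDC_THRESHOLD_BYTES:
        return "ddc"             # GPU-side engine writes remote memory directly
    return "cpu_forwarding"      # fall back to host-staged transfer
```

The key design point the paper describes is that both decisions happen below the framework API, so the training script never sees which path a given message took.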
Results & Findings
| Setup | Throughput (tokens/s) | % of All‑NVIDIA Baseline | Training Stability |
|---|---|---|---|
| All‑NVIDIA (homogeneous) | 1.00× (reference) | 100 % | No issues |
| CPU‑Forwarding only | 0.71× | 71 % | Stable |
| Device‑Direct Communication | 0.98× | 98 % | Stable, identical loss curves |
- Latency reduction: DDC cuts cross‑vendor transfer latency by ~45 % compared with staging through host memory.
- Scalability: Adding more NICs linearly improves bandwidth up to the point where PCIe becomes the limiting factor.
- Correctness: Numerical results (loss, perplexity) match those of a pure NVIDIA run within floating‑point tolerance, confirming no hidden precision loss.
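The multi-NIC scaling observation above amounts to a simple min() model: aggregate bandwidth grows linearly with NIC count until the host PCIe link saturates. The sketch below illustrates that shape with made-up numbers; the NIC and PCIe figures are not measurements from the paper.

```python
# Toy model of the multi-NIC scaling finding: linear in NIC count,
# capped by the host PCIe link. Numbers below are illustrative only.
def effective_bandwidth_gbps(num_nics: int, nic_gbps: float, pcie_gbps: float) -> float:
    """Aggregate cross-node bandwidth: linear NIC scaling, PCIe-capped."""
    return min(num_nics * nic_gbps, pcie_gbps)


# e.g. 200 Gbps NICs behind a ~512 Gbps PCIe 5.0 x16 host link:
# 1 NIC -> 200 Gbps, 2 NICs -> 400 Gbps, 3 NICs -> PCIe-limited at 512 Gbps
```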
Practical Implications
- Cost‑flexible cluster design: Companies can mix cheaper AMD GPUs with existing NVIDIA inventory without sacrificing most of the performance, extending the life of legacy hardware.
- Vendor‑agnostic cloud offerings: Cloud providers can expose “heterogeneous GPU pools” and let users spin up mixed‑vendor training jobs, improving resource utilization.
- Simplified DevOps: Because the solution lives under the standard `torch.distributed` API, developers do not need to rewrite model code; they only need to configure the backend.
- Future‑proofing for emerging GPUs: The architecture is extensible to other vendors (e.g., Intel Xe GPUs) by adding appropriate backend selectors, making it a reusable building block for next‑gen AI clusters.
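The "configure the backend, not the model" point can be made concrete with a minimal sketch. The backend name `"hetero"` is a hypothetical stand-in for the paper's stack (the summary does not give its registered name); `torch.distributed` does support third-party backends alongside the built-in `nccl`.

```python
# Minimal sketch of how a backend swap stays invisible to model code.
# "hetero" is a hypothetical name for the paper's hybrid stack, not a real backend.
def backend_for_cluster(mixed_vendor: bool) -> str:
    """Select the torch.distributed backend name for the launcher."""
    return "hetero" if mixed_vendor else "nccl"


# In the launcher, the backend string is the only change:
#   import torch.distributed as dist
#   dist.init_process_group(backend=backend_for_cluster(mixed_vendor=True))
#   ... the existing DDP training loop then runs unmodified ...
```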
Limitations & Future Work
- Hardware dependency: DDC currently relies on PCIe peer‑to‑peer support that is not universally enabled on all server motherboards; some configurations may fall back to the slower CPU‑forwarding path.
- Network topology constraints: The reported 98 % throughput assumes a high‑bandwidth, low‑latency interconnect (e.g., 200 Gbps Ethernet). In more modest networks the gains shrink.
- Scalability beyond 8 GPUs per node: The paper focuses on a single mixed‑vendor node; extending the approach to multi‑node heterogeneous clusters will require additional coordination logic.
- Future directions: The authors plan to (a) integrate RDMA‑based DDC for even lower latency, (b) automate backend selection with a learning‑based scheduler, and (c) evaluate the approach on emerging transformer‑style models that use pipeline parallelism.
Authors
- Jon Hu
- Thomas Jia
- Jing Zhu
- Zhendong Yu
Paper Information
- arXiv ID: 2602.18007v1
- Categories: cs.DC
- Published: February 20, 2026