AI on Multiple GPUs: How GPUs Communicate
Source: Towards Data Science
Series: Distributed AI Across Multiple GPUs
- Part 1: Understanding the Host and Device Paradigm
- Part 2: Point‑to‑Point and Collective Operations
- Part 3: How GPUs Communicate — (this article)
- Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP) — coming soon
- Part 5: ZeRO — coming soon
- Part 6: Tensor Parallelism — coming soon
Introduction
Before diving into advanced parallelism techniques, we need to understand the key technologies that enable GPUs to communicate with each other.
Why do GPUs need to communicate?
When training AI models across multiple GPUs, each GPU processes a different data batch, but they must stay synchronized. This synchronization is achieved by:
- Sharing gradients during back‑propagation.
- Exchanging model weights (e.g., for model parallelism or checkpointing).
What gets communicated and when depends on your parallelism strategy—topics we’ll explore in depth in upcoming blog posts.
For now, keep in mind that modern AI training is communication‑intensive, making efficient GPU‑to‑GPU data transfer critical for performance.
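The gradient-sharing step above can be sketched in plain Python. The sketch below only simulates what a gradient all-reduce accomplishes (no real GPUs or communication library involved; the worker count and gradient values are invented for illustration): each worker contributes its local gradient, and every worker ends up holding the same averaged result, keeping the model replicas in sync.

```python
# Toy simulation of gradient averaging across GPUs.
# In a real framework this is a single collective call (an all-reduce).

def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers; every worker
    receives an identical copy of the averaged result."""
    num_workers = len(per_worker_grads)
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    averaged = [s / num_workers for s in summed]
    return [averaged[:] for _ in per_worker_grads]

# Four workers, each with a different local gradient for the same 3 parameters.
local_grads = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [0.0, 2.0, 4.0],
]
synced = all_reduce_mean(local_grads)
print(synced[0])  # every worker now holds [1.5, 2.0, 2.5]
```

In a real training loop this averaging happens once per step, which is exactly why the interconnects described next matter so much.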
The Communication Stack
PCIe
PCIe (Peripheral Component Interconnect Express) connects expansion cards—such as GPUs—to the motherboard using independent point‑to‑point serial lanes. Below is the bandwidth a single GPU gets when it uses a full 16‑lane slot for each PCIe generation:
| Generation | Lanes | Bandwidth (per direction) |
|---|---|---|
| Gen 4 | x16 | ~32 GB/s |
| Gen 5 | x16 | ~64 GB/s |
| Gen 6 | x16 | ~128 GB/s ¹ |
¹ 16 lanes × 8 GB/s per lane = 128 GB/s.
High‑end server CPUs typically expose 128 PCIe lanes. Since a modern GPU needs 16 lanes for optimal throughput, a server can comfortably host 8 GPUs (128 ÷ 16 = 8). Power consumption and chassis space also make it impractical to exceed eight GPUs in a single node.
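The numbers above follow directly from the per-lane rate, which roughly doubles each generation (~2 GB/s per lane at Gen 4, ignoring encoding overhead). A quick sanity check in Python, including the eight-GPU lane budget:

```python
# Approximate per-direction PCIe bandwidth: ~2 GB/s per lane at Gen 4,
# doubling with each generation (protocol overhead ignored).
GEN4_PER_LANE_GBPS = 2

def x16_bandwidth(generation):
    """Approximate per-direction bandwidth of a full x16 slot, Gen 4+."""
    per_lane = GEN4_PER_LANE_GBPS * 2 ** (generation - 4)
    return 16 * per_lane

for gen in (4, 5, 6):
    print(f"Gen {gen} x16: ~{x16_bandwidth(gen)} GB/s per direction")

# A 128-lane server CPU can feed eight x16 GPUs.
print(128 // 16)  # 8
```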
NVLink
NVLink is NVIDIA’s proprietary interconnect that lets GPUs talk directly to each other—bypassing the CPU and PCIe. It creates a high‑bandwidth memory‑to‑memory pathway between GPUs.
| NVLink version | GPU (example) | Bandwidth per GPU |
|---|---|---|
| NVLink 3 | A100 | ~600 GB/s |
| NVLink 4 | H100 | ~900 GB/s |
| NVLink 5 | Blackwell | up to 1.8 TB/s |

NVLink for CPU‑GPU communication
Some CPU architectures can replace PCIe with NVLink, dramatically accelerating data movement between CPU and GPU (e.g., moving training batches). This makes CPU‑offloading—storing intermediate data in system RAM instead of VRAM—practical for real‑world AI workloads. Because RAM scales far more cheaply than VRAM, the approach yields significant cost savings.
CPUs that support NVLink include IBM POWER8, POWER9, and NVIDIA Grace.
Note: In an 8‑GPU H100 server each GPU must talk to the other seven GPUs. The 900 GB/s per‑GPU bandwidth is therefore split across seven point‑to‑point links (~128 GB/s each). NVSwitch solves this limitation.
NVSwitch
NVSwitch is a central hub that dynamically routes traffic between GPUs. With NVSwitch, every Hopper GPU can simultaneously communicate with all other GPUs at full NVLink speed (900 GB/s), making the fabric non‑blocking.
- Intra‑node: all GPUs in a server communicate through the switch at full NVLink bandwidth.
- Inter‑node: with the NVLink Switch System, up to 256 GPUs across multiple nodes can be linked with near‑local NVLink performance, forming large GPU clusters.
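The gain from a switched fabric can be quantified with the numbers already given (900 GB/s per H100, 8 GPUs per node); the arithmetic below just restates the note from the NVLink section next to the NVSwitch case:

```python
# Per-peer bandwidth in an 8-GPU H100 node, with and without NVSwitch.
TOTAL_NVLINK_GBPS = 900  # per-GPU NVLink 4 bandwidth (H100)
GPUS_PER_NODE = 8
peers = GPUS_PER_NODE - 1

# Direct point-to-point links: the 900 GB/s is statically split 7 ways.
point_to_point = TOTAL_NVLINK_GBPS / peers

# NVSwitch fabric is non-blocking: any pair can use the full bandwidth.
switched = TOTAL_NVLINK_GBPS

print(f"Direct links: ~{int(point_to_point)} GB/s per peer")  # ~128 GB/s
print(f"Via NVSwitch: {switched} GB/s to any peer")
```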
NVSwitch generations
| Generation | Supported GPUs | Key features |
|---|---|---|
| 1st | Up to 16 (Tesla V100) | Baseline NVSwitch fabric |
| 2nd | Up to 16 | Higher bandwidth, lower latency |
| 3rd | Up to 256 (H100) | Designed for Hopper GPUs, maximal scalability |
InfiniBand
InfiniBand provides inter‑node communication. It is slower—and cheaper—than NVSwitch but is the workhorse for scaling to thousands of GPUs in data centers. Modern InfiniBand adapters support NVIDIA GPUDirect® RDMA, allowing the network card to read/write GPU memory directly without involving the CPU or host RAM.
| InfiniBand speed | Link rate | Bandwidth per port |
|---|---|---|
| HDR (High Data Rate) | 200 Gb/s | ~25 GB/s |
| NDR (Next Data Rate) | 400 Gb/s | ~50 GB/s |
| XDR (eXtreme Data Rate) | 800 Gb/s | ~100 GB/s |
These rates are lower than intra‑node NVLink because of network‑protocol overhead and the need for two PCIe traversals (one at the sender, one at the receiver).
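The per-port figures are just the raw link rates converted from gigabits to gigabytes (divide by 8; the rates below are the standard per-generation InfiniBand link speeds, ignoring encoding overhead):

```python
# Convert InfiniBand link rates from Gb/s (how they are quoted)
# to GB/s (how GPU interconnect bandwidth is usually discussed).

def gbit_to_gbyte(rate_gbps):
    return rate_gbps / 8

# Raw link rates per port, in Gb/s, for recent InfiniBand generations.
link_rates = {"HDR": 200, "NDR": 400, "XDR": 800}
for name, rate in link_rates.items():
    print(f"{name}: {rate} Gb/s ≈ {gbit_to_gbyte(rate):.0f} GB/s per port")
```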
Key Design Principles
Understanding Linear Scaling
Linear scaling is the “holy grail” of distributed computing. In simple terms, it means that doubling the number of GPUs should double throughput and halve training time. This ideal occurs when communication overhead is negligible compared to computation time, allowing each GPU to run at full capacity.
In practice, perfect linear scaling is rare for AI workloads because:
- Communication requirements grow with the number of devices.
- Achieving complete compute‑communication overlap is usually impossible (see next section).
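A toy cost model makes the deviation from linear scaling concrete. In the sketch below, per-step compute time is fixed while exposed communication time grows with the number of peers; the specific constants (1 s of compute, 0.02 s of communication per peer) are invented purely for illustration:

```python
# Toy scaling model: how exposed communication erodes linear scaling.

def speedup(num_gpus, compute_s=1.0, comm_per_peer_s=0.02):
    """Speedup over a single GPU when each training step costs a fixed
    compute time plus communication time that grows with the peer count."""
    step_time = compute_s + comm_per_peer_s * (num_gpus - 1)
    # Aggregate throughput: num_gpus batches per step_time, versus
    # 1 batch per compute_s on a single GPU.
    return num_gpus * compute_s / step_time

for n in (1, 2, 4, 8):
    print(f"{n} GPUs: {speedup(n):.2f}x speedup (ideal: {n}x)")
```

With these made-up constants, 8 GPUs yield roughly a 7x speedup rather than 8x; as the comm-per-peer term grows, the gap from ideal widens.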
The Importance of Compute‑Communication Overlap
If a GPU sits idle waiting for data to be transferred before it can start processing, resources are wasted. To maximize efficiency, communication operations should overlap with computation as much as possible. When overlap cannot be achieved, the communication is referred to as an exposed operation.
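This can be expressed with a simple two-term model (the times below are illustrative, not measured): without overlap the step time is compute plus communication, with perfect overlap it is the maximum of the two, and any communication that cannot be hidden behind compute is the exposed portion.

```python
# Step time with and without compute-communication overlap.

def step_time(compute_s, comm_s, overlap=False):
    if overlap:
        # Perfectly overlapped: communication hides behind compute;
        # only the excess (if any) remains exposed.
        return max(compute_s, comm_s)
    return compute_s + comm_s  # fully exposed communication

compute, comm = 10.0, 4.0
print(step_time(compute, comm))                # 14.0 s: all 4 s exposed
print(step_time(compute, comm, overlap=True))  # 10.0 s: comm fully hidden
print(max(0.0, comm - compute))                # 0.0 s exposed under overlap
```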
Intra‑Node vs. Inter‑Node: The Performance Cliff
| Scope | Typical GPU count | Communication characteristics | Scaling behavior |
|---|---|---|---|
| Intra‑node | Up to 8 GPUs (server‑grade motherboards) | High‑bandwidth, low‑latency links (NVLink, PCIe) | Near‑linear scaling is often achievable. |
| Inter‑node | > 8 GPUs across multiple servers (InfiniBand, Ethernet) | Lower bandwidth, higher latency, protocol overhead | Significant performance degradation; each GPU must coordinate with more peers, leading to more idle time. |
Takeaway: While intra‑node communication can keep scaling efficient, crossing the node boundary introduces a steep performance cliff due to slower inter‑node links and increased coordination overhead. Planning your topology to stay within the intra‑node sweet spot—or mitigating inter‑node costs with techniques like gradient compression, pipeline parallelism, or optimized collective algorithms—is essential for maintaining high efficiency at scale.
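The cliff can be illustrated by timing the same transfer over both link classes, using bandwidth figures from the tables above (the 10 GB payload is an arbitrary example, e.g. a shard of gradients):

```python
# Same message, intra-node (NVLink) vs. inter-node (InfiniBand NDR).
MESSAGE_GB = 10  # arbitrary illustrative payload

def transfer_time_s(size_gb, bandwidth_gb_per_s):
    return size_gb / bandwidth_gb_per_s

nvlink = transfer_time_s(MESSAGE_GB, 900)  # H100 NVLink, GB/s
ib_ndr = transfer_time_s(MESSAGE_GB, 50)   # InfiniBand NDR, GB/s

print(f"NVLink: {nvlink * 1000:.1f} ms")          # ~11.1 ms
print(f"InfiniBand NDR: {ib_ndr * 1000:.0f} ms")  # 200 ms, 18x slower
```

Latency and protocol overhead widen this gap further in practice; the raw bandwidth ratio alone is already 18x.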
Conclusion
Follow me on X for more free AI content: @l_cesconetto
Congratulations on making it to the end! In this post you learned about:
- The fundamentals of CPU‑GPU and GPU‑GPU communication
- PCIe, NVLink, NVSwitch, and InfiniBand
- Key design principles for distributed GPU computing
- How to make more informed decisions when designing your AI workloads
In the next blog post, we’ll dive into our first parallelism technique: Distributed Data Parallelism.