AI on Multiple GPUs: How GPUs Communicate
Source: Towards Data Science
Series: Distributed AI Across Multiple GPUs
- Part 1: Understanding the Host and Device Paradigm
- Part 2: Point‑to‑Point and Collective Operations
- Part 3: How GPUs Communicate — (this article)
- Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP) — coming soon
- Part 5: ZeRO — coming soon
- Part 6: Tensor Parallelism — coming soon
Introduction
Before diving into advanced parallelism techniques, we need to understand the key technologies that enable GPUs to communicate with each other.
Why do GPUs need to communicate?
When training AI models across multiple GPUs, each GPU processes a different data batch, but they must stay synchronized. This synchronization is achieved by:
- Sharing gradients during back‑propagation.
- Exchanging model weights (e.g., for model parallelism or checkpointing).
What gets communicated and when depends on your parallelism strategy—topics we’ll explore in depth in upcoming blog posts.
For now, keep in mind that modern AI training is communication‑intensive, making efficient GPU‑to‑GPU data transfer critical for performance.
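The gradient-sharing step above can be sketched in plain Python. The sketch below only simulates what a gradient all-reduce accomplishes (no real GPUs or communication library involved; the worker count and gradient values are invented for illustration): each worker contributes its local gradient, and every worker ends up holding the same averaged result, keeping the model replicas in sync.

```python
# Toy simulation of gradient averaging across GPUs.
# In a real framework this is a single collective call (an all-reduce).

def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers; every worker
    receives an identical copy of the averaged result."""
    num_workers = len(per_worker_grads)
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    averaged = [s / num_workers for s in summed]
    return [averaged[:] for _ in per_worker_grads]

# Four workers, each with a different local gradient for the same 3 parameters.
local_grads = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [0.0, 2.0, 4.0],
]
synced = all_reduce_mean(local_grads)
print(synced[0])  # every worker now holds [1.5, 2.0, 2.5]
```

In a real training loop this averaging happens once per step, which is exactly why the interconnects described next matter so much.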
The Communication Stack
PCIe
PCIe (Peripheral Component Interconnect Express) connects expansion cards—such as GPUs—to the motherboard using independent point‑to‑point serial lanes. Below is the bandwidth a single GPU gets when it uses a full 16‑lane slot for each PCIe generation:
| Generation | Lanes | Bandwidth (per direction) |
|---|---|---|
| Gen 4 | x16 | ~32 GB/s |
| Gen 5 | x16 | ~64 GB/s |
| Gen 6 | x16 | ~128 GB/s ¹ |
¹ 16 lanes × 8 GB/s per lane = 128 GB/s.
High‑end server CPUs typically expose 128 PCIe lanes. Since a modern GPU needs 16 lanes for optimal throughput, a server can comfortably host 8 GPUs (128 ÷ 16 = 8). Power consumption and chassis space also make it impractical to exceed eight GPUs in a single node.
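The numbers above follow directly from the per-lane rate, which roughly doubles each generation (~2 GB/s per lane at Gen 4, ignoring encoding overhead). A quick sanity check in Python, including the eight-GPU lane budget:

```python
# Approximate per-direction PCIe bandwidth: ~2 GB/s per lane at Gen 4,
# doubling with each generation (protocol overhead ignored).
GEN4_PER_LANE_GBPS = 2

def x16_bandwidth(generation):
    """Approximate per-direction bandwidth of a full x16 slot, Gen 4+."""
    per_lane = GEN4_PER_LANE_GBPS * 2 ** (generation - 4)
    return 16 * per_lane

for gen in (4, 5, 6):
    print(f"Gen {gen} x16: ~{x16_bandwidth(gen)} GB/s per direction")

# A 128-lane server CPU can feed eight x16 GPUs.
print(128 // 16)  # 8
```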
NVLink
NVLink is NVIDIA’s proprietary interconnect that lets GPUs talk directly to each other—bypassing the CPU and PCIe. It creates a high‑bandwidth memory‑to‑memory pathway between GPUs.
| NVLink version | GPU (example) | Bandwidth per GPU |
|---|---|---|
| NVLink 3 | A100 | ~600 GB/s |
| NVLink 4 | H100 | ~900 GB/s |
| NVLink 5 | Blackwell | up to 1.8 TB/s |

NVLink for CPU‑GPU communication
Some CPU architectures can replace PCIe with NVLink, dramatically accelerating data movement between CPU and GPU (e.g., moving training batches). This makes CPU‑offloading—storing intermediate data in system RAM instead of VRAM—practical for real‑world AI workloads. Because RAM scales far more cheaply than VRAM, the approach yields significant cost savings.
CPUs that support NVLink include IBM POWER8, POWER9, and NVIDIA Grace.
Note: In an 8‑GPU H100 server each GPU must talk to the other seven GPUs. The 900 GB/s per‑GPU bandwidth is therefore split across seven point‑to‑point links (~128 GB/s each). NVSwitch solves this limitation.
NVSwitch
NVSwitch is a central hub that dynamically routes traffic between GPUs. With NVSwitch, every Hopper GPU can simultaneously communicate with all other GPUs at full NVLink speed (900 GB/s), making the fabric non‑blocking.
- Intra‑node: all GPUs in a server communicate through the switch at full NVLink bandwidth.
- Inter‑node: with the NVLink Switch System, up to 256 GPUs across multiple nodes can be linked with near‑local NVLink performance, forming large GPU clusters.
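The gain from a switched fabric can be quantified with the numbers already given (900 GB/s per H100, 8 GPUs per node); the arithmetic below just restates the note from the NVLink section next to the NVSwitch case:

```python
# Per-peer bandwidth in an 8-GPU H100 node, with and without NVSwitch.
TOTAL_NVLINK_GBPS = 900  # per-GPU NVLink 4 bandwidth (H100)
GPUS_PER_NODE = 8
peers = GPUS_PER_NODE - 1

# Direct point-to-point links: the 900 GB/s is statically split 7 ways.
point_to_point = TOTAL_NVLINK_GBPS / peers

# NVSwitch fabric is non-blocking: any pair can use the full bandwidth.
switched = TOTAL_NVLINK_GBPS

print(f"Direct links: ~{int(point_to_point)} GB/s per peer")  # ~128 GB/s
print(f"Via NVSwitch: {switched} GB/s to any peer")
```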
NVSwitch generations
| Generation | Supported GPUs | Key features |
|---|---|---|
| 1st | Up to 16 (Tesla V100) | Baseline NVSwitch fabric |
| 2nd | Up to 16 | Higher bandwidth, lower latency |
| 3rd | Up to 256 (H100) | Designed for Hopper GPUs, maximal scalability |
InfiniBand
InfiniBand provides inter‑node communication. It is slower—and cheaper—than NVSwitch but is the workhorse for scaling to thousands of GPUs in data centers. Modern InfiniBand adapters support NVIDIA GPUDirect® RDMA, allowing the network card to read/write GPU memory directly without involving the CPU or host RAM.
| InfiniBand speed | Link rate | Bandwidth per port |
|---|---|---|
| HDR (High Data Rate) | 200 Gb/s | ~25 GB/s |
| NDR (Next Data Rate) | 400 Gb/s | ~50 GB/s |
| XDR (eXtreme Data Rate) | 800 Gb/s | ~100 GB/s |
These rates are lower than intra‑node NVLink because of network‑protocol overhead and the need for two PCIe traversals (one at the sender, one at the receiver).
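The per-port figures are just the raw link rates converted from gigabits to gigabytes (divide by 8; the rates below are the standard per-generation InfiniBand link speeds, ignoring encoding overhead):

```python
# Convert InfiniBand link rates from Gb/s (how they are quoted)
# to GB/s (how GPU interconnect bandwidth is usually discussed).

def gbit_to_gbyte(rate_gbps):
    return rate_gbps / 8

# Raw link rates per port, in Gb/s, for recent InfiniBand generations.
link_rates = {"HDR": 200, "NDR": 400, "XDR": 800}
for name, rate in link_rates.items():
    print(f"{name}: {rate} Gb/s ≈ {gbit_to_gbyte(rate):.0f} GB/s per port")
```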
Key Design Principles
Understanding Linear Scaling
Linear scaling is the “holy grail” of distributed computing. In simple terms, it means that doubling the number of GPUs should double throughput and halve training time. This ideal occurs when communication overhead is negligible compared to computation time, allowing each GPU to run at full capacity.
In practice, perfect linear scaling is rare for AI workloads because:
- Communication requirements grow with the number of devices.
- Achieving complete compute‑communication overlap is usually impossible (see next section).
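A toy cost model makes the deviation from linear scaling concrete. In the sketch below, per-step compute time is fixed while exposed communication time grows with the number of peers; the specific constants (1 s of compute, 0.02 s of communication per peer) are invented purely for illustration:

```python
# Toy scaling model: how exposed communication erodes linear scaling.

def speedup(num_gpus, compute_s=1.0, comm_per_peer_s=0.02):
    """Speedup over a single GPU when each training step costs a fixed
    compute time plus communication time that grows with the peer count."""
    step_time = compute_s + comm_per_peer_s * (num_gpus - 1)
    # Aggregate throughput: num_gpus batches per step_time, versus
    # 1 batch per compute_s on a single GPU.
    return num_gpus * compute_s / step_time

for n in (1, 2, 4, 8):
    print(f"{n} GPUs: {speedup(n):.2f}x speedup (ideal: {n}x)")
```

With these made-up constants, 8 GPUs yield roughly a 7x speedup rather than 8x; as the comm-per-peer term grows, the gap from ideal widens.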
The Importance of Compute‑Communication Overlap
If a GPU sits idle waiting for data to be transferred before it can start processing, resources are wasted. To maximize efficiency, communication operations should overlap with computation as much as possible. When overlap cannot be achieved, the communication is referred to as an exposed operation.
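This can be expressed with a simple two-term model (the times below are illustrative, not measured): without overlap the step time is compute plus communication, with perfect overlap it is the maximum of the two, and any communication that cannot be hidden behind compute is the exposed portion.

```python
# Step time with and without compute-communication overlap.

def step_time(compute_s, comm_s, overlap=False):
    if overlap:
        # Perfectly overlapped: communication hides behind compute;
        # only the excess (if any) remains exposed.
        return max(compute_s, comm_s)
    return compute_s + comm_s  # fully exposed communication

compute, comm = 10.0, 4.0
print(step_time(compute, comm))                # 14.0 s: all 4 s exposed
print(step_time(compute, comm, overlap=True))  # 10.0 s: comm fully hidden
print(max(0.0, comm - compute))                # 0.0 s exposed under overlap
```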
Intra‑Node vs. Inter‑Node: The Performance Cliff
| Scope | Typical GPU count | Communication characteristics | Scaling behavior |
|---|---|---|---|
| Intra‑node | Up to 8 GPUs (server‑grade motherboards) | High‑bandwidth, low‑latency links (NVLink, PCIe) | Near‑linear scaling is often achievable. |
| Inter‑node | > 8 GPUs across multiple servers (InfiniBand, Ethernet) | Lower bandwidth, higher latency, protocol overhead | Significant performance degradation; each GPU must coordinate with more peers, leading to more idle time. |
Takeaway: While intra‑node communication can keep scaling efficient, crossing the node boundary introduces a steep performance cliff due to slower inter‑node links and increased coordination overhead. Planning your topology to stay within the intra‑node sweet spot—or mitigating inter‑node costs with techniques like gradient compression, pipeline parallelism, or optimized collective algorithms—is essential for maintaining high efficiency at scale.
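The cliff can be illustrated by timing the same transfer over both link classes, using bandwidth figures from the tables above (the 10 GB payload is an arbitrary example, e.g. a shard of gradients):

```python
# Same message, intra-node (NVLink) vs. inter-node (InfiniBand NDR).
MESSAGE_GB = 10  # arbitrary illustrative payload

def transfer_time_s(size_gb, bandwidth_gb_per_s):
    return size_gb / bandwidth_gb_per_s

nvlink = transfer_time_s(MESSAGE_GB, 900)  # H100 NVLink, GB/s
ib_ndr = transfer_time_s(MESSAGE_GB, 50)   # InfiniBand NDR, GB/s

print(f"NVLink: {nvlink * 1000:.1f} ms")          # ~11.1 ms
print(f"InfiniBand NDR: {ib_ndr * 1000:.0f} ms")  # 200 ms, 18x slower
```

Latency and protocol overhead widen this gap further in practice; the raw bandwidth ratio alone is already 18x.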
Conclusion
Follow me on X for more free AI content: @l_cesconetto
Congratulations on making it to the end! In this post you learned about:
- The fundamentals of CPU‑GPU and GPU‑GPU communication
- PCIe, NVLink, NVSwitch, and InfiniBand
- Key design principles for distributed GPU computing
- How to make more informed decisions when designing your AI workloads
In the next blog post, we’ll dive into our first parallelism technique: Distributed Data Parallelism.