AI in Multiple GPUs: Understanding the Host and Device Paradigm
Source: Towards Data Science
- Part 1: Understanding the Host and Device Paradigm — this article
- Part 2: Point‑to‑Point and Collective Operations — coming soon
- Part 3: How GPUs Communicate — coming soon
- Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP) — coming soon
- Part 5: ZeRO — coming soon
- Part 6: Tensor Parallelism — coming soon
Introduction
This guide explains the foundational concepts of how a CPU and a discrete graphics card (GPU) work together. It provides a high‑level overview to help you build a mental model of the host‑device paradigm, focusing specifically on NVIDIA GPUs, which are the most commonly used for AI workloads.
Note: Integrated GPUs (e.g., those in Apple Silicon chips) have a different architecture and are not covered in this post.
The Big Picture: Host vs. Device
The most important concept to grasp is the relationship between the Host and the Device.
| Component | What it is | Role |
|---|---|---|
| Host | Your CPU | Runs the operating system and executes your Python script line‑by‑line. It is the commander that controls the overall logic and tells the Device what to do. |
| Device | Your GPU | A powerful, specialized coprocessor designed for massively parallel computations. It is the accelerator that does nothing until the Host assigns it a task. |
- Your program always starts on the CPU.
- When you want the GPU to perform a task (e.g., multiplying two large matrices), the CPU sends the instructions and the data over to the GPU.
Understanding this host‑device interaction is the foundation for effective GPU programming.
The CPU‑GPU Interaction
The host (CPU) communicates with the device (GPU) through a queuing system.
- CPU initiates commands – When your script (running on the CPU) reaches a line intended for the GPU (e.g., `tensor.to('cuda')`), it creates a command for the GPU.
- Commands are queued – The CPU does not wait; it places the command onto a special to‑do list for the GPU called a CUDA stream (more on this in the next section).
- Asynchronous execution – The CPU continues executing subsequent lines of code while the GPU works on the queued operation. This asynchronous execution is essential for high performance, allowing the CPU to prepare the next batch of data or perform other tasks while the GPU crunches numbers.
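The queuing behavior described above can be mimicked with a toy producer/consumer sketch in plain Python. This models only the idea — real CUDA queuing happens inside the driver, not in Python — but it shows how the "host" keeps running while the "device" drains its queue:

```python
import queue
import threading

# Toy model of the host-device relationship (illustrative only).
command_queue = queue.Queue()
results = []

def gpu_worker():
    # The "device": drains commands from its queue in submission order.
    while True:
        command = command_queue.get()
        if command is None:  # sentinel: no more work
            break
        results.append(command())  # execute the queued operation

worker = threading.Thread(target=gpu_worker)
worker.start()

# The "host": enqueues work and immediately moves on.
command_queue.put(lambda: 2 + 2)
command_queue.put(lambda: 3 * 3)
print("Host is free to run other code while the worker computes...")

# An explicit synchronization point: wait for all queued work to finish.
command_queue.put(None)
worker.join()
print(results)  # [4, 9]
```

The `worker.join()` call plays the role of `torch.cuda.synchronize()`: it is the one place where the host deliberately blocks until the queue is empty.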
CUDA Streams
A CUDA Stream is an ordered queue of GPU operations. Operations submitted to a single stream execute in order, one after another. Operations in different streams, however, can run concurrently, allowing the GPU to juggle multiple independent workloads at the same time.
By default, every PyTorch GPU operation is enqueued on the current active stream (usually the default stream that PyTorch creates automatically). This is simple and predictable: each operation waits for the previous one to finish before starting. For most code you never notice this, but it can leave performance on the table when you have work that could overlap.
Multiple Streams: Concurrency
The classic use case for multiple streams is overlapping computation with data transfers. While the GPU processes batch N, you can simultaneously copy batch N + 1 from CPU RAM to GPU VRAM:
Stream 0 (compute): [process batch 0]────[process batch 1]───
Stream 1 (transfer): ────[copy batch 1]────[copy batch 2]───
This pipeline works because compute and data transfer use separate hardware units inside the GPU, enabling true parallelism.
In PyTorch you create streams and schedule work onto them with context managers:
import torch
compute_stream = torch.cuda.Stream()
transfer_stream = torch.cuda.Stream()
# -------------------------------------------------
# Transfer work (runs on `transfer_stream`)
# -------------------------------------------------
with torch.cuda.stream(transfer_stream):
    # `non_blocking=True` lets the copy proceed asynchronously
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)

# -------------------------------------------------
# Compute work (runs on `compute_stream`)
# -------------------------------------------------
with torch.cuda.stream(compute_stream):
    # This runs concurrently with the transfer above
    output = model(current_batch)
Tip: The `non_blocking=True` flag on `.to()` is essential. Without it the copy blocks the CPU thread even though you intend it to run asynchronously. Note that an asynchronous host‑to‑device copy also requires the source CPU tensor to live in pinned (page‑locked) memory, e.g. via `.pin_memory()`; otherwise the copy quietly degrades to a synchronous one.
Synchronization Between Streams
Since streams are independent, you must explicitly signal when one depends on another.
- Global synchronization (blunt tool):

torch.cuda.synchronize()  # Waits for *all* streams on the device to finish

- Fine‑grained synchronization with CUDA Events (surgical tool). An event marks a specific point in a stream; another stream can wait on that event without halting the CPU thread.

event = torch.cuda.Event()

# -------------------------------------------------
# Transfer stream: copy data and record an event
# -------------------------------------------------
with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)
    event.record()  # Mark: transfer is done

# -------------------------------------------------
# Compute stream: wait for the transfer before using the data
# -------------------------------------------------
with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(event)  # Stall only this stream on the GPU
    output = model(next_batch)
Using events is more efficient than `stream.synchronize()` because only the dependent stream stalls on the GPU side; the CPU thread remains free to continue queuing work.
When Do You Need to Manage Streams Manually?
For typical PyTorch training loops you rarely have to touch streams directly. However, many high‑level utilities—such as DataLoader(pin_memory=True) and custom pre‑fetching pipelines—leverage this mechanism under the hood. Understanding streams helps you:
- Recognize why those settings exist.
- Diagnose subtle performance bottlenecks when they appear.
- Build advanced pipelines that overlap computation, data movement, and even kernel launches for maximal throughput.
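As a rough sketch of what utilities like `DataLoader(pin_memory=True)` set up for you — the dataset shapes and batch size here are invented for illustration, and the snippet falls back to the CPU when no GPU is available:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Pinned (page-locked) host buffers are what make non_blocking copies
# actually asynchronous. CPU fallback so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(
    dataset,
    batch_size=16,
    pin_memory=torch.cuda.is_available(),  # pinned buffers only help with a GPU
)

for inputs, labels in loader:
    # non_blocking=True only overlaps with compute when the source is pinned
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass would go here ...

print(inputs.shape)  # torch.Size([16, 10])
```

With `pin_memory=True`, the DataLoader hands you batches already sitting in pinned host memory, so the `.to(device, non_blocking=True)` copies can overlap with the previous batch's compute — exactly the two-stream pipeline sketched earlier.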
PyTorch Tensors
PyTorch is a powerful framework that abstracts away many details, but this abstraction can sometimes obscure what is happening under the hood.
When you create a PyTorch tensor, it consists of two parts:
- Metadata – shape, data type, device, etc.
- Numerical data – the actual values stored on the device.
t = torch.randn(100, 100, device=device)
- The metadata lives in the host’s RAM.
- The data lives in the device’s memory (e.g., GPU VRAM).
Why the distinction matters
- `print(t.shape)` – The CPU can retrieve the shape instantly because the metadata is already in RAM.
- `print(t)` – To display the tensor’s contents, PyTorch must transfer the data from VRAM to the host, which incurs a device‑to‑host copy and can be costly for large tensors.
Host‑Device Synchronization
Accessing GPU data from the CPU triggers a host‑device synchronization, a common performance bottleneck. This happens whenever the CPU needs a result that the GPU has not yet written back to main memory.
Why it matters
print(gpu_tensor)

Suppose `gpu_tensor` is still being computed on the GPU. The CPU cannot print its values until the GPU finishes the computation and copies the data from VRAM to RAM. At that point the CPU blocks (i.e., it waits), which stalls the whole program.
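A few common operations that introduce such synchronization points are sketched below. The snippet uses a CPU fallback so it runs anywhere; on a real GPU, each marked line forces the CPU to wait for the device:

```python
import torch

# Illustrative sketch of operations that force host-device synchronization.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loss = (torch.randn(1000, device=device) ** 2).mean()

value = loss.item()      # sync: copies a single scalar from device to host
host_copy = loss.cpu()   # sync: explicit device-to-host copy
as_list = loss.tolist()  # sync: same device-to-host copy under the hood
print(loss)              # sync: printing must fetch the values first
```

Calls like `loss.item()` inside a training loop are a classic hidden bottleneck: each one drains the GPU's queue before the CPU can move on.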
Efficient tensor creation
| Code | What happens | Efficiency |
|---|---|---|
| `torch.randn(100, 100).to(device)` | Creates the tensor on the CPU, then transfers it to the GPU. | Less efficient – two steps (allocation + copy). |
| `torch.randn(100, 100, device=device)` | Instructs the GPU to allocate and fill the tensor directly. | More efficient – only a single allocation on the GPU. |
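The two rows of the table side by side, with a CPU fallback so the snippet runs without a GPU:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Two-step: allocate on the CPU, then copy the data across to the device.
t_slow = torch.randn(100, 100).to(device)

# One-step: ask the device to allocate and fill the tensor directly.
t_fast = torch.randn(100, 100, device=device)

# Both end up on the same device with the same shape...
print(t_fast.device, t_fast.shape)
```

...but the first version pays for a host allocation plus a host‑to‑device copy, while the second never touches host memory at all.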
Takeaway
Every synchronization point forces the host and device to wait for each other, dramatically reducing throughput. Good GPU programming strives to minimize these points so that both the CPU and GPU stay busy.
“You want your GPUs to go brrrrr.”

Image by author (generated with ChatGPT)
Scaling Up: Distributed Computing and Ranks
Training large models—especially Large Language Models (LLMs)—often exceeds the compute capacity of a single GPU. To harness multiple GPUs you need distributed computing, and with it comes the concept of a rank.
What is a Rank?
- A rank is a separate CPU process that is assigned:
  - One GPU device (e.g., `cuda:0`, `cuda:1`, …)
  - A unique integer ID (`0`, `1`, `2`, …)
When you launch a training script on two GPUs, two processes are created:
| Process | Rank ID | Assigned GPU |
|---|---|---|
| 1 | 0 | cuda:0 |
| 2 | 1 | cuda:1 |
Each process runs its own instance of the Python script. On a single machine (a single node) the processes share the same CPU but do not share memory or state. They operate independently, coordinated only through the distributed backend you configure (e.g., NCCL, Gloo, MPI).
Why Ranks Matter
Even though every rank executes the same code, you can use the rank ID to:
- Partition data: each rank processes a different slice of the dataset.
- Assign roles: rank 0 often handles logging, checkpointing, or validation, while other ranks focus purely on training.
- Control device placement: `torch.cuda.set_device(rank)` ensures each process talks to its own GPU.
Minimal Example (PyTorch)
import os
import torch
import torch.distributed as dist
def main():
    # 1️⃣ Initialize the process group
    dist.init_process_group(
        backend="nccl",        # NCCL works best for GPUs
        init_method="env://",  # Use environment variables for init
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )

    # 2️⃣ Set the device for this rank
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    # 3️⃣ Example: each rank gets its own data slice
    dataset = torch.arange(0, 100)  # dummy dataset
    per_rank = len(dataset) // dist.get_world_size()
    start = rank * per_rank
    end = start + per_rank
    local_data = dataset[start:end].to(device)

    # 4️⃣ Your training loop would go here …
    print(f"Rank {rank} handling data {start}:{end} on {device}")

    # 5️⃣ Clean up
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
How to launch (two‑GPU node):
torchrun --nproc_per_node=2 \
--master_addr=127.0.0.1 \
--master_port=29500 \
your_script.py
- `torchrun` automatically sets the `RANK` and `WORLD_SIZE` environment variables for each process.
- Each process receives its own GPU (`cuda:0` for rank 0, `cuda:1` for rank 1) and can work on its own data slice.
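The environment-variable handoff can be sketched without any GPU at all. This illustrative snippet reads the variables `torchrun` sets (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`), defaulting to a single-process configuration so the same script also runs standalone:

```python
import os

# Read torchrun's environment variables, with single-process defaults.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", rank))

# Same slicing logic as the minimal example above: each rank takes an
# equal contiguous chunk of a 100-element dataset.
per_rank = 100 // world_size
start, end = rank * per_rank, (rank + 1) * per_rank
print(f"Rank {rank}/{world_size} (local {local_rank}) -> slice {start}:{end}")
```

Run under `torchrun --nproc_per_node=2`, each of the two processes prints a different, non-overlapping slice; run plainly with `python`, it falls back to one process owning the whole dataset.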
Takeaway
- Rank = independent process + unique ID + dedicated GPU.
- Use the rank ID to split work, manage logging, and keep each GPU busy.
- The next blog post will dive deeper into data parallelism, gradient synchronization, and practical tricks for scaling up training.
Conclusion
Congratulations on making it to the end of the post! Here’s what you’ve learned:
- Host/Device relationship
- Asynchronous execution
- CUDA streams and how they enable concurrent GPU work
- Host‑Device synchronization
Next up: In the upcoming blog post we’ll dive deeper into point‑to‑point and collective operations, which allow multiple GPUs to coordinate complex workflows such as distributed neural‑network training.