AI in Multiple GPUs: Understanding the Host and Device Paradigm
Source: Towards Data Science
- Part 1: Understanding the Host and Device Paradigm — this article
- Part 2: Point‑to‑Point and Collective Operations — coming soon
- Part 3: How GPUs Communicate — coming soon
- Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP) — coming soon
- Part 5: ZeRO — coming soon
- Part 6: Tensor Parallelism — coming soon
Introduction
This guide explains the foundational concepts of how a CPU and a discrete graphics card (GPU) work together. It provides a high‑level overview to help you build a mental model of the host‑device paradigm, focusing specifically on NVIDIA GPUs, which are the most commonly used for AI workloads.
Note: Integrated GPUs (e.g., those in Apple Silicon chips) have a different architecture and are not covered in this post.
The Big Picture: Host vs. Device
The most important concept to grasp is the relationship between the Host and the Device.
| Component | What it is | Role |
|---|---|---|
| Host | Your CPU | Runs the operating system and executes your Python script line‑by‑line. It is the commander that controls the overall logic and tells the Device what to do. |
| Device | Your GPU | A powerful, specialized coprocessor designed for massively parallel computations. It is the accelerator that does nothing until the Host assigns it a task. |
- Your program always starts on the CPU.
- When you want the GPU to perform a task (e.g., multiplying two large matrices), the CPU sends the instructions and the data over to the GPU.
Understanding this host‑device interaction is the foundation for effective GPU programming.
The CPU‑GPU Interaction
The host (CPU) communicates with the device (GPU) through a queuing system.
- CPU initiates commands – When your script (running on the CPU) reaches a line intended for the GPU (e.g., `tensor.to('cuda')`), it creates a command for the GPU.
- Commands are queued – The CPU does not wait; it places the command onto a special to‑do list for the GPU called a CUDA stream (more on this in the next section).
- Asynchronous execution – The CPU continues executing subsequent lines of code while the GPU works on the queued operation. This asynchronous execution is essential for high performance, allowing the CPU to prepare the next batch of data or perform other tasks while the GPU crunches numbers.
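The queuing behavior described above can be mimicked with a toy producer/consumer sketch in plain Python. This models only the idea — real CUDA queuing happens inside the driver, not in Python — but it shows how the "host" keeps running while the "device" drains its queue:

```python
import queue
import threading

# Toy model of the host-device relationship (illustrative only).
command_queue = queue.Queue()
results = []

def gpu_worker():
    # The "device": drains commands from its queue in submission order.
    while True:
        command = command_queue.get()
        if command is None:  # sentinel: no more work
            break
        results.append(command())  # execute the queued operation

worker = threading.Thread(target=gpu_worker)
worker.start()

# The "host": enqueues work and immediately moves on.
command_queue.put(lambda: 2 + 2)
command_queue.put(lambda: 3 * 3)
print("Host is free to run other code while the worker computes...")

# An explicit synchronization point: wait for all queued work to finish.
command_queue.put(None)
worker.join()
print(results)  # [4, 9]
```

The `worker.join()` call plays the role of `torch.cuda.synchronize()`: it is the one place where the host deliberately blocks until the queue is empty.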
CUDA Streams
A CUDA Stream is an ordered queue of GPU operations. Operations submitted to a single stream execute in order, one after another. Operations in different streams, however, can run concurrently, allowing the GPU to juggle multiple independent workloads at the same time.
By default, every PyTorch GPU operation is enqueued on the current active stream (usually the default stream that PyTorch creates automatically). This is simple and predictable: each operation waits for the previous one to finish before starting. For most code you never notice this, but it can leave performance on the table when you have work that could overlap.
Multiple Streams: Concurrency
The classic use case for multiple streams is overlapping computation with data transfers. While the GPU processes batch N, you can simultaneously copy batch N + 1 from CPU RAM to GPU VRAM:
Stream 0 (compute): [process batch 0]────[process batch 1]───
Stream 1 (transfer): ────[copy batch 1]────[copy batch 2]───
This pipeline works because compute and data transfer use separate hardware units inside the GPU, enabling true parallelism.
In PyTorch you create streams and schedule work onto them with context managers:
import torch
compute_stream = torch.cuda.Stream()
transfer_stream = torch.cuda.Stream()
# -------------------------------------------------
# Transfer work (runs on `transfer_stream`)
# -------------------------------------------------
with torch.cuda.stream(transfer_stream):
    # `non_blocking=True` lets the copy proceed asynchronously
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)

# -------------------------------------------------
# Compute work (runs on `compute_stream`)
# -------------------------------------------------
with torch.cuda.stream(compute_stream):
    # This runs concurrently with the transfer above
    output = model(current_batch)
Tip: The `non_blocking=True` flag on `.to()` is essential. Without it the copy blocks the CPU thread even though you intend it to run asynchronously. Note that an asynchronous host‑to‑device copy also requires the source CPU tensor to live in pinned (page‑locked) memory, e.g. via `.pin_memory()`; otherwise the copy quietly degrades to a synchronous one.
Synchronization Between Streams
Since streams are independent, you must explicitly signal when one depends on another.
- Global synchronization (blunt tool):

torch.cuda.synchronize()  # Waits for *all* streams on the device to finish

- Fine‑grained synchronization with CUDA Events (surgical tool). An event marks a specific point in a stream; another stream can wait on that event without halting the CPU thread.

event = torch.cuda.Event()

# -------------------------------------------------
# Transfer stream: copy data and record an event
# -------------------------------------------------
with torch.cuda.stream(transfer_stream):
    next_batch = next_batch_cpu.to('cuda', non_blocking=True)
    event.record()  # Mark: transfer is done

# -------------------------------------------------
# Compute stream: wait for the transfer before using the data
# -------------------------------------------------
with torch.cuda.stream(compute_stream):
    compute_stream.wait_event(event)  # Stall only this stream on the GPU
    output = model(next_batch)
Using events is more efficient than `stream.synchronize()` because only the dependent stream stalls on the GPU side; the CPU thread remains free to continue queuing work.
When Do You Need to Manage Streams Manually?
For typical PyTorch training loops you rarely have to touch streams directly. However, many high‑level utilities—such as DataLoader(pin_memory=True) and custom pre‑fetching pipelines—leverage this mechanism under the hood. Understanding streams helps you:
- Recognize why those settings exist.
- Diagnose subtle performance bottlenecks when they appear.
- Build advanced pipelines that overlap computation, data movement, and even kernel launches for maximal throughput.
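As a rough sketch of what utilities like `DataLoader(pin_memory=True)` set up for you — the dataset shapes and batch size here are invented for illustration, and the snippet falls back to the CPU when no GPU is available:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Pinned (page-locked) host buffers are what make non_blocking copies
# actually asynchronous. CPU fallback so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(
    dataset,
    batch_size=16,
    pin_memory=torch.cuda.is_available(),  # pinned buffers only help with a GPU
)

for inputs, labels in loader:
    # non_blocking=True only overlaps with compute when the source is pinned
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    # ... forward/backward pass would go here ...

print(inputs.shape)  # torch.Size([16, 10])
```

With `pin_memory=True`, the DataLoader hands you batches already sitting in pinned host memory, so the `.to(device, non_blocking=True)` copies can overlap with the previous batch's compute — exactly the two-stream pipeline sketched earlier.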
PyTorch Tensors
PyTorch is a powerful framework that abstracts away many details, but this abstraction can sometimes obscure what is happening under the hood.
When you create a PyTorch tensor, it consists of two parts:
- Metadata – shape, data type, device, etc.
- Numerical data – the actual values stored on the device.
t = torch.randn(100, 100, device=device)
- The metadata lives in the host’s RAM.
- The data lives in the device’s memory (e.g., GPU VRAM).
Why the distinction matters
- `print(t.shape)` – The CPU can retrieve the shape instantly because the metadata is already in RAM.
- `print(t)` – To display the tensor’s contents, PyTorch must transfer the data from VRAM to the host, which incurs a device‑to‑host copy and can be costly for large tensors.
Host‑Device Synchronization
Accessing GPU data from the CPU triggers a host‑device synchronization, a common performance bottleneck. This happens whenever the CPU needs a result that the GPU has not yet written back to main memory.
Why it matters
print(gpu_tensor)

Suppose `gpu_tensor` is still being computed on the GPU. The CPU cannot print its values until the GPU finishes the computation and copies the data from VRAM to RAM. At that point the CPU blocks (i.e., it waits), which stalls the whole program.
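A few common operations that introduce such synchronization points are sketched below. The snippet uses a CPU fallback so it runs anywhere; on a real GPU, each marked line forces the CPU to wait for the device:

```python
import torch

# Illustrative sketch of operations that force host-device synchronization.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loss = (torch.randn(1000, device=device) ** 2).mean()

value = loss.item()      # sync: copies a single scalar from device to host
host_copy = loss.cpu()   # sync: explicit device-to-host copy
as_list = loss.tolist()  # sync: same device-to-host copy under the hood
print(loss)              # sync: printing must fetch the values first
```

Calls like `loss.item()` inside a training loop are a classic hidden bottleneck: each one drains the GPU's queue before the CPU can move on.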
Efficient tensor creation
| Code | What happens | Efficiency |
|---|---|---|
| `torch.randn(100, 100).to(device)` | Creates the tensor on the CPU, then transfers it to the GPU. | Less efficient – two steps (allocation + copy). |
| `torch.randn(100, 100, device=device)` | Instructs the GPU to allocate and fill the tensor directly. | More efficient – only a single allocation on the GPU. |
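The two rows of the table side by side, with a CPU fallback so the snippet runs without a GPU:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Two-step: allocate on the CPU, then copy the data across to the device.
t_slow = torch.randn(100, 100).to(device)

# One-step: ask the device to allocate and fill the tensor directly.
t_fast = torch.randn(100, 100, device=device)

# Both end up on the same device with the same shape...
print(t_fast.device, t_fast.shape)
```

...but the first version pays for a host allocation plus a host‑to‑device copy, while the second never touches host memory at all.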
Takeaway
Every synchronization point forces the host and device to wait for each other, dramatically reducing throughput. Good GPU programming strives to minimize these points so that both the CPU and GPU stay busy.
“You want your GPUs to go brrrrr.”

Image by author (generated with ChatGPT)
Scaling Up: Distributed Computing and Ranks
Training large models—especially Large Language Models (LLMs)—often exceeds the compute capacity of a single GPU. To harness multiple GPUs you need distributed computing, and with it comes the concept of a rank.
What is a Rank?
- A rank is a separate CPU process that is assigned:
  - One GPU device (e.g., `cuda:0`, `cuda:1`, …)
  - A unique integer ID (`0`, `1`, `2`, …)
When you launch a training script on two GPUs, two processes are created:
| Process | Rank ID | Assigned GPU |
|---|---|---|
| 1 | 0 | cuda:0 |
| 2 | 1 | cuda:1 |
Each process runs its own instance of the Python script. On a single machine (a single node) the processes share the same CPU but do not share memory or state. They operate independently, coordinated only through the distributed backend you configure (e.g., NCCL, Gloo, MPI).
Why Ranks Matter
Even though every rank executes the same code, you can use the rank ID to:
- Partition data: each rank processes a different slice of the dataset.
- Assign roles: rank 0 often handles logging, checkpointing, or validation, while other ranks focus purely on training.
- Control device placement: `torch.cuda.set_device(rank)` ensures each process talks to its own GPU.
Minimal Example (PyTorch)
import os
import torch
import torch.distributed as dist
def main():
    # 1️⃣ Initialize the process group
    dist.init_process_group(
        backend="nccl",        # NCCL works best for GPUs
        init_method="env://",  # Use environment variables for init
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )

    # 2️⃣ Set the device for this rank
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    # 3️⃣ Example: each rank gets its own data slice
    dataset = torch.arange(0, 100)  # dummy dataset
    per_rank = len(dataset) // dist.get_world_size()
    start = rank * per_rank
    end = start + per_rank
    local_data = dataset[start:end].to(device)

    # 4️⃣ Your training loop would go here …
    print(f"Rank {rank} handling data {start}:{end} on {device}")

    # 5️⃣ Clean up
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
How to launch (two‑GPU node):
torchrun --nproc_per_node=2 \
--master_addr=127.0.0.1 \
--master_port=29500 \
your_script.py
- `torchrun` automatically sets the `RANK` and `WORLD_SIZE` environment variables for each process.
- Each process receives its own GPU (`cuda:0` for rank 0, `cuda:1` for rank 1) and can work on its own data slice.
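The environment-variable handoff can be sketched without any GPU at all. This illustrative snippet reads the variables `torchrun` sets (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`), defaulting to a single-process configuration so the same script also runs standalone:

```python
import os

# Read torchrun's environment variables, with single-process defaults.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", rank))

# Same slicing logic as the minimal example above: each rank takes an
# equal contiguous chunk of a 100-element dataset.
per_rank = 100 // world_size
start, end = rank * per_rank, (rank + 1) * per_rank
print(f"Rank {rank}/{world_size} (local {local_rank}) -> slice {start}:{end}")
```

Run under `torchrun --nproc_per_node=2`, each of the two processes prints a different, non-overlapping slice; run plainly with `python`, it falls back to one process owning the whole dataset.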
Takeaway
- Rank = independent process + unique ID + dedicated GPU.
- Use the rank ID to split work, manage logging, and keep each GPU busy.
- The next blog post will dive deeper into data parallelism, gradient synchronization, and practical tricks for scaling up training.
Conclusion
Congratulations on making it to the end of the post! Here’s what you’ve learned:
- Host/Device relationship
- Asynchronous execution
- CUDA streams and how they enable concurrent GPU work
- Host‑Device synchronization
Next up: In the upcoming blog post we’ll dive deeper into point‑to‑point and collective operations, which allow multiple GPUs to coordinate complex workflows such as distributed neural‑network training.