TPUs vs. GPUs: What They Are, How They Differ, and Which Workloads Belong on Each
Source: Dev.to
GPU vs. TPU on Google Cloud
If you’ve worked with machine learning on Google Cloud, you’ve hit the choice: GPU instance or TPU?
Most teams default to GPU because that’s what they already know. But as inference costs climb and TPU tooling matures, it’s worth understanding what each chip actually does and when one outperforms the other.
This post covers:
- What GPUs and TPUs are
- How they work
- Which workloads run better on each
- Google’s current TPU lineup (including the eighth‑generation chips announced at Google Cloud Next 2026)
[Image: GPU vs. TPU comparison. Source: Google Cloud]
1. Background
GPUs
- Originally built for rendering video games.
- Handle AI workloads well because the underlying math—large‑scale parallel floating‑point operations—is the same.
- Researchers realized this around 2012, and GPUs quickly became the default for training neural networks.
TPUs
- In 2013, Google Brain engineers calculated that if every Android user used voice search for just three minutes a day, Google would need to double its global data‑center capacity.
- Running inference on general‑purpose GPUs at that scale was too expensive and power‑hungry.
- Google’s solution: a chip designed specifically for neural‑network math.
- The first TPU entered production in Google’s data centers in 2015 and became publicly available as Cloud TPU in 2018.
- Core idea: strip out everything a GPU carries from its graphics origins and focus entirely on matrix multiplication—a principle that still drives every TPU generation today.
2. How the Chips Work
2.1 GPU Architecture
- A parallel processor with thousands of smaller cores.
- Compared to a CPU (8–64 powerful general‑purpose cores), a high‑end GPU like the NVIDIA H100 has thousands of cores that run the same instruction across many data points at once (SIMD – Single Instruction, Multiple Data).
- Precision formats supported: FP32, FP16, BF16, INT8, FP8.
- Runs PyTorch, TensorFlow, JAX, CUDA libraries, simulations, rendering pipelines, etc.
- Because of this broad support, a GPU carries hardware for texture mapping, branch prediction, and other operations that sit idle during pure matrix multiplication.
- Memory: The NVIDIA H100 ships with 80 GB of on‑package HBM (HBM3 on the SXM part; the PCIe variant uses HBM2e). Memory bandwidth is often the bottleneck for AI workloads, not raw compute.
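The bandwidth-vs-compute point above can be made concrete with a back-of-the-envelope roofline check. The H100 figures below (~989 BF16 TFLOPS, ~3.35 TB/s HBM bandwidth for the SXM part) are assumed round numbers for illustration, not values from this article:

```python
# Roofline sketch: is a matmul compute-bound or memory-bound?
# Assumed H100 SXM figures (illustrative): ~989 BF16 TFLOPS, ~3.35 TB/s HBM.
PEAK_FLOPS = 989e12   # BF16 FLOP/s
PEAK_BW = 3.35e12     # bytes/s

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n], each tensor touched once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

# Intensity needed to saturate compute (~295 FLOP/byte with these numbers).
ridge = PEAK_FLOPS / PEAK_BW

for m in (1, 32, 1024):   # batch-1 decode vs. small batch vs. training-sized matmul
    ai = matmul_intensity(m, 4096, 4096)
    bound = "compute-bound" if ai >= ridge else "memory-bound"
    print(f"m={m:5d}: {ai:8.1f} FLOP/byte -> {bound}")
```

With these assumptions a batch‑1 matmul sits around 1 FLOP/byte (hopelessly memory‑bound), while a 1024‑row matmul clears the ~295 FLOP/byte ridge and becomes compute‑bound, which is why batch size matters so much on both chip families.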
2.2 TPU Architecture
- Built for one job: tensor math, specifically the matrix multiplications at the core of neural‑network training and inference.
- Key hardware: the systolic array.
- In a standard processor, each operation reads inputs from memory, computes, and writes the result back.
- In a systolic array, data flows through a grid of multiply‑and‑accumulate units. You load the weights once, pass inputs through the grid, and results flow from unit to unit without returning to main memory, eliminating constant memory round‑trips.
- Precision: Google added BF16 support from early generations; GPUs added it later. Recent chips (both GPU and TPU) support FP8 natively, boosting inference throughput.
- Limitations:
- Poor with dynamic control flow, variable‑length sequences, and custom operations.
- Best suited for static computation graphs, which is what most transformer models produce.
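The weight-stationary dataflow described above can be sketched in plain Python: a toy grid of multiply-and-accumulate cells where weights are loaded once and partial sums pass from cell to cell instead of round-tripping through main memory. This illustrates the principle only; it is not Google's actual microarchitecture:

```python
# Toy weight-stationary systolic computation of y = W @ x for one input vector.
# Illustrative sketch only -- a real systolic array pipelines many inputs at once.

def systolic_matvec(W, x):
    rows, cols = len(W), len(W[0])
    assert len(x) == cols
    # Step 1: load the weights into the grid ONCE (the "weight-stationary" part).
    grid = [[W[r][c] for c in range(cols)] for r in range(rows)]
    # Step 2: stream the input through; each cell multiplies its stationary
    # weight by the incoming activation and hands the accumulated partial sum
    # to its neighbor -- no write-back to "main memory" between cells.
    y = []
    for r in range(rows):
        acc = 0
        for c in range(cols):
            acc = acc + grid[r][c] * x[c]   # multiply-and-accumulate cell
        y.append(acc)
    return y

print(systolic_matvec([[1, 2], [3, 4]], [10, 20]))   # [50, 110]
```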
3. Choosing the Right Accelerator
3.1 When GPUs Are Recommended
| Workload Type | Reason |
|---|---|
| PyTorch‑first teams | Most research code, open‑source checkpoints, and fine‑tuning guides assume a GPU. |
| TensorFlow ops not on Cloud TPU | Some TF ops are unavailable on TPU (see Google’s op‑list). |
| Dynamic inputs (variable‑length sequences, conditional branches, custom CUDA extensions) | GPUs handle these gracefully; TPUs can be tricky. |
| Medium‑to‑large models with larger effective batch sizes | GPUs scale well with batch size. |
| Multi‑cloud or on‑prem deployments | TPUs exist only on Google Cloud. |
| Mixed workloads (ML training + scientific simulation + rendering) | GPUs are general‑purpose; TPUs are specialized. |
| Small teams moving fast | GPU tooling (profilers, debuggers, community tutorials) is more mature; diagnosing performance issues is easier. |
3.2 When TPUs Shine
| Workload Type | Reason |
|---|---|
| Training massive deep‑learning models (e.g., large language models) | TPUs handle the immense number of matrix calculations efficiently. |
| Models dominated by matrix computations | Systolic array excels at dense linear algebra. |
| Long‑running training jobs (weeks or months) | TPU pods provide high throughput and lower cost per token. |
| Ultra‑large embeddings (advanced ranking & recommendation) | TPUs’ memory architecture is optimized for large weight matrices. |
| Large‑scale transformer training | TPU pods scale to tens of thousands of chips via Google’s Inter‑Chip Interconnect (ICI); training something like Gemma on a TPU pod is often faster and cheaper than on a GPU cluster. |
| High‑volume production inference | TPU v6e (Trillium) and Ironwood are built specifically for inference; Ironwood delivers >4× better performance per chip vs. v6e. |
| Models with no custom PyTorch/JAX ops | Pure‑TensorFlow/JAX workloads map cleanly onto TPU hardware. |
| Google open‑weight models (e.g., Gemma 4, released Apr 2026) | Optimized for TPU serving; Google provides JAX reference implementations and community guides for deploying via vLLM on Cloud TPU. |
3.3 Workloads Not Suited for TPUs
- Workloads dominated by frequent branching or many element‑wise operations rather than dense linear algebra.
- Workloads that need high‑precision arithmetic (e.g., FP64).
- Neural‑network workloads that contain custom operations in the main training loop.
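A common workaround for the variable-length limitation above is to pad inputs into a small set of fixed "bucket" shapes, so the compiler sees only static graphs and avoids recompiling for every new sequence length. A minimal framework-free sketch (the bucket sizes are arbitrary examples):

```python
# Pad variable-length sequences to a few fixed bucket lengths so a TPU-style
# compiler only ever sees static shapes. Bucket sizes here are arbitrary.
BUCKETS = [32, 64, 128]
PAD_ID = 0

def pad_to_bucket(tokens):
    """Return (padded_tokens, real_length); raises if no bucket is large enough."""
    for size in BUCKETS:
        if len(tokens) <= size:
            return tokens + [PAD_ID] * (size - len(tokens)), len(tokens)
    raise ValueError(f"sequence of length {len(tokens)} exceeds largest bucket")

padded, n = pad_to_bucket(list(range(1, 41)))   # 40 real tokens -> 64-slot bucket
print(len(padded), n)                            # 64 40
```

The cost is wasted compute on padding; the benefit is that every batch compiles to one of three shapes instead of hundreds.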
4. Google’s Current TPU Lineup (as of Cloud Next 2026)
| Generation | Codename | Primary Use | Peak Compute* | Energy Efficiency | Notable Features |
|---|---|---|---|---|---|
| v5e | – | General‑purpose training & inference | – | – | Baseline generation |
| v6e | Trillium | High‑volume inference | – | – | Optimized memory bandwidth for serving |
| v7 | Ironwood | Next‑gen inference | 4× performance per chip vs. v6e | +67 % vs. v5e | FP8 native support, lower latency |
| v8 (8th‑gen) | – | Massive training pods | 4.7× peak compute of v5e | +67 % energy efficiency | Scales to tens of thousands of chips via ICI, integrated with vLLM for serving |
*Relative figures are as quoted by Google, against the comparison generation noted in each row.
5. Quick Takeaways
- GPU = versatile, mature tooling, works everywhere (cloud, on‑prem). Ideal for dynamic models, mixed workloads, and teams already deep in PyTorch.
- TPU = specialized for dense matrix math, excels at large‑scale training and high‑throughput inference when the workload fits a static graph. Best on Google Cloud, especially for transformer‑heavy workloads and Google‑published models.
Choose the accelerator that aligns with your framework, workload characteristics, and deployment environment.
Google TPU 8th‑Generation Overview
Key take‑aways
- Two new chips: TPU 8t (training) and TPU 8i (inference).
- Both run on Google’s Axion ARM host CPU and use liquid cooling.
- vLLM now supports the TPU v6e for both offline batch inference and online API serving.
TPU v7 (Ironwood) – Inference
- TPU v6e (Trillium), at 256 chips per pod, remains the work‑horse for cost‑sensitive inference workloads.
- Ironwood specs per chip
  - 4,614 FP8 TFLOPS
  - 192 GB HBM3E memory
  - 7.37 TB/s memory bandwidth
  - 9.6 Tb/s inter‑chip interconnect
- Pod scaling – up to 9,216 chips → 42.5 FP8 ExaFLOPS per pod (≈ 4× the per‑chip performance of the previous generation, Trillium).
- Announced at Google Cloud Next 2025.
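The pod-level figure follows directly from the per-chip numbers: 9,216 chips at 4,614 FP8 TFLOPS each works out to roughly 42.5 ExaFLOPS. A quick sanity check:

```python
# Sanity-check the pod math quoted above: chips per pod x per-chip FP8 TFLOPS.
CHIPS_PER_POD = 9_216
TFLOPS_PER_CHIP = 4_614   # FP8

pod_exaflops = CHIPS_PER_POD * TFLOPS_PER_CHIP * 1e12 / 1e18
print(f"{pod_exaflops:.1f} FP8 ExaFLOPS per pod")   # ~42.5
```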
TPU 8t – Training Chip
- Purpose: High‑throughput model training.
- Pod configuration – 9,600 chips, 2 PB shared HBM memory, 121 FP4 ExaFLOPS compute (≈ 3× the compute per pod of Ironwood).
- Inter‑chip bandwidth – 19.2 Tb/s per chip (double Ironwood).
- Network fabric – Virgo Network can link 134 k chips within a data‑center and theoretically > 1 M chips across sites.
- Data‑path enhancements
- TPUDirect RDMA & TPU Direct Storage bypass the host CPU, doubling bandwidth for large transfers.
- Efficiency target – 97 % goodput (i.e., 97 % of cycles spent on actual learning).
TPU 8i – Inference Chip
- Purpose: Low‑latency, high‑throughput inference (especially Mixture‑of‑Experts).
- Pod configuration – 1,152 chips, 11.6 FP8 ExaFLOPS per pod.
- Memory – 288 GB HBM per chip (more than the 8t) + 384 MB on‑chip SRAM (3× Ironwood).
- Performance
- 80 % better performance‑per‑dollar vs. Ironwood for inference.
- 2× better performance‑per‑watt.
- Interconnect – Boardfly reduces max network hops from 16 → 7, crucial for MoE models.
- Compute units – Replaces Ironwood’s SparseCores with a Collectives Acceleration Engine (CAE), cutting collective‑operation latency by 5×.
Why more memory on the inference chip?
Large MoE inference is memory‑bandwidth bound. Serving tokens requires streaming weights and KV‑cache faster than training, so the 8i packs more HBM per chip.
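The memory-bound claim implies a hard ceiling on single-stream decode speed: every generated token must stream the active weights from HBM, so tokens/sec cannot exceed bandwidth divided by bytes read per token. The 7.37 TB/s figure matches the per-chip bandwidth quoted earlier; the model sizes below are illustrative assumptions, not figures from this article:

```python
# Upper bound on batch-1 decode throughput when weight streaming dominates:
# tokens/s <= HBM bandwidth / bytes of (active) weights read per token.
HBM_BW = 7.37e12   # bytes/s per chip (figure quoted earlier in this post)

def max_tokens_per_sec(active_params, bytes_per_param=1):
    """Bandwidth ceiling; bytes_per_param=1 corresponds to FP8 weights."""
    return HBM_BW / (active_params * bytes_per_param)

# Assumed model sizes: a dense 70B model vs. an MoE with ~17B active params/token.
for name, params in [("dense-70B", 70e9), ("moe-17B-active", 17e9)]:
    print(f"{name}: <= {max_tokens_per_sec(params):.0f} tokens/s per chip")
```

This is also why MoE models pair well with an inference chip that has more HBM: only the routed experts' weights stream per token, raising the ceiling, but all experts still have to fit in memory.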
Tooling & Ecosystem
| Domain | Recommended Tooling |
|---|---|
| Research & Development | GPUs (mature ecosystem, large community) |
| Production AI on TPUs | JAX, TensorFlow, PyTorch XLA, vLLM (for TPU v6e) |
| Model Reference Implementations | MaxText – LLM reference for TPUs (GitHub) |
| Open‑weight LLMs | Gemma – DeepMind library (GitHub) |
| Inference Serving | Gemma 4 on TPU, custom serving stacks |
Further Reading
- Google Cloud Blog – TPU 8t and TPU 8i technical deep dive
- Google Cloud Blog – Ironwood: The first Google TPU for the age of inference
- Google Cloud Blog – Training large models on Ironwood TPUs
- Google Cloud Blog – Performance per dollar of GPUs and TPUs for AI inference
- Google Cloud Blog – Building production AI on Google Cloud TPUs with JAX
- GitHub – MaxText: LLM reference implementation for TPUs
- GitHub – Gemma open‑weight LLM library (DeepMind)
- TechRadar – Google Cloud unveils eighth‑generation TPUs
TL;DR
- TPU 8t: massive training pod (9,600 chips, 121 FP4 ExaFLOPS), double inter‑chip bandwidth, Virgo fabric for massive scaling.
- TPU 8i: inference‑focused pod (1,152 chips, 11.6 FP8 ExaFLOPS), more on‑chip memory, Boardfly interconnect, CAE for fast collectives.
- Both chips deliver dramatically better performance‑per‑dollar and performance‑per‑watt versus the previous Ironwood generation, and they are fully supported by the latest tooling (JAX, vLLM, MaxText, etc.).