TPUs vs. GPUs: What They Are, How They Differ, and Which Workloads Belong on Each

Published: April 30, 2026 at 09:53 PM EDT
9 min read
Source: Dev.to

GPU vs. TPU on Google Cloud

If you’ve worked with machine learning on Google Cloud, you’ve hit the choice: GPU instance or TPU?
Most teams default to GPU because that’s what they already know. But as inference costs climb and TPU tooling matures, it’s worth understanding what each chip actually does and when one outperforms the other.

This post covers:

  1. What GPUs and TPUs are
  2. How they work
  3. Which workloads run better on each
  4. Google’s current TPU lineup (including the eighth‑generation chips announced at Google Cloud Next 2026)

[Image: GPU vs. TPU on Google Cloud (source: Google Cloud)]


1. Background

GPUs

  • Originally built for rendering video games.
  • Handle AI workloads well because the underlying math—large‑scale parallel floating‑point operations—is the same.
  • Researchers realized this around 2012, and GPUs quickly became the default for training neural networks.

TPUs

  • In 2013, Google Brain engineers calculated that if every Android user used voice search for just three minutes a day, Google would need to double its global data‑center capacity.
  • Running inference on general‑purpose GPUs at that scale was too expensive and power‑hungry.
  • Google’s solution: a chip designed specifically for neural‑network math.
  • The first TPU entered production in Google’s data centers in 2015 and became publicly available as Cloud TPU in 2018.
  • Core idea: strip out everything a GPU carries from its graphics origins and focus entirely on matrix multiplication—a principle that still drives every TPU generation today.

2. How the Chips Work

2.1 GPU Architecture

  • A parallel processor with thousands of smaller cores.
  • Compared to a CPU (8–64 powerful general‑purpose cores), a high‑end GPU like the NVIDIA H100 has thousands of cores that run the same instruction across many data points at once (SIMD – Single Instruction, Multiple Data).
  • Precision formats supported: FP32, FP16, BF16, INT8, FP8.
  • Runs PyTorch, TensorFlow, JAX, CUDA libraries, simulations, rendering pipelines, etc.
  • Because of this broad support, a GPU carries hardware for texture mapping, branch prediction, and other operations that sit idle during pure matrix multiplication.
  • Memory: The NVIDIA H100 ships with 80 GB of high‑bandwidth memory on‑package (HBM3 on the SXM variant, HBM2e on the PCIe card). For AI workloads the bottleneck is often memory bandwidth, not raw compute (see the quick check below).
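
To see what "bandwidth‑bound" means in practice, here is a rough roofline‑style check in plain Python. The peak‑compute and bandwidth figures are approximate H100‑class assumptions (not vendor‑verified numbers), and the matrix shapes are arbitrary:

```python
# Back-of-the-envelope roofline check: is a matmul compute-bound or
# memory-bound on a GPU? (Illustrative numbers, not vendor-verified.)

def matmul_intensity(m, k, n, bytes_per_elem=2):
    """Arithmetic intensity (FLOPs per byte) of an (m,k) x (k,n) matmul."""
    flops = 2 * m * k * n                                     # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)    # read A and B, write C
    return flops / bytes_moved

# Assumed H100-class figures (BF16): ~1e15 FLOP/s compute, ~3.35e12 B/s HBM bandwidth.
peak_flops = 1.0e15
peak_bw = 3.35e12
ridge_point = peak_flops / peak_bw                            # ~300 FLOPs per byte

for shape in [(1, 4096, 4096), (512, 4096, 4096), (4096, 4096, 4096)]:
    ai = matmul_intensity(*shape)
    bound = "compute-bound" if ai > ridge_point else "memory-bound"
    print(f"{shape}: intensity {ai:.1f} FLOPs/byte -> {bound}")
```

Small or skinny matmuls (like single‑token decoding) sit far below the ridge point, so the HBM, not the cores, sets the speed limit.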

2.2 TPU Architecture

  • Built for one job: tensor math, specifically the matrix multiplications at the core of neural‑network training and inference.
  • Key hardware: the systolic array.
    • In a standard processor, each operation reads inputs from memory, computes, and writes the result back.
    • In a systolic array, data flows through a grid of multiply‑and‑accumulate units. You load the weights once, pass inputs through the grid, and results flow from unit to unit without returning to main memory, eliminating constant memory round‑trips.
  • Precision: Google added BF16 support from early generations; GPUs added it later. Recent chips (both GPU and TPU) support FP8 natively, boosting inference throughput.
  • Limitations:
    • Poor with dynamic control flow, variable‑length sequences, and custom operations.
    • Works well only with static computation graphs, which fortunately is what most transformer models produce (see the JAX sketch below).
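
As a concrete example of a static‑graph workload, the sketch below jit‑compiles a small dense layer with JAX: XLA traces the function once into a fixed graph of matrix operations, which is exactly the form a systolic array (or a GPU's tensor cores) consumes. The shapes and dtypes are arbitrary illustrations, not a recommended configuration:

```python
import jax
import jax.numpy as jnp

# A dense layer expressed as pure matrix math: exactly the shape of work
# a systolic array is built for. jax.jit hands the traced, static graph to XLA.
@jax.jit
def dense_layer(x, w, b):
    return jax.nn.relu(jnp.dot(x, w) + b)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 1024), dtype=jnp.bfloat16)   # batch of activations
w = jax.random.normal(key, (1024, 4096), dtype=jnp.bfloat16)  # weight matrix
b = jnp.zeros((4096,), dtype=jnp.bfloat16)

y = dense_layer(x, w, b)   # first call compiles; later calls reuse the compiled binary
print(y.shape, y.dtype)    # (128, 4096) bfloat16
```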

3. Choosing the Right Accelerator

3.1 When GPUs Shine

| Workload Type | Reason |
| --- | --- |
| PyTorch‑first teams | Most research code, open‑source checkpoints, and fine‑tuning guides assume a GPU. |
| TensorFlow ops not supported on Cloud TPU | Some TF ops are unavailable on TPU (see Google's op list). |
| Dynamic inputs (variable‑length sequences, conditional branches, custom CUDA extensions) | GPUs handle these gracefully; TPUs can be tricky. |
| Medium‑to‑large models with larger effective batch sizes | GPUs scale well with batch size. |
| Multi‑cloud or on‑prem deployments | TPUs exist only on Google Cloud. |
| Mixed workloads (ML training + scientific simulation + rendering) | GPUs are general‑purpose; TPUs are specialized. |
| Small teams moving fast | GPU tooling (profilers, debuggers, community tutorials) is more mature, so diagnosing performance issues is easier. |

3.2 When TPUs Shine

| Workload Type | Reason |
| --- | --- |
| Training massive deep‑learning models (e.g., large language models) | TPUs handle the immense number of matrix calculations efficiently. |
| Models dominated by matrix computations | The systolic array excels at dense linear algebra. |
| Long‑running training jobs (weeks or months) | TPU pods provide high throughput and a lower cost per token. |
| Ultra‑large embeddings (advanced ranking & recommendation) | The TPU memory architecture is optimized for large weight matrices. |
| Large‑scale transformer training | TPU pods scale to tens of thousands of chips via Google's Inter‑Chip Interconnect (ICI); training something like Gemma on a TPU pod is often faster and cheaper than on a GPU cluster. |
| High‑volume production inference | TPU v6e (Trillium) and Ironwood are built specifically for inference; Ironwood delivers >4× better performance per chip than v6e. |
| Models with no custom PyTorch/JAX ops | Pure TensorFlow/JAX workloads map cleanly onto TPU hardware. |
| Google open‑weight models (e.g., Gemma 4, released Apr 2026) | Optimized for TPU serving; Google provides JAX reference implementations and community guides for deploying via vLLM on Cloud TPU. |
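
To make the pod‑scaling rows above concrete, here is a minimal data‑parallel training step in JAX. The linear model, learning rate, and batch shapes are toy placeholders; on a real pod you would spread the work over many hosts with sharding annotations, but the pattern (shard the batch, all‑reduce the gradients over the interconnect) is the same:

```python
from functools import partial
import jax
import jax.numpy as jnp

# Data parallelism across the accelerator cores visible to this host
# (TPU cores in a slice, or GPUs). Each core gets one shard of the batch;
# gradients are averaged with a single collective over the interconnect.
n_dev = jax.local_device_count()

def loss_fn(w, x, y):
    pred = jnp.dot(x, w)                              # toy linear model
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="batch")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, axis_name="batch")   # all-reduce across cores
    return w - 0.01 * grads

# Replicate the weights and shard the batch: leading axis = device count.
w = jnp.broadcast_to(jnp.zeros((64, 1)), (n_dev, 64, 1))
x = jnp.ones((n_dev, 32, 64))
y = jnp.ones((n_dev, 32, 1))
w = train_step(w, x, y)
print(w.shape)   # (n_dev, 64, 1): one synchronized copy of the weights per core
```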

3.3 Workloads Not Suited for TPUs

  • Workloads dominated by frequent branching or element‑wise operations rather than dense matrix math (see the sketch below for a related dynamic‑shape pitfall).
  • Workloads that need high‑precision arithmetic (e.g., FP64).
  • Neural‑network workloads that contain custom operations in the main training loop.
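
A related pitfall is dynamic input shapes: an XLA‑compiled function is specialized to the shapes it was traced with, so every new sequence length triggers a fresh compilation. A minimal sketch of the problem and the usual padding/bucketing workaround (the bucket sizes are arbitrary):

```python
import jax
import jax.numpy as jnp

@jax.jit
def score(tokens):
    # Any change in tokens.shape triggers a fresh XLA compilation,
    # because the compiled graph is specialized to static shapes.
    return jnp.sum(tokens * tokens)

# Naive: three different lengths -> three separate compilations.
for length in (17, 33, 50):
    score(jnp.ones((length,)))

# Common workaround: pad every sequence up to a small set of bucket sizes,
# so only a handful of graphs are ever compiled.
BUCKETS = (32, 64, 128)   # arbitrary illustrative buckets

def pad_to_bucket(tokens):
    size = next(b for b in BUCKETS if b >= tokens.shape[0])
    return jnp.pad(tokens, (0, size - tokens.shape[0]))

for length in (17, 33, 50):
    score(pad_to_bucket(jnp.ones((length,))))   # only buckets 32 and 64 compile
```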

4. Google’s Current TPU Lineup (as of Cloud Next 2026)

| Generation | Codename | Primary Use | Peak Compute* | Energy Efficiency | Notable Features |
| --- | --- | --- | --- | --- | --- |
| v5e | – | General‑purpose training & inference | Baseline | – | Baseline generation |
| v6e | Trillium | High‑volume inference | – | – | Optimized memory bandwidth for serving |
| v7 | Ironwood | Next‑gen inference | >4× per chip vs. v6e | +67 % vs. v5e | FP8 native support, lower latency |
| v8 (8th‑gen) | 8t / 8i | Massive training pods (8t) and inference (8i) | 4.7× peak compute of v5e | +67 % energy efficiency | Scales to tens of thousands of chips via ICI; integrated with vLLM for serving |

*Peak compute values are relative to the v5e generation and are quoted by Google.


5. Quick Takeaways

  • GPU = versatile, mature tooling, works everywhere (cloud, on‑prem). Ideal for dynamic models, mixed workloads, and teams already deep in PyTorch.
  • TPU = specialized for dense matrix math, excels at large‑scale training and high‑throughput inference when the workload fits a static graph. Best on Google Cloud, especially for transformer‑heavy workloads and Google‑published models.

Choose the accelerator that aligns with your framework, workload characteristics, and deployment environment.


Google TPU 8th‑Generation Overview

Key take‑aways

  • Two new chips: TPU 8t (training) and TPU 8i (inference).
  • Both run on Google’s Axion ARM host CPU and use liquid cooling.
  • vLLM now supports the TPU v6e for both offline batch inference and online API serving.

Ironwood (TPU v7) – Previous‑Generation Inference Chip

  • 256 chips per pod – still the work‑horse for cost‑sensitive inference workloads.
  • Specs per chip
    • 4,614 FP8 TFLOPS
    • 192 GB HBM3E memory
    • 7.37 TB/s memory bandwidth
    • 9.6 Tb/s inter‑chip interconnect
  • Pod scaling – up to 9,216 chips, delivering 42.5 FP8 ExaFLOPS per pod (≈ 4× the per‑chip performance of the previous generation; see the quick check below).
  • Announced at Google Cloud Next 2025.
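
The pod‑level figure is just the per‑chip number multiplied out; a quick sanity check:

```python
# Per-chip peak multiplied by chips per pod should reproduce the quoted pod figure.
per_chip_fp8_tflops = 4_614          # TFLOPS per chip, as quoted above
chips_per_pod = 9_216
pod_exaflops = per_chip_fp8_tflops * 1e12 * chips_per_pod / 1e18
print(f"{pod_exaflops:.1f} FP8 ExaFLOPS per pod")   # ~42.5
```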

TPU 8t – Training Chip

  • Purpose: High‑throughput model training.
  • Pod configuration – 9,600 chips, 2 PB shared HBM memory, 121 FP4 ExaFLOPS compute (≈ 3× the compute per pod of Ironwood).
  • Inter‑chip bandwidth – 19.2 Tb/s per chip (double Ironwood).
  • Network fabric – the Virgo Network can link 134k chips within a data center and, in theory, more than 1M chips across sites.
  • Data‑path enhancements
    • TPUDirect RDMA & TPU Direct Storage bypass the host CPU, doubling bandwidth for large transfers.
  • Efficiency target – 97 % goodput (i.e., 97 % of cycles spent on actual learning).

TPU 8i – Inference Chip

  • Purpose: Low‑latency, high‑throughput inference (especially Mixture‑of‑Experts).
  • Pod configuration – 1,152 chips, 11.6 FP8 ExaFLOPS per pod.
  • Memory – 288 GB HBM per chip (more than the 8t) + 384 MB on‑chip SRAM (3× Ironwood).
  • Performance
    • 80 % better performance‑per‑dollar vs. Ironwood for inference.
    • 2× better performance‑per‑watt.
  • Interconnect – Boardfly reduces the maximum number of network hops from 16 to 7, which is crucial for MoE models.
  • Compute units – replaces Ironwood's SparseCores with a Collectives Acceleration Engine (CAE), cutting collective‑operation latency.

Why more memory on the inference chip?
Large MoE inference is memory‑bandwidth bound: serving each generated token means streaming the active weights and the KV cache from memory, so the 8i packs more HBM per chip than the 8t.
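
A rough way to quantify that: at small batch sizes, every generated token has to stream the active weights from HBM at least once, so per‑chip bandwidth caps tokens per second. The model sizes below are hypothetical, and the bound ignores KV‑cache traffic and batching:

```python
# Rough upper bound on per-chip decode throughput when weight streaming
# dominates: tokens/s <= HBM bandwidth / bytes read per token.
# Model sizes are hypothetical illustrations; batching and KV cache are ignored.

hbm_bandwidth = 7.37e12            # bytes/s (per-chip figure quoted above)

def max_tokens_per_sec(active_params, bytes_per_param=1):   # FP8 -> 1 byte per weight
    bytes_per_token = active_params * bytes_per_param
    return hbm_bandwidth / bytes_per_token

for name, active in [("8B dense", 8e9), ("30B dense", 30e9),
                     ("MoE, 20B active of 200B total", 20e9)]:
    print(f"{name}: <= {max_tokens_per_sec(active):,.0f} tokens/s per chip")
```

Batching amortizes the weight reads across many requests, but the KV cache then adds its own per‑token traffic, which is one reason extra HBM and on‑chip SRAM pay off for MoE serving.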


Tooling & Ecosystem

| Domain | Recommended Tooling |
| --- | --- |
| Research & development | GPUs (mature ecosystem, large community) |
| Production AI on TPUs | JAX, TensorFlow, PyTorch XLA, vLLM (for TPU v6e) |
| Model reference implementations | MaxText – LLM reference implementation for TPUs (GitHub) |
| Open‑weight LLMs | Gemma – DeepMind library (GitHub) |
| Inference serving | Gemma 4 on TPU, custom serving stacks |
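
For the serving row above, the offline batch path in vLLM looks the same regardless of backend; a minimal sketch, assuming a TPU (or GPU) build of vLLM is installed, with the checkpoint name and sampling settings as placeholders:

```python
# Minimal offline batch inference with vLLM. The model ID and sampling
# parameters are placeholders; the same Python API is used whether the
# backend is a GPU or a Cloud TPU build of vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-2b-it")          # placeholder checkpoint
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain the difference between a GPU and a TPU in one paragraph."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```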

Further Reading

  • Google Cloud Blog – TPU 8t and TPU 8i technical deep dive
  • Google Cloud Blog – Ironwood: The first Google TPU for the age of inference
  • Google Cloud Blog – Training large models on Ironwood TPUs
  • Google Cloud Blog – Performance per dollar of GPUs and TPUs for AI inference
  • Google Cloud Blog – Building production AI on Google Cloud TPUs with JAX
  • GitHub – MaxText: LLM reference implementation for TPUs
  • GitHub – Gemma open‑weight LLM library (DeepMind)
  • TechRadar – Google Cloud unveils eighth‑generation TPUs

TL;DR

  • TPU 8t: massive training pod (9,600 chips, 121 FP4 ExaFLOPS), double inter‑chip bandwidth, Virgo fabric for massive scaling.
  • TPU 8i: inference‑focused pod (1,152 chips, 11.6 FP8 ExaFLOPS), more on‑chip memory, Boardfly interconnect, CAE for fast collectives.
  • Both chips deliver dramatically better performance‑per‑dollar and performance‑per‑watt versus the previous Ironwood generation, and they are fully supported by the latest tooling (JAX, vLLM, MaxText, etc.).
