TPUs vs. GPUs: What They Are, How They Differ, and Which Workloads Belong on Each
Source: Dev.to
GPU vs. TPU on Google Cloud
If you’ve worked with machine learning on Google Cloud, you’ve hit the choice: GPU instance or TPU?
Most teams default to GPU because that’s what they already know. But as inference costs climb and TPU tooling matures, it’s worth understanding what each chip actually does and when one outperforms the other.
This post covers:
- What GPUs and TPUs are
- How they work
- Which workloads run better on each
- Google’s current TPU lineup (including the eighth‑generation chips announced at Google Cloud Next 2026)
[Image: GPU vs. TPU comparison. Source: Google Cloud]
1. Background
GPUs
- Originally built for rendering video games.
- Handle AI workloads well because the underlying math—large‑scale parallel floating‑point operations—is the same.
- Researchers realized this around 2012, and GPUs quickly became the default for training neural networks.
TPUs
- In 2013, Google Brain engineers calculated that if every Android user used voice search for just three minutes a day, Google would need to double its global data‑center capacity.
- Running inference on general‑purpose GPUs at that scale was too expensive and power‑hungry.
- Google’s solution: a chip designed specifically for neural‑network math.
- The first TPU entered production in Google’s data centers in 2015 and became publicly available as Cloud TPU in 2018.
- Core idea: strip out everything a GPU carries from its graphics origins and focus entirely on matrix multiplication—a principle that still drives every TPU generation today.
2. How the Chips Work
2.1 GPU Architecture
- A parallel processor with thousands of smaller cores.
- Compared to a CPU (8–64 powerful general‑purpose cores), a high‑end GPU like the NVIDIA H100 has thousands of cores that run the same instruction across many data points at once (SIMD – Single Instruction, Multiple Data).
- Precision formats supported: FP32, FP16, BF16, INT8, FP8.
- Runs PyTorch, TensorFlow, JAX, CUDA libraries, simulations, rendering pipelines, etc.
- Because of this broad support, a GPU carries hardware for texture mapping, branch prediction, and other operations that sit idle during pure matrix multiplication.
- Memory: The NVIDIA H100 ships with 80 GB of on‑package HBM (HBM3 on the SXM part; the PCIe variant uses HBM2e). Memory bandwidth is often the bottleneck for AI workloads, not raw compute.
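The bandwidth-vs-compute point above can be made concrete with a back-of-the-envelope roofline check. The H100 figures below (~989 BF16 TFLOPS, ~3.35 TB/s HBM bandwidth for the SXM part) are assumed round numbers for illustration, not values from this article:

```python
# Roofline sketch: is a matmul compute-bound or memory-bound?
# Assumed H100 SXM figures (illustrative): ~989 BF16 TFLOPS, ~3.35 TB/s HBM.
PEAK_FLOPS = 989e12   # BF16 FLOP/s
PEAK_BW = 3.35e12     # bytes/s

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n], each tensor touched once."""
    flops = 2 * m * n * k
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic

# Intensity needed to saturate compute (~295 FLOP/byte with these numbers).
ridge = PEAK_FLOPS / PEAK_BW

for m in (1, 32, 1024):   # batch-1 decode vs. small batch vs. training-sized matmul
    ai = matmul_intensity(m, 4096, 4096)
    bound = "compute-bound" if ai >= ridge else "memory-bound"
    print(f"m={m:5d}: {ai:8.1f} FLOP/byte -> {bound}")
```

With these assumptions a batch‑1 matmul sits around 1 FLOP/byte (hopelessly memory‑bound), while a 1024‑row matmul clears the ~295 FLOP/byte ridge and becomes compute‑bound, which is why batch size matters so much on both chip families.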
2.2 TPU Architecture
- Built for one job: tensor math, specifically the matrix multiplications at the core of neural‑network training and inference.
- Key hardware: the systolic array.
- In a standard processor, each operation reads inputs from memory, computes, and writes the result back.
- In a systolic array, data flows through a grid of multiply‑and‑accumulate units. You load the weights once, pass inputs through the grid, and results flow from unit to unit without returning to main memory, eliminating constant memory round‑trips.
- Precision: Google added BF16 support from early generations; GPUs added it later. Recent chips (both GPU and TPU) support FP8 natively, boosting inference throughput.
- Limitations:
- Poor with dynamic control flow, variable‑length sequences, and custom operations.
- Best suited for static computation graphs, which is what most transformer models produce.
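The weight-stationary dataflow described above can be sketched in plain Python: a toy grid of multiply-and-accumulate cells where weights are loaded once and partial sums pass from cell to cell instead of round-tripping through main memory. This illustrates the principle only; it is not Google's actual microarchitecture:

```python
# Toy weight-stationary systolic computation of y = W @ x for one input vector.
# Illustrative sketch only -- a real systolic array pipelines many inputs at once.

def systolic_matvec(W, x):
    rows, cols = len(W), len(W[0])
    assert len(x) == cols
    # Step 1: load the weights into the grid ONCE (the "weight-stationary" part).
    grid = [[W[r][c] for c in range(cols)] for r in range(rows)]
    # Step 2: stream the input through; each cell multiplies its stationary
    # weight by the incoming activation and hands the accumulated partial sum
    # to its neighbor -- no write-back to "main memory" between cells.
    y = []
    for r in range(rows):
        acc = 0
        for c in range(cols):
            acc = acc + grid[r][c] * x[c]   # multiply-and-accumulate cell
        y.append(acc)
    return y

print(systolic_matvec([[1, 2], [3, 4]], [10, 20]))   # [50, 110]
```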
3. Choosing the Right Accelerator
3.1 When GPUs Are Recommended
| Workload Type | Reason |
|---|---|
| PyTorch‑first teams | Most research code, open‑source checkpoints, and fine‑tuning guides assume a GPU. |
| TensorFlow ops not on Cloud TPU | Some TF ops are unavailable on TPU (see Google’s op‑list). |
| Dynamic inputs (variable‑length sequences, conditional branches, custom CUDA extensions) | GPUs handle these gracefully; TPUs can be tricky. |
| Medium‑to‑large models with larger effective batch sizes | GPUs scale well with batch size. |
| Multi‑cloud or on‑prem deployments | TPUs exist only on Google Cloud. |
| Mixed workloads (ML training + scientific simulation + rendering) | GPUs are general‑purpose; TPUs are specialized. |
| Small teams moving fast | GPU tooling (profilers, debuggers, community tutorials) is more mature; diagnosing performance issues is easier. |
3.2 When TPUs Shine
| Workload Type | Reason |
|---|---|
| Training massive deep‑learning models (e.g., large language models) | TPUs handle the immense number of matrix calculations efficiently. |
| Models dominated by matrix computations | Systolic array excels at dense linear algebra. |
| Long‑running training jobs (weeks or months) | TPU pods provide high throughput and lower cost per token. |
| Ultra‑large embeddings (advanced ranking & recommendation) | TPUs’ memory architecture is optimized for large weight matrices. |
| Large‑scale transformer training | TPU pods scale to tens of thousands of chips via Google’s Inter‑Chip Interconnect (ICI); training something like Gemma on a TPU pod is often faster and cheaper than on a GPU cluster. |
| High‑volume production inference | TPU v6e (Trillium) and Ironwood are built specifically for inference; Ironwood delivers >4× better performance per chip vs. v6e. |
| Models with no custom PyTorch/JAX ops | Pure‑TensorFlow/JAX workloads map cleanly onto TPU hardware. |
| Google open‑weight models (e.g., Gemma 4, released Apr 2026) | Optimized for TPU serving; Google provides JAX reference implementations and community guides for deploying via vLLM on Cloud TPU. |
3.3 Workloads Not Suited for TPUs
- Workloads dominated by frequent branching or many element‑wise operations rather than dense linear algebra.
- Workloads that need high‑precision arithmetic (e.g., FP64).
- Neural‑network workloads that contain custom operations in the main training loop.
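A common workaround for the variable-length limitation above is to pad inputs into a small set of fixed "bucket" shapes, so the compiler sees only static graphs and avoids recompiling for every new sequence length. A minimal framework-free sketch (the bucket sizes are arbitrary examples):

```python
# Pad variable-length sequences to a few fixed bucket lengths so a TPU-style
# compiler only ever sees static shapes. Bucket sizes here are arbitrary.
BUCKETS = [32, 64, 128]
PAD_ID = 0

def pad_to_bucket(tokens):
    """Return (padded_tokens, real_length); raises if no bucket is large enough."""
    for size in BUCKETS:
        if len(tokens) <= size:
            return tokens + [PAD_ID] * (size - len(tokens)), len(tokens)
    raise ValueError(f"sequence of length {len(tokens)} exceeds largest bucket")

padded, n = pad_to_bucket(list(range(1, 41)))   # 40 real tokens -> 64-slot bucket
print(len(padded), n)                            # 64 40
```

The cost is wasted compute on padding; the benefit is that every batch compiles to one of three shapes instead of hundreds.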
4. Google’s Current TPU Lineup (as of Cloud Next 2026)
| Generation | Codename | Primary Use | Peak Compute* | Energy Efficiency | Notable Features |
|---|---|---|---|---|---|
| v5e | – | General‑purpose training & inference | – | – | Baseline generation |
| v6e | Trillium | High‑volume inference | – | – | Optimized memory bandwidth for serving |
| v7 | Ironwood | Next‑gen inference | 4× performance per chip vs. v6e | +67 % vs. v5e | FP8 native support, lower latency |
| v8 (8th‑gen) | – | Massive training pods | 4.7× peak compute of v5e | +67 % energy efficiency | Scales to tens of thousands of chips via ICI, integrated with vLLM for serving |
*Relative figures are as quoted by Google, against the comparison generation noted in each row.
5. Quick Takeaways
- GPU = versatile, mature tooling, works everywhere (cloud, on‑prem). Ideal for dynamic models, mixed workloads, and teams already deep in PyTorch.
- TPU = specialized for dense matrix math, excels at large‑scale training and high‑throughput inference when the workload fits a static graph. Best on Google Cloud, especially for transformer‑heavy workloads and Google‑published models.
Choose the accelerator that aligns with your framework, workload characteristics, and deployment environment.
Google TPU 8th‑Generation Overview
Key take‑aways
- Two new chips: TPU 8t (training) and TPU 8i (inference).
- Both run on Google’s Axion ARM host CPU and use liquid cooling.
- vLLM now supports the TPU v6e for both offline batch inference and online API serving.
TPU v7 (Ironwood) – Inference
- TPU v6e (Trillium), at 256 chips per pod, remains the work‑horse for cost‑sensitive inference workloads.
- Ironwood specs per chip
  - 4,614 FP8 TFLOPS
  - 192 GB HBM3E memory
  - 7.37 TB/s memory bandwidth
  - 9.6 Tb/s inter‑chip interconnect
- Pod scaling – up to 9,216 chips → 42.5 FP8 ExaFLOPS per pod (≈ 4× the per‑chip performance of the previous generation, Trillium).
- Announced at Google Cloud Next 2025.
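The pod-level figure follows directly from the per-chip numbers: 9,216 chips at 4,614 FP8 TFLOPS each works out to roughly 42.5 ExaFLOPS. A quick sanity check:

```python
# Sanity-check the pod math quoted above: chips per pod x per-chip FP8 TFLOPS.
CHIPS_PER_POD = 9_216
TFLOPS_PER_CHIP = 4_614   # FP8

pod_exaflops = CHIPS_PER_POD * TFLOPS_PER_CHIP * 1e12 / 1e18
print(f"{pod_exaflops:.1f} FP8 ExaFLOPS per pod")   # ~42.5
```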
TPU 8t – Training Chip
- Purpose: High‑throughput model training.
- Pod configuration – 9,600 chips, 2 PB shared HBM memory, 121 FP4 ExaFLOPS compute (≈ 3× the compute per pod of Ironwood).
- Inter‑chip bandwidth – 19.2 Tb/s per chip (double Ironwood).
- Network fabric – Virgo Network can link 134 k chips within a data‑center and theoretically > 1 M chips across sites.
- Data‑path enhancements
- TPUDirect RDMA & TPU Direct Storage bypass the host CPU, doubling bandwidth for large transfers.
- Efficiency target – 97 % goodput (i.e., 97 % of cycles spent on actual learning).
TPU 8i – Inference Chip
- Purpose: Low‑latency, high‑throughput inference (especially Mixture‑of‑Experts).
- Pod configuration – 1,152 chips, 11.6 FP8 ExaFLOPS per pod.
- Memory – 288 GB HBM per chip (more than the 8t) + 384 MB on‑chip SRAM (3× Ironwood).
- Performance
- 80 % better performance‑per‑dollar vs. Ironwood for inference.
- 2× better performance‑per‑watt.
- Interconnect – Boardfly reduces max network hops from 16 → 7, crucial for MoE models.
- Compute units – Replaces Ironwood’s SparseCores with a Collectives Acceleration Engine (CAE), cutting collective‑operation latency by 5×.
Why more memory on the inference chip?
Large MoE inference is memory‑bandwidth bound. Serving tokens requires streaming weights and KV‑cache faster than training, so the 8i packs more HBM per chip.
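The memory-bound claim implies a hard ceiling on single-stream decode speed: every generated token must stream the active weights from HBM, so tokens/sec cannot exceed bandwidth divided by bytes read per token. The 7.37 TB/s figure matches the per-chip bandwidth quoted earlier; the model sizes below are illustrative assumptions, not figures from this article:

```python
# Upper bound on batch-1 decode throughput when weight streaming dominates:
# tokens/s <= HBM bandwidth / bytes of (active) weights read per token.
HBM_BW = 7.37e12   # bytes/s per chip (figure quoted earlier in this post)

def max_tokens_per_sec(active_params, bytes_per_param=1):
    """Bandwidth ceiling; bytes_per_param=1 corresponds to FP8 weights."""
    return HBM_BW / (active_params * bytes_per_param)

# Assumed model sizes: a dense 70B model vs. an MoE with ~17B active params/token.
for name, params in [("dense-70B", 70e9), ("moe-17B-active", 17e9)]:
    print(f"{name}: <= {max_tokens_per_sec(params):.0f} tokens/s per chip")
```

This is also why MoE models pair well with an inference chip that has more HBM: only the routed experts' weights stream per token, raising the ceiling, but all experts still have to fit in memory.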
Tooling & Ecosystem
| Domain | Recommended Tooling |
|---|---|
| Research & Development | GPUs (mature ecosystem, large community) |
| Production AI on TPUs | JAX, TensorFlow, PyTorch XLA, vLLM (for TPU v6e) |
| Model Reference Implementations | MaxText – LLM reference for TPUs (GitHub) |
| Open‑weight LLMs | Gemma – DeepMind library (GitHub) |
| Inference Serving | Gemma 4 on TPU, custom serving stacks |
Further Reading
- Google Cloud Blog – TPU 8t and TPU 8i technical deep dive
- Google Cloud Blog – Ironwood: The first Google TPU for the age of inference
- Google Cloud Blog – Training large models on Ironwood TPUs
- Google Cloud Blog – Performance per dollar of GPUs and TPUs for AI inference
- Google Cloud Blog – Building production AI on Google Cloud TPUs with JAX
- GitHub – MaxText: LLM reference implementation for TPUs
- GitHub – Gemma open‑weight LLM library (DeepMind)
- TechRadar – Google Cloud unveils eighth‑generation TPUs
TL;DR
- TPU 8t: massive training pod (9,600 chips, 121 FP4 ExaFLOPS), double inter‑chip bandwidth, Virgo fabric for massive scaling.
- TPU 8i: inference‑focused pod (1,152 chips, 11.6 FP8 ExaFLOPS), more on‑chip memory, Boardfly interconnect, CAE for fast collectives.
- Both chips deliver dramatically better performance‑per‑dollar and performance‑per‑watt versus the previous Ironwood generation, and they are fully supported by the latest tooling (JAX, vLLM, MaxText, etc.).