TPU: Why Google Doesn’t Wait in Line for NVIDIA GPUs (2/2)

Published: December 13, 2025 at 02:39 AM EST
3 min read
Source: Dev.to

Precision Formats for Deep Learning

Traditional scientific computing uses FP64 (double precision) or FP32 (single precision), which are extremely accurate.
Deep learning, however, does not require that level of precision. Google therefore created bfloat16 (Brain Floating Point), a 16‑bit format that keeps FP32's 8 exponent bits, and with them its wide dynamic range (≈ 1e‑38 to 1e38), while truncating the mantissa to 7 bits and sacrificing some decimal precision.

  • FP16 has only 5 exponent bits, so its range is limited (smallest normal ≈ 6e‑5, maximum ≈ 6.5e4), which can cause overflow and training instability.
  • bfloat16 keeps the full FP32 range, making it a better fit for AI workloads.

NVIDIA later adopted bfloat16 for its A100 and H100 GPUs.
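
To make the range difference concrete, here is a minimal JAX sketch (any recent JAX install; the value 70,000 is just an illustrative number) that stores the same value in both 16‑bit formats:

```python
import jax.numpy as jnp

# 70,000 exceeds float16's maximum finite value (~65,504) and overflows to inf,
# but fits easily inside bfloat16's FP32-sized exponent range.
x_fp16 = jnp.asarray(70000.0, dtype=jnp.float16)
x_bf16 = jnp.asarray(70000.0, dtype=jnp.bfloat16)

print(x_fp16)  # inf     -> the kind of overflow that destabilizes FP16 training
print(x_bf16)  # ~70144  -> close but coarser: bfloat16 trades mantissa bits for range
```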

TPU Pods and Inter‑Chip Interconnect

A single TPU chip excels at matrix multiplication but cannot handle today’s massive models alone. Google groups chips into a hierarchical TPU Pod:

  1. TPU Chip → TPU Board
  2. TPU Board → TPU Rack
  3. TPU Rack → TPU Pod

A pod can contain up to 4,096 chips (in the TPU v4 generation), which the software presents as a single, massively parallel processor.
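
As a rough sketch of how that "single processor" illusion looks from user code, the JAX snippet below shards one logical array across every chip it can see (assuming a hypothetical 8‑chip slice; the 4×2 mesh shape is an illustrative choice, not a fixed topology):

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Hypothetical 8-chip slice: arrange the chips into a 4x2 logical mesh.
devices = np.array(jax.devices()).reshape(4, 2)
mesh = Mesh(devices, axis_names=("data", "model"))

# One logical array, physically split across all chips in the slice.
sharding = NamedSharding(mesh, PartitionSpec("data", "model"))
x = jax.device_put(jnp.ones((1024, 1024)), sharding)
print(x.sharding)
```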

Inter‑Chip Interconnect (ICI)

Standard Ethernet is too slow for the constant, low‑latency data exchange required during training. TPU Pods use a dedicated Inter‑Chip Interconnect (ICI) that bypasses the CPU. The chips are wired in a 3‑D torus topology (a “donut” shape), allowing any chip to reach the farthest chip in only a few hops.
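
A minimal sketch of the kind of traffic ICI carries: the all‑reduce below uses standard jax.pmap/psum, and while it runs on any backend, on a TPU slice the collective travels over the torus rather than the host network:

```python
import jax
import jax.numpy as jnp

# Sum one value per chip across the whole slice. During training the same
# collective pattern all-reduces gradients on every step, which is why the
# interconnect has to be this fast.
def allreduce(local):
    return jax.lax.psum(local, axis_name="chips")

allreduce_all = jax.pmap(allreduce, axis_name="chips")

n = jax.local_device_count()
print(allreduce_all(jnp.arange(n, dtype=jnp.float32)))
# every device ends up holding the same sum: 0 + 1 + ... + (n - 1)
```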

Optical Circuit Switch (OCS) – TPU v4

TPU v4 introduced an Optical Circuit Switch (OCS) that removes the optical‑to‑electrical conversion step required by conventional packet switches:

  • MEMS mirrors tilt to direct light‑carrying data beams, providing near‑zero latency.
  • Resiliency: If some chips fail, the mirrors can be re‑angled to reroute traffic instantly.

Cooling at Scale

Bundling thousands of chips generates enormous heat. Google employs direct‑to‑chip liquid cooling, running coolant pipes directly on top of the chips—effectively turning data centers into massive aquariums. This approach predates NVIDIA’s recent adoption of liquid cooling for the H100.

Software Stack: From TensorFlow to JAX

Component        Role              Input         Output
JAX (frontend)   User interface    Python code   Intermediate Representation (HLO)
XLA (backend)    Compiler engine   HLO           Computation graph / binary for TPU/GPU

  • Auto‑differentiation (grad) and vectorization (vmap) are handled by JAX.
  • XLA performs kernel fusion, reducing memory trips and fully exploiting the TPU’s systolic array.
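
A small end‑to‑end sketch of that pipeline (plain JAX; the loss function is an arbitrary example): grad builds the backward pass, jit hands the program to XLA, and lower(...).as_text() lets you peek at the HLO in between.

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    # matmul -> tanh -> square -> sum: XLA fuses this chain into few kernels,
    # keeping intermediates on-chip instead of round-tripping through memory.
    return jnp.sum(jnp.tanh(x @ w) ** 2)

w = jnp.ones((128, 64))
x = jnp.ones((32, 128))

grad_fn = jax.jit(jax.grad(loss))   # grad: autodiff, jit: compile via XLA
print(grad_fn(w, x).shape)          # (128, 64)

# Inspect the intermediate representation that JAX hands to XLA.
print(jax.jit(loss).lower(w, x).as_text()[:400])
```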

Ironwood TPU (TPU v7)

In 2025 Google announced Ironwood, the 7th‑generation TPU, designed for both LLM inference efficiency and large‑scale training.

  • FP8 support (first TPU with native 8‑bit floating‑point).
  • Compute: 4,614 TFLOPS (FP8).
  • Memory: 192 GB of HBM3E per chip with 7.37 TB/s of bandwidth, targeting memory‑bandwidth‑bound workloads (see the rough roofline sketch below).
  • Pod scaling: Up to 9,216 chips per pod.
  • ICI bandwidth: 1.2 TB/s bi‑directional.
  • Power efficiency: roughly 2× the performance per watt of TPU v6 (Trillium), still with direct‑to‑chip liquid cooling.

Source: Google Cloud Blog – Ironwood TPU
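
A back‑of‑the‑envelope way to read the "memory‑bandwidth bound" remark above, using only the listed figures (a rough roofline‑style estimate, not an official number):

```python
# Roofline "ridge point" computed from the Ironwood figures above.
peak_flops = 4614e12   # 4,614 TFLOPS at FP8
hbm_bw = 7.37e12       # 7.37 TB/s of HBM3E bandwidth

ridge = peak_flops / hbm_bw
print(f"{ridge:.0f} FLOPs per byte")  # ~626
# Kernels whose arithmetic intensity is below ~626 FLOPs/byte (typical for
# LLM decoding) are limited by memory bandwidth rather than the matrix units.
```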

Why GPUs Remain Dominant

  • Software ecosystem: NVIDIA’s CUDA has been evolving since 2006; over 90 % of AI researchers write code for CUDA.
  • Framework optimization: PyTorch is heavily optimized for CUDA. While PyTorch can run on TPUs, the experience is less mature.
  • Accessibility: GPUs can be purchased and installed on‑premise; TPUs are only available via Google Cloud Platform. Organizations already on AWS or Azure find TPUs difficult to adopt.

Thus, despite the TPU’s hardware advantages, the entrenched software stack and accessibility of GPUs keep them at the forefront of AI development.
