Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding

Published: May 8, 2026 at 10:05 AM EDT

Source: Google Developers Blog

Overview

Researchers at UCSD have successfully implemented DFlash, a block‑diffusion speculative decoding method, on Google TPUs to bypass the sequential bottlenecks of traditional autoregressive drafting. By “painting” entire blocks of candidate tokens in a single forward pass rather than predicting them one‑by‑one, the system achieved notable speed improvements.

Methodology

  • Block‑diffusion speculative decoding: paints whole blocks of candidate tokens in one forward pass.
  • Avoids the step‑by‑step prediction inherent in standard autoregressive drafting.
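
The contrast between the two drafting styles can be illustrated with a toy sketch. This is not the DFlash implementation: the drafter functions `toy_next_token` and `toy_fill_block` are hypothetical stand-ins for a real draft model, and the only point is to count how many drafter calls each style needs to propose the same block of tokens.

```python
from typing import Callable, List, Tuple

Token = int


def draft_autoregressive(
    next_token: Callable[[List[Token]], Token],
    prefix: List[Token],
    k: int,
) -> Tuple[List[Token], int]:
    """Draft k tokens one at a time; returns (tokens, drafter calls)."""
    tokens: List[Token] = []
    calls = 0
    for _ in range(k):
        # Each new token depends on the previous one, so calls are sequential.
        tokens.append(next_token(prefix + tokens))
        calls += 1
    return tokens, calls


def draft_block(
    fill_block: Callable[[List[Token], int], List[Token]],
    prefix: List[Token],
    k: int,
) -> Tuple[List[Token], int]:
    """Draft k tokens with a single call that fills the whole block."""
    return fill_block(prefix, k), 1


# Hypothetical toy drafters: both simply count upward from the last token.
def toy_next_token(context: List[Token]) -> Token:
    return (context[-1] + 1) % 100


def toy_fill_block(prefix: List[Token], k: int) -> List[Token]:
    return [(prefix[-1] + i) % 100 for i in range(1, k + 1)]


ar_tokens, ar_calls = draft_autoregressive(toy_next_token, [7], 8)
blk_tokens, blk_calls = draft_block(toy_fill_block, [7], 8)
print(ar_tokens == blk_tokens, ar_calls, blk_calls)
```

In this toy setting both drafters propose the same eight tokens, but the autoregressive one needs eight sequential calls while the block drafter needs one. That difference in sequential steps is the bottleneck the block-diffusion approach is described as avoiding.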

Performance Gains

  • Average speedup: 3.13× over prior approaches.
  • Peak performance: nearly double the speed of existing methods such as EAGLE‑3.

Integration with vLLM

  • The technique is released as an open‑source integration into the vLLM ecosystem.
  • Leverages “free” parallel verification on TPU hardware.
  • Relies on high‑quality draft predictions to improve inference efficiency on complex reasoning tasks.
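
Parallel verification can be sketched in a few lines. The sketch assumes greedy acceptance (keep draft tokens while they match the target model's own choice), which may differ from the acceptance rule used in the actual vLLM integration. `toy_target` is a hypothetical stand-in for the target model, and `verify_block` is an illustrative name, not a vLLM API.

```python
from typing import Callable, List


def verify_block(
    target_next_tokens: Callable[[List[int], List[int]], List[int]],
    prefix: List[int],
    draft: List[int],
) -> List[int]:
    """Greedy verification of a drafted block with one target call.

    target_next_tokens(prefix, draft) must return len(draft) + 1 tokens:
    the target's choice after prefix, after prefix + draft[:1], and so on
    up to prefix + draft. A real model produces these in one forward pass.
    """
    target = target_next_tokens(prefix, draft)
    accepted: List[int] = []
    for drafted, wanted in zip(draft, target):
        if drafted != wanted:
            break
        accepted.append(drafted)
    # The target's own token at the first mismatch (or after a fully
    # accepted block) is always kept, so every step emits at least one token.
    accepted.append(target[len(accepted)])
    return accepted


# Hypothetical toy target: always continues with (last token + 1) mod 100.
def toy_target(prefix: List[int], draft: List[int]) -> List[int]:
    out = []
    for i in range(len(draft) + 1):
        context = prefix + draft[:i]
        out.append((context[-1] + 1) % 100)
    return out


partial = verify_block(toy_target, [7], [8, 9, 10, 99, 12])
full = verify_block(toy_target, [7], [8, 9, 10])
print(partial, full)
```

The number of accepted tokens per target call depends on how often the draft matches the target, which is why draft quality matters for the overall speedup.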