Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding

Published: May 8, 2026 at 10:05 AM EDT

Source: Google Developers Blog

Overview

Researchers at UCSD have implemented DFlash, a block‑diffusion speculative decoding method, on Google TPUs to bypass the sequential bottleneck of traditional autoregressive drafting. By “painting” an entire block of candidate tokens in a single forward pass rather than predicting them one by one, the system achieved a reported average speedup of 3.13×.

Methodology

Block‑diffusion speculative decoding: paints whole blocks of candidate tokens in one forward pass.
Avoids the step‑by‑step prediction inherent in standard autoregressive drafting.
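The contrast between the two drafting styles can be illustrated with a toy sketch. Everything here is hypothetical: `toy_next` and `toy_fill` are stand-in functions rather than real draft models, and the code only counts how many drafter passes each style needs to propose k tokens. It does not reproduce DFlash's actual diffusion drafter.

```python
from typing import Callable, Dict, List

def autoregressive_draft(next_token: Callable[[List[int]], int],
                         prefix: List[int], k: int,
                         calls: Dict[str, int]) -> List[int]:
    """Draft k tokens one at a time: k sequential drafter passes."""
    ctx = list(prefix)
    for _ in range(k):
        calls["n"] += 1              # one drafter pass per token
        ctx.append(next_token(ctx))  # each token depends on the previous one
    return ctx[len(prefix):]

def block_draft(fill_block: Callable[[List[int], int], List[int]],
                prefix: List[int], k: int,
                calls: Dict[str, int]) -> List[int]:
    """Draft k tokens at once: a single drafter pass fills the whole block."""
    calls["n"] += 1
    return fill_block(prefix, k)

# Hypothetical toy drafters: the "language" is just counting upward.
def toy_next(ctx: List[int]) -> int:
    return ctx[-1] + 1

def toy_fill(prefix: List[int], k: int) -> List[int]:
    return [prefix[-1] + i for i in range(1, k + 1)]

ar_calls = {"n": 0}
blk_calls = {"n": 0}
ar_tokens = autoregressive_draft(toy_next, [1, 2, 3], 4, ar_calls)
blk_tokens = block_draft(toy_fill, [1, 2, 3], 4, blk_calls)
print(ar_tokens, ar_calls["n"])    # same draft, 4 drafter passes
print(blk_tokens, blk_calls["n"])  # same draft, 1 drafter pass
```

Both functions propose the same four tokens, but the autoregressive drafter needs four sequential passes while the block drafter needs one. That difference in sequential work is the bottleneck the article describes.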

Performance Gains

Average speedup: 3.13× over prior approaches.
Peak performance: nearly 2× the speed of existing methods such as EAGLE‑3.

Integration with vLLM

The technique is released as an open‑source integration into the vLLM ecosystem.
Leverages “free” parallel verification on TPU hardware.
Provides high‑quality draft predictions that speed up complex reasoning tasks.
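The parallel verification step can be sketched with the standard greedy acceptance rule used in speculative decoding. This is an illustration under assumptions, not the DFlash or vLLM implementation: `verify_block` and `toy_target` are hypothetical names, and the target model is assumed to return its greedy next-token choice for every position from a single forward pass.

```python
from typing import Callable, List

def verify_block(target_argmax: Callable[[List[int]], List[int]],
                 prefix: List[int], draft: List[int]) -> List[int]:
    """Greedy verification of a drafted block in one target pass.

    target_argmax(seq) is assumed to return, for each position i,
    the target's greedy choice for the token that follows seq[:i + 1].
    """
    seq = prefix + draft
    preds = target_argmax(seq)  # one parallel pass scores every draft slot
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        expected = preds[len(prefix) - 1 + i]  # target's choice for this slot
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
    accepted.append(preds[len(seq) - 1])  # whole block accepted: bonus token
    return accepted

# Hypothetical toy target: always continues with last token + 1.
def toy_target(seq: List[int]) -> List[int]:
    return [t + 1 for t in seq]

out_all = verify_block(toy_target, [1, 2, 3], [4, 5, 6, 7])
out_partial = verify_block(toy_target, [1, 2, 3], [4, 5, 9, 9])
print(out_all)      # every draft token accepted, plus one bonus token
print(out_partial)  # two draft tokens accepted, then the target's correction
```

Because the target checks all draft positions in one pass, a better draft yields more accepted tokens per target call, which is why draft quality translates directly into throughput.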
