Supercharging LLM inference on Google TPUs: Achieving 3X speedups with diffusion-style speculative decoding
Source: Google Developers Blog
Overview
Researchers at UCSD have implemented DFlash, a block‑diffusion speculative decoding method, on Google TPUs to bypass the sequential bottleneck of traditional autoregressive drafting. By “painting” an entire block of candidate tokens in a single forward pass rather than predicting them one by one, the system achieved an average speedup of 3.13× and, at peak, nearly 2× the speed of existing methods such as EAGLE‑3.
Methodology
- Block‑diffusion speculative decoding: paints whole blocks of candidate tokens in one forward pass.
- Avoids the step‑by‑step prediction inherent in standard autoregressive drafting.
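To make the contrast concrete, here is a minimal sketch that counts drafter forward passes for the two drafting styles. It is not the DFlash implementation: `toy_next_token` and `toy_fill_block` are hypothetical stand-ins for a drafter model, and each call to them is assumed to represent one forward pass.

```python
from typing import Callable, List, Tuple

Token = int


def draft_autoregressive(
    next_token: Callable[[List[Token]], Token],
    context: List[Token],
    k: int,
) -> Tuple[List[Token], int]:
    """Draft k tokens one at a time: k sequential drafter passes."""
    tokens: List[Token] = []
    passes = 0
    for _ in range(k):
        # Each new token depends on the previous one, so the calls cannot overlap.
        tokens.append(next_token(context + tokens))
        passes += 1
    return tokens, passes


def draft_block(
    fill_block: Callable[[List[Token], int], List[Token]],
    context: List[Token],
    k: int,
) -> Tuple[List[Token], int]:
    """Draft k tokens at once: a single drafter pass fills the whole block."""
    return fill_block(context, k), 1


# Hypothetical toy drafters: both simply continue a counting sequence.
def toy_next_token(ctx: List[Token]) -> Token:
    return ctx[-1] + 1


def toy_fill_block(ctx: List[Token], k: int) -> List[Token]:
    return [ctx[-1] + i for i in range(1, k + 1)]


ar_tokens, ar_passes = draft_autoregressive(toy_next_token, [1, 2, 3], k=4)
blk_tokens, blk_passes = draft_block(toy_fill_block, [1, 2, 3], k=4)
print(ar_tokens, ar_passes)    # [4, 5, 6, 7] 4
print(blk_tokens, blk_passes)  # [4, 5, 6, 7] 1
```

Both drafters propose the same four tokens here, but the block drafter does so in one pass instead of four, which is the sequential cost the method is designed to remove.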
Performance Gains
- Average speedup: 3.13× over prior approaches.
- Peak performance: nearly 2× the speed of existing methods such as EAGLE‑3.
Integration with vLLM
- The technique is released as an open‑source integration into the vLLM ecosystem.
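- Leverages “free” parallel verification on TPU hardware: the target model checks a whole block of draft tokens in one forward pass, at little extra cost over decoding a single token.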
- Provides high‑quality draft predictions for complex reasoning tasks.
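The parallel verification step mentioned above can be sketched as follows. This is an illustrative example that assumes greedy decoding, not the vLLM or DFlash code: `verify_block` and `toy_target` are hypothetical names, and the loop that builds `preds` stands in for what a real system would compute in one batched forward pass of the target model.

```python
from typing import Callable, List

Token = int


def verify_block(
    target_greedy: Callable[[List[Token]], Token],
    context: List[Token],
    draft: List[Token],
) -> List[Token]:
    """Return the tokens committed in one draft-and-verify cycle (greedy case)."""
    # The target's greedy choice at every draft position, each conditioned on
    # the context plus the draft prefix before it. On real hardware these
    # predictions come out of a single parallel pass over the block.
    preds = [target_greedy(context + draft[:i]) for i in range(len(draft) + 1)]

    # Accept the longest prefix of the draft that matches the target's choices.
    accepted: List[Token] = []
    for i, tok in enumerate(draft):
        if tok != preds[i]:
            break
        accepted.append(tok)

    # Always keep the target's own token at the first mismatch, or the bonus
    # token after a fully accepted block, so every cycle makes progress.
    accepted.append(preds[len(accepted)])
    return accepted


# Hypothetical toy target model: always continues a counting sequence.
def toy_target(ctx: List[Token]) -> Token:
    return ctx[-1] + 1


partial = verify_block(toy_target, [1, 2, 3], [4, 5, 9, 7])
full = verify_block(toy_target, [1, 2, 3], [4, 5, 6, 7])
print(partial)  # [4, 5, 6]
print(full)     # [4, 5, 6, 7, 8]
```

Under greedy verification the committed tokens are exactly what the target model would have produced on its own, so draft quality affects only how many tokens each cycle yields, not the final output.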