[Paper] Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Published: December 16, 2025 at 01:45 PM EST
4 min read

Source: arXiv - 2512.14681v1

Overview

The paper “Fast and Accurate Causal Parallel Decoding using Jacobi Forcing” tackles one of the biggest bottlenecks in deploying large language models (LLMs): slow, token‑by‑token (autoregressive) generation that makes real‑time applications feel laggy. By introducing a new training paradigm called Jacobi Forcing, the authors turn a standard causal (left‑to‑right) transformer into a parallel decoder that generates many tokens per step while matching the quality of traditional autoregressive decoding. Their experiments show wall‑clock speed‑ups of 3.8×–4.0× on coding and math tasks with only a tiny drop in accuracy.

Key Contributions

  • Jacobi Forcing paradigm – a progressive distillation technique where a model is trained on its own parallel‑decoding trajectories, smoothly bridging the gap between autoregressive pre‑training and parallel inference.
  • Causal‑compatible parallel decoder – retains the causal attention bias learned during pre‑training, enabling exact KV‑cache reuse (a major speed win on GPUs/TPUs).
  • Multi‑block decoding with rejection recycling – a runtime strategy that re‑uses partially accepted token blocks, boosting the number of tokens accepted per iteration by up to 4.5×.
  • Empirical validation – demonstrates 3.8–3.9× wall‑clock speed‑ups on code generation (HumanEval) and math reasoning (MATH) benchmarks with a < 1% absolute drop in pass@1 or accuracy.
  • Open‑source release – code, training scripts, and pretrained checkpoints are publicly available, lowering the barrier for industry adoption.

Methodology

  1. Start from a standard causal transformer (e.g., GPT‑style) that has been pretrained on large text corpora.
  2. Generate parallel decoding trajectories: during training, the model predicts a whole block of future tokens in one forward pass, using its own previous predictions as input (similar to how diffusion LLMs work).
  3. Jacobi Forcing loss: the model is penalized both for deviating from the ground‑truth sequence and for deviating from its own earlier parallel predictions. This “self‑forcing” gradually shifts the model from strict left‑to‑right generation toward reliable blockwise generation (a minimal loss sketch follows this list).
  4. Curriculum schedule – early epochs rely heavily on the teacher (ground truth), while later epochs increase the weight of self‑generated trajectories, ensuring a smooth transition without the “pre‑train/post‑train mismatch” that plagues earlier parallel decoding attempts.
  5. Inference with KV‑cache reuse: because the model still respects causal ordering internally, the key‑value cache built for earlier tokens can be reused across blocks, avoiding the costly recomputation that bidirectional decoders need.
  6. Multi‑block decoding + rejection recycling: at inference time, the model emits several candidate blocks; blocks that fail a lightweight consistency check are rejected and regenerated, while accepted blocks are kept, effectively increasing the number of tokens accepted per iteration (see the inference sketch after this list).
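
To make steps 2–4 concrete, here is a minimal PyTorch‑style sketch of how a Jacobi‑Forcing‑style objective could weigh the two penalties under a curriculum schedule. The function name, tensor shapes, and the scalar `gt_weight` schedule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def jacobi_forcing_loss(
    logits: torch.Tensor,       # [batch, block_len, vocab] outputs for one parallel block
    gt_tokens: torch.Tensor,    # [batch, block_len] ground-truth (teacher) tokens
    self_tokens: torch.Tensor,  # [batch, block_len] the model's own earlier parallel predictions
    gt_weight: float,           # curriculum weight in [0, 1]: high early, lower in later epochs
) -> torch.Tensor:
    """Illustrative Jacobi-Forcing-style objective (steps 3-4); not the paper's exact loss."""
    vocab = logits.size(-1)
    flat = logits.reshape(-1, vocab)
    # Penalty for deviating from the ground-truth sequence (teacher signal).
    loss_gt = F.cross_entropy(flat, gt_tokens.reshape(-1))
    # Penalty for deviating from the model's own earlier parallel trajectory ("self-forcing").
    loss_self = F.cross_entropy(flat, self_tokens.reshape(-1))
    # Curriculum schedule: early training leans on the teacher, later training on self-targets.
    return gt_weight * loss_gt + (1.0 - gt_weight) * loss_self
```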
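
A similarly hedged sketch of the inference loop in steps 5–6 is below. The `draft_block` and `verify` callables are hypothetical stand‑ins for the model's parallel drafting pass and its causal verification pass (which reuses the KV cache for the committed prefix); the point illustrated is rejection recycling, i.e., seeding the next draft with the unaccepted tail of the previous block instead of discarding it.

```python
from typing import Callable, List, Tuple


def decode_with_recycling(
    draft_block: Callable[[List[int], List[int], int], List[int]],
    verify: Callable[[List[int], List[int]], Tuple[int, int]],
    prompt: List[int],
    eos_id: int,
    block_size: int = 16,
    max_new_tokens: int = 256,
) -> List[int]:
    """Illustrative multi-block decoding loop with rejection recycling (steps 5-6).

    draft_block(prefix, recycled, k): drafts k tokens in parallel, seeded by `recycled`.
    verify(prefix, draft): runs the causal model once (reusing the KV cache for `prefix`)
    and returns (n_accepted, corrected_token): the accepted prefix of the draft plus the
    model's own token at the first mismatch, so every iteration commits at least one token.
    """
    out = list(prompt)
    recycled: List[int] = []  # unaccepted tail of the previous block, reused instead of discarded
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, recycled, block_size)
        n_accepted, corrected = verify(out, draft)
        out.extend(draft[:n_accepted] + [corrected])
        # Rejection recycling: carry the still-plausible remainder into the next draft.
        recycled = draft[n_accepted + 1:]
        if corrected == eos_id or eos_id in draft[:n_accepted]:
            break
    return out
```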

Results & Findings

| Benchmark | Metric (baseline AR) | Jacobi Forcing speed‑up | Accuracy Δ |
| --- | --- | --- | --- |
| HumanEval (code) | 71.2% pass@1 | 3.8× wall‑clock | –0.6% |
| MATH (math) | 45.3% accuracy | 3.9× wall‑clock | –0.8% |
| WikiText‑103 (perplexity) | 19.1 | 3.7× wall‑clock | +0.2 (slight improvement) |

  • Token acceptance per iteration rose from ~1 token (AR) to ≈4.5 tokens with rejection recycling.
  • KV‑cache reuse contributed ~30% of the total speed gain; the rest came from blockwise parallelism.
  • Ablation studies confirm that both the progressive distillation schedule and the rejection recycling are essential; removing either drops speed‑up to < 2×.

Practical Implications

| Who Benefits | Why It Matters | How to Leverage |
| --- | --- | --- |
| LLM‑powered IDEs & code assistants | Faster code suggestions keep developers in the flow. | Swap the standard decoder for a Jacobi‑forced checkpoint; no changes to the existing API. |
| Chatbot platforms | Lower latency improves user satisfaction and reduces server cost. | Deploy multi‑block decoding with a modest compute budget increase to meet sub‑100 ms response targets. |
| Edge or mobile inference | Parallel decoding reduces the number of sequential GPU kernels, saving power. | Use the provided lightweight checkpoint (e.g., 2.7B) with KV‑cache reuse on mobile GPUs/NPUs. |
| Research labs | Enables rapid prototyping of longer prompts (e.g., few‑shot chain‑of‑thought). | Fine‑tune a base causal model with the Jacobi Forcing recipe to retain domain‑specific knowledge while gaining speed. |

Overall, the technique offers a drop‑in upgrade for any existing causal transformer stack, delivering near‑autoregressive quality with multi‑token throughput that can cut inference costs by 50‑70%.

Limitations & Future Work

  • Compute‑vs‑latency trade‑off: Rejection recycling adds extra forward passes; on heavily loaded servers the extra compute may offset latency gains unless carefully budgeted.
  • Block size sensitivity: Very large blocks (> 64 tokens) start to degrade quality, suggesting a sweet spot that may vary per domain.
  • Generalization to multimodal models: The paper focuses on text‑only LLMs; extending Jacobi Forcing to vision‑language or audio models remains an open question.
  • Theoretical analysis: While empirical results are strong, a formal convergence guarantee for the progressive distillation schedule is not provided.

Future directions include adaptive block sizing based on runtime confidence, integration with quantization pipelines for even lower latency, and exploring Jacobi Forcing for encoder‑decoder architectures used in translation or summarization.

Authors

  • Lanxiang Hu
  • Siqi Kou
  • Yichao Fu
  • Samyam Rajbhandari
  • Tajana Rosing
  • Yuxiong He
  • Zhijie Deng
  • Hao Zhang

Paper Information

  • arXiv ID: 2512.14681v1
  • Categories: cs.CL
  • Published: December 16, 2025