[Paper] Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Published: December 16, 2025 at 01:45 PM EST
4 min read

Source: arXiv - 2512.14681v1

Overview

The paper “Fast and Accurate Causal Parallel Decoding using Jacobi Forcing” tackles one of the biggest bottlenecks in deploying large language models (LLMs): slow, token‑by‑token (autoregressive) generation that makes real‑time applications feel laggy. By introducing a new training paradigm called Jacobi Forcing, the authors turn a standard causal (left‑to‑right) transformer into a parallel decoder that generates many tokens per step while matching the quality of traditional autoregressive decoding. Their experiments show wall‑clock speed‑ups of 3.8×–4.0× on coding and math tasks with only a tiny drop in accuracy.

Key Contributions

  • Jacobi Forcing paradigm – a progressive distillation technique where a model is trained on its own parallel‑decoding trajectories, smoothly bridging the gap between autoregressive pre‑training and parallel inference.
  • Causal‑compatible parallel decoder – retains the causal attention bias learned during pre‑training, enabling exact KV‑cache reuse (a major speed win on GPUs/TPUs).
  • Multi‑block decoding with rejection recycling – a runtime strategy that re‑uses partially accepted token blocks, boosting the number of tokens accepted per iteration by up to 4.5×.
  • Empirical validation – demonstrates 3.8–3.9× wall‑clock speed‑ups on code generation (HumanEval) and math reasoning (MATH) benchmarks with a < 1% absolute drop in pass@1 or accuracy.
  • Open‑source release – code, training scripts, and pretrained checkpoints are publicly available, lowering the barrier for industry adoption.

Methodology

  1. Start from a standard causal transformer (e.g., GPT‑style) that has been pretrained on large text corpora.
  2. Generate parallel decoding trajectories: during training, the model predicts a whole block of future tokens in one forward pass, using its own previous predictions as input (similar to how diffusion LLMs work).
  3. Jacobi Forcing loss: the model is penalized both for deviating from the ground‑truth sequence and for deviating from its own earlier parallel predictions. This “self‑forcing” gradually shifts the model from strict left‑to‑right generation toward reliable blockwise generation (a minimal loss sketch follows this list).
  4. Curriculum schedule – early epochs rely heavily on the teacher (ground truth), while later epochs increase the weight of self‑generated trajectories, ensuring a smooth transition without the “pre‑train/post‑train mismatch” that plagues earlier parallel decoding attempts.
  5. Inference with KV‑cache reuse: because the model still respects causal ordering internally, the key‑value cache built for earlier tokens can be reused across blocks, avoiding the costly recomputation that bidirectional decoders need.
  6. Multi‑block decoding + rejection recycling: at inference time, the model emits several candidate blocks; blocks that fail a lightweight consistency check are rejected and regenerated, while accepted blocks are kept, effectively increasing the number of tokens accepted per iteration (see the inference sketch after this list).
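
To make steps 2–4 concrete, here is a minimal PyTorch‑style sketch of how a Jacobi‑Forcing‑style objective could weigh the two penalties under a curriculum schedule. The function name, tensor shapes, and the scalar `gt_weight` schedule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def jacobi_forcing_loss(
    logits: torch.Tensor,       # [batch, block_len, vocab] outputs for one parallel block
    gt_tokens: torch.Tensor,    # [batch, block_len] ground-truth (teacher) tokens
    self_tokens: torch.Tensor,  # [batch, block_len] the model's own earlier parallel predictions
    gt_weight: float,           # curriculum weight in [0, 1]: high early, lower in later epochs
) -> torch.Tensor:
    """Illustrative Jacobi-Forcing-style objective (steps 3-4); not the paper's exact loss."""
    vocab = logits.size(-1)
    flat = logits.reshape(-1, vocab)
    # Penalty for deviating from the ground-truth sequence (teacher signal).
    loss_gt = F.cross_entropy(flat, gt_tokens.reshape(-1))
    # Penalty for deviating from the model's own earlier parallel trajectory ("self-forcing").
    loss_self = F.cross_entropy(flat, self_tokens.reshape(-1))
    # Curriculum schedule: early training leans on the teacher, later training on self-targets.
    return gt_weight * loss_gt + (1.0 - gt_weight) * loss_self
```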
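
A similarly hedged sketch of the inference loop in steps 5–6 is below. The `draft_block` and `verify` callables are hypothetical stand‑ins for the model's parallel drafting pass and its causal verification pass (which reuses the KV cache for the committed prefix); the point illustrated is rejection recycling, i.e., seeding the next draft with the unaccepted tail of the previous block instead of discarding it.

```python
from typing import Callable, List, Tuple


def decode_with_recycling(
    draft_block: Callable[[List[int], List[int], int], List[int]],
    verify: Callable[[List[int], List[int]], Tuple[int, int]],
    prompt: List[int],
    eos_id: int,
    block_size: int = 16,
    max_new_tokens: int = 256,
) -> List[int]:
    """Illustrative multi-block decoding loop with rejection recycling (steps 5-6).

    draft_block(prefix, recycled, k): drafts k tokens in parallel, seeded by `recycled`.
    verify(prefix, draft): runs the causal model once (reusing the KV cache for `prefix`)
    and returns (n_accepted, corrected_token): the accepted prefix of the draft plus the
    model's own token at the first mismatch, so every iteration commits at least one token.
    """
    out = list(prompt)
    recycled: List[int] = []  # unaccepted tail of the previous block, reused instead of discarded
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, recycled, block_size)
        n_accepted, corrected = verify(out, draft)
        out.extend(draft[:n_accepted] + [corrected])
        # Rejection recycling: carry the still-plausible remainder into the next draft.
        recycled = draft[n_accepted + 1:]
        if corrected == eos_id or eos_id in draft[:n_accepted]:
            break
    return out
```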

Results & Findings

| Benchmark | Metric (baseline AR) | Jacobi Forcing speed‑up | Accuracy Δ |
| --- | --- | --- | --- |
| HumanEval (code) | 71.2% pass@1 | 3.8× wall‑clock | –0.6% |
| MATH (math) | 45.3% accuracy | 3.9× wall‑clock | –0.8% |
| WikiText‑103 (perplexity) | 19.1 | 3.7× wall‑clock | +0.2 (slight improvement) |

  • Token acceptance per iteration rose from ~1 token (AR) to ≈4.5 tokens with rejection recycling.
  • KV‑cache reuse contributed ~30% of the total speed gain; the rest came from blockwise parallelism.
  • Ablation studies confirm that both the progressive distillation schedule and the rejection recycling are essential; removing either drops speed‑up to < 2×.

Practical Implications

| Who Benefits | Why It Matters | How to Leverage |
| --- | --- | --- |
| LLM‑powered IDEs & code assistants | Faster code suggestions keep developers in the flow. | Swap the standard decoder for a Jacobi‑forced checkpoint; no changes to the existing API. |
| Chatbot platforms | Lower latency improves user satisfaction and reduces server cost. | Deploy multi‑block decoding with a modest compute budget increase to meet sub‑100 ms response targets. |
| Edge or mobile inference | Parallel decoding reduces the number of sequential GPU kernels, saving power. | Use the provided lightweight checkpoint (e.g., 2.7B) with KV‑cache reuse on mobile GPUs/NPUs. |
| Research labs | Enables rapid prototyping of longer prompts (e.g., few‑shot chain‑of‑thought). | Fine‑tune a base causal model with the Jacobi Forcing recipe to retain domain‑specific knowledge while gaining speed. |

Overall, the technique offers a drop‑in upgrade for any existing causal transformer stack, delivering near‑autoregressive quality with multi‑token throughput that can cut inference costs by 50‑70%.

Limitations & Future Work

  • Compute‑vs‑latency trade‑off: Rejection recycling adds extra forward passes; on heavily loaded servers the extra compute may offset latency gains unless carefully budgeted.
  • Block size sensitivity: Very large blocks (> 64 tokens) start to degrade quality, suggesting a sweet spot that may vary per domain.
  • Generalization to multimodal models: The paper focuses on text‑only LLMs; extending Jacobi Forcing to vision‑language or audio models remains an open question.
  • Theoretical analysis: While empirical results are strong, a formal convergence guarantee for the progressive distillation schedule is not provided.

Future directions include adaptive block sizing based on runtime confidence, integration with quantization pipelines for even lower latency, and exploring Jacobi Forcing for encoder‑decoder architectures used in translation or summarization.

Authors

  • Lanxiang Hu
  • Siqi Kou
  • Yichao Fu
  • Samyam Rajbhandari
  • Tajana Rosing
  • Yuxiong He
  • Zhijie Deng
  • Hao Zhang

Paper Information

  • arXiv ID: 2512.14681v1
  • Categories: cs.CL
  • Published: December 16, 2025