[Paper] Parallel Token Prediction for Language Models
Source: arXiv - 2512.21323v1
Overview
The paper introduces Parallel Token Prediction (PTP), a new framework that lets large language models generate several dependent tokens at once instead of one by one. By folding the sampling procedure into the model itself, PTP cuts the sequential forward‑pass latency that dominates autoregressive decoding while preserving the full expressive power of the original model.
Key Contributions
- Universal parallel generation – PTP can represent any autoregressive distribution, eliminating the independence assumptions that limit existing multi‑token methods.
- Joint token prediction in a single transformer pass – Multiple tokens are sampled together, dramatically reducing the number of forward passes required for long outputs.
- Two training pathways – (1) Distillation from a pretrained autoregressive teacher, and (2) Inverse autoregressive training that learns directly from data without a teacher.
- Theoretical guarantees – The authors prove that PTP can exactly recover any autoregressive sequence distribution given enough capacity.
- State‑of‑the‑art speculative decoding – On the Vicuna‑7B model, PTP accepts >4 tokens per decoding step on the Spec‑Bench benchmark, outperforming prior speculative decoding baselines.
Methodology
- Embedding the sampling process – Instead of treating token sampling as an external step, PTP augments the transformer’s output layer to emit a joint distribution over a block of k future tokens (a toy illustration of this idea follows this list).
- Conditional factorization – The joint distribution is factorized so that dependencies between tokens are respected (e.g., using a self‑attention mask that reveals only previously predicted tokens within the block).
- Training options
- Distillation: A conventional autoregressive model generates teacher trajectories; PTP learns to match the teacher’s joint distribution over token blocks.
- Inverse autoregressive training: PTP directly maximizes the likelihood of observed sequences under its block‑wise factorization, using a re‑parameterization trick to back‑propagate through the sampling decisions.
- Decoding – At inference time, the model predicts a block of k tokens in one forward pass, then slides the window forward by k positions (or by a smaller stride if a rejection step is needed); a sketch of this loop appears after the compatibility note below.
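As a deliberately simplified illustration of the two points above (my reading, not the paper's construction): if per‑position random numbers are treated as inputs, each block token can be obtained by inverse‑CDF sampling of its conditional distribution, and the resulting map from (prefix, u_1, …, u_k) to the k tokens is deterministic, so a sufficiently expressive network could in principle compute it in a single parallel pass. The helper `toy_conditional` below is a hypothetical stand‑in for a language model's next‑token distribution.

```python
# Minimal sketch, assuming an inverse-CDF reading of "folding sampling into
# the model"; toy_conditional is a hypothetical stand-in, not the paper's model.
import numpy as np

VOCAB = 8  # toy vocabulary size

def toy_conditional(context: list[int]) -> np.ndarray:
    """Hypothetical p(next token | context); any autoregressive model fits here."""
    rng = np.random.default_rng(hash(tuple(context)) % (2**32))
    logits = rng.standard_normal(VOCAB)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def inverse_cdf(probs: np.ndarray, u: float) -> int:
    """Map a uniform u in [0, 1) to a token index via the inverse CDF."""
    return int(min(np.searchsorted(np.cumsum(probs), u), len(probs) - 1))

def sample_block(prefix: list[int], u: np.ndarray) -> list[int]:
    """Turn k uniforms into k dependent tokens. The map is deterministic given
    (prefix, u), which is what allows a parallel network to reproduce it."""
    block: list[int] = []
    for u_i in u:
        block.append(inverse_cdf(toy_conditional(prefix + block), float(u_i)))
    return block

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = rng.random(4)                  # randomness enters as an input
    print(sample_block([1, 2, 3], u))  # same prefix + same u -> same block
```

The loop here is sequential only for clarity; the point is that the block is a deterministic function of the prefix and the injected randomness, so with enough capacity it can be produced in one transformer pass.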
The approach is compatible with any transformer architecture (decoder‑only, encoder‑decoder, etc.) and does not require architectural changes beyond the modified output head.
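To make the decoding step concrete, here is a hedged sketch of a block‑wise decoding loop with an optional speculative‑style rejection step. `predict_block` and `verify` are hypothetical callables standing in for a PTP block predictor and a verifier (e.g., the base model); this is not the paper's decoding code.

```python
# Hedged sketch of block-wise decoding with optional rejection; the callables
# predict_block and verify are hypothetical stand-ins, not the paper's API.
from typing import Callable, List

def decode(prompt: List[int],
           predict_block: Callable[[List[int]], List[int]],
           verify: Callable[[List[int], List[int]], int],
           max_new_tokens: int) -> List[int]:
    """Generate up to max_new_tokens tokens, a block at a time."""
    out = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        block = predict_block(out)             # one forward pass -> k draft tokens
        accepted = max(verify(out, block), 1)  # keep an accepted prefix; always advance
        out.extend(block[:accepted])
        generated += accepted
    return out[:len(prompt) + max_new_tokens]

if __name__ == "__main__":
    # Toy stand-ins: the "model" continues an arithmetic pattern and the
    # verifier accepts every draft token; real use would wire in a PTP model.
    toy_model = lambda ctx: [ctx[-1] + i + 1 for i in range(4)]
    accept_all = lambda ctx, block: len(block)
    print(decode([0], toy_model, accept_all, max_new_tokens=10))
```

With `accept_all` the window advances by a full block each step; a stricter verifier would accept a shorter prefix, shrinking the stride exactly as described in the decoding bullet above.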
Results & Findings
| Model / Setting | Tokens per step (average) | Speed‑up vs. standard decoding | BLEU / ROUGE (quality) |
|---|---|---|---|
| Vicuna‑7B + PTP (distilled) | 4.2 | ~3.8× | Comparable to baseline (no degradation) |
| Vicuna‑7B + PTP (inverse) | 3.8 | ~3.5× | Slightly higher on open‑ended prompts |
| Spec‑Bench (speculative decoding) | >4 tokens/step | State‑of‑the‑art | Maintains original model’s perplexity |
Key takeaways
- Latency drops dramatically because the number of transformer calls shrinks roughly in proportion to the average number of tokens accepted per step (a back‑of‑the‑envelope check follows this list).
- Modeling power is retained – quality metrics stay on par with the original autoregressive model, confirming the theoretical claim of universality.
- Flexibility – Both training regimes work, giving practitioners the option to fine‑tune from an existing model or train from scratch.
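A quick back‑of‑the‑envelope check of the first takeaway, using the averages reported in the table (illustrative only; wall‑clock speed‑up also depends on the per‑pass cost of the wider output head):

```python
# Illustrative arithmetic only: forward passes needed to emit n_tokens when an
# average of tokens_per_step tokens is accepted per transformer call.
import math

def forward_passes(n_tokens: int, tokens_per_step: float) -> int:
    return math.ceil(n_tokens / tokens_per_step)

if __name__ == "__main__":
    n = 1024
    for rate in (1.0, 3.8, 4.2):  # standard decoding vs. the reported PTP averages
        print(f"{rate:>4} tokens/step -> {forward_passes(n, rate):4d} forward passes")
```

At 4.2 tokens per step, 1024 new tokens need about 244 transformer calls instead of 1024, which is consistent with the ~3.8× speed‑up reported above.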
Practical Implications
- Faster interactive AI – Chatbots, code assistants, and other latency‑sensitive LLM services can respond with noticeably lower delay, even on commodity GPUs, improving user experience.
- Cost savings – Fewer forward passes translate to lower compute bills for inference‑heavy workloads (e.g., batch generation of documentation or synthetic data).
- Scalable long‑form generation – Applications like story writing, report drafting, or transcript summarization benefit from reduced wall‑clock time without sacrificing coherence.
- Compatibility with existing pipelines – Since PTP only replaces the model’s output head, teams can adopt it without redesigning tokenizers, APIs, or serving infrastructure.
Limitations & Future Work
- Block size trade‑off – Larger blocks increase speed but can amplify error propagation if the joint prediction deviates early; adaptive block sizing is an open question.
- Training overhead – Distillation requires a strong teacher model and extra compute; inverse training mitigates this but may need careful hyper‑parameter tuning.
- Hardware constraints – While the method reduces the number of passes, each pass processes a larger output space, which can strain memory on very large models.
- Future directions – The authors suggest exploring dynamic block prediction, tighter integration with quantization/compression techniques, and extending PTP to multimodal generative models.
Authors
- Felix Draxler
- Justus Will
- Farrin Marouf Sofian
- Theofanis Karaletsos
- Sameer Singh
- Stephan Mandt
Paper Information
- arXiv ID: 2512.21323v1
- Categories: cs.CL, cs.LG
- Published: December 24, 2025
- PDF: https://arxiv.org/pdf/2512.21323v1