Researchers baked 3x inference speedups directly into LLM weights — without speculative decoding
Source: VentureBeat
Multi‑Token Prediction (MTP) — Boosting Throughput for Agentic AI Workflows
As agentic AI workflows multiply the cost and latency of long reasoning chains, a team from the University of Maryland, Lawrence Livermore National Laboratory, Columbia University, and Together AI has found a way to bake 3× throughput gains directly into a model’s weights. Unlike speculative decoding, which requires a separate drafting model, this approach needs no additional infrastructure—just a single special token added to the model’s existing architecture.
The Limits of Next‑Token Prediction
Next‑token prediction (NTP) – generating one token per forward pass – creates a throughput ceiling that becomes painfully expensive when models must produce thousands of tokens. This bottleneck is especially problematic for reasoning models, which frequently generate long “chain‑of‑thought” sequences before producing the final answer, leading to a slow and costly user experience.
Multi‑Token Prediction (MTP)
MTP offers an alternative training paradigm that allows a language model to produce multiple tokens simultaneously in a single forward pass. For example, the model can be trained to predict an entire block of tokens at once instead of just the immediate next token.
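As a toy sketch of the idea (not the paper’s actual architecture), a block predictor returns logits for several future positions from a single call; `toy_forward`, `VOCAB`, and the block size `K` below are illustrative assumptions:

```python
import numpy as np

VOCAB = ["the", "zookeeper", "fed", "panda", "bamboo", "lion", "meat"]
K = 3  # tokens predicted per forward pass (the block size)

def toy_forward(prefix_ids):
    """Stand-in for one forward pass that emits logits for K future
    positions at once. A real MTP model learns these heads; here the
    logits are random, seeded by the prefix for reproducibility."""
    rng = np.random.default_rng(sum(prefix_ids))
    return rng.normal(size=(K, len(VOCAB)))  # shape: (K, vocab)

def predict_block(prefix_ids):
    logits = toy_forward(prefix_ids)      # one pass, K positions
    return [int(np.argmax(row)) for row in logits]

block = predict_block([0, 1, 2])  # "the zookeeper fed" -> 3 tokens at once
print([VOCAB[t] for t in block])
```

The point of the sketch is the shape of the computation: one forward pass yields `K` token distributions instead of one, which is where the latency win comes from.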
Quote – John Kirchenbauer, doctoral candidate in computer science at the University of Maryland and co‑author of the paper (as reported by VentureBeat):
“Today, with ultra‑long thinking traces being the norm and agentic outer loops multiplying out those costs even further, latency is becoming as equally important a dimension of overall serving efficiency as gross tokens per second per hardware unit (tps/GPU).”
He added that while standard batched NTP is already optimal for overall throughput, the new approach “strives to saturate the GPU with just a single user’s query to decrease latency for that single user.”
Why Not Existing Methods?
Other efficiency‑focused techniques exist, but they come with drawbacks:
| Method | Focus | Drawbacks |
|---|---|---|
| Speculative Decoding | Latency reduction | Requires an auxiliary “drafting” model; spends extra compute to draft and verify. |
| Diffusion LLMs | Latency reduction | Also needs additional infrastructure and introduces complexity. |
Kirchenbauer: “It’s worth noting that speculative decoding and diffusion LLMs are both latency‑focused acceleration techniques. MTP leverages a similar trade‑off but is simpler to serve and scientifically interesting in its own right.”
Limitations of Current MTP Paradigms
The standard MTP training objective compares model predictions against ground‑truth text from a dataset. This approach teaches the model to predict the probability of a token at a specific position independently, ignoring the joint relationship among a sequence of tokens.
Two Major Problems
1. Grammatical Mismatch – Given the prefix “The zookeeper fed the,” a naïve MTP model might independently sample “panda meat” or “lion bamboo” instead of the coherent pairs “panda bamboo” and “lion meat.”
2. Degenerate Repetition – When predicting far‑future positions (e.g., 100 tokens ahead), the model tends to output the most common word (“the”), leading to nonsensical repetitions such as “…the the the…”.
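The mismatch problem can be made concrete with a toy joint distribution (illustrative numbers, not from the paper): if training only matches per‑position marginals, independent sampling assigns probability to pairs that never occur in the data:

```python
# Ground-truth joint over (animal, food) pairs: only coherent pairs occur.
joint = {("panda", "bamboo"): 0.5, ("lion", "meat"): 0.5}

animals = ["panda", "lion"]
foods = ["bamboo", "meat"]

# Per-position marginals, as a naive position-independent MTP objective
# would learn them.
p_animal = {a: sum(p for (x, _), p in joint.items() if x == a) for a in animals}
p_food = {f: sum(p for (_, y), p in joint.items() if y == f) for f in foods}

# Sampling each position independently puts mass on pairs the data
# never contained.
p_indep = {(a, f): p_animal[a] * p_food[f] for a in animals for f in foods}
print(p_indep[("panda", "meat")])  # 0.25 under independence, 0.0 in the data
```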
Multi‑Token Prediction via Self‑Distillation
To overcome these issues, the researchers propose a student‑teacher scheme:
- Student Model – learns to predict multiple tokens at once, generating a deterministic block.
- Teacher Model – a strong standard NTP language model that evaluates the student’s block, acting as a critic.
The teacher computes a loss based on likelihood and coherence. If the student proposes a mismatched phrase like “lion bamboo,” the teacher assigns a high loss, teaching the student to avoid such constructions.
Key Characteristics
- On‑Policy Reinforcement Learning Inspiration – The student does not memorize static text; it generates a full rollout (a sequence of actions) in parallel in a single forward pass and receives a reward from the teacher.
- Dynamic Feedback – Unlike static supervised pairs, the feedback is generated in real time from the student’s own outputs.
- Coherence Enforcement – The teacher prevents degenerate outputs (e.g., repeated words) by verifying token relationships.
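A minimal sketch of the critic signal, assuming a toy teacher with hand‑picked conditional probabilities (`teacher_cond` and the negative‑log‑likelihood loss form are illustrative, not the paper’s exact objective):

```python
import math

# Toy teacher: next-token probabilities conditioned on the chosen animal,
# for a prefix like "The zookeeper fed the". Values are illustrative.
teacher_first = {"panda": 0.5, "lion": 0.5}
teacher_cond = {
    "panda": {"bamboo": 0.9, "meat": 0.1},
    "lion":  {"bamboo": 0.1, "meat": 0.9},
}

def teacher_loss(block):
    """Negative log-likelihood of the student's proposed block under the
    teacher: coherent blocks score low, mismatched blocks score high."""
    animal, food = block
    return -(math.log(teacher_first[animal])
             + math.log(teacher_cond[animal][food]))

coherent = teacher_loss(["lion", "meat"])      # low loss
mismatched = teacher_loss(["lion", "bamboo"])  # high loss
print(coherent, mismatched)
```

Because the loss scores the block jointly rather than position by position, the student is pushed away from incoherent combinations like “lion bamboo.”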
Simplicity for Developers
“There are truly no modifications to the architecture except for the addition of a special token,” says Kirchenbauer.
- An unused slot in the model’s embedding matrix is repurposed as a “mask” token.
- This converts sequential operations into parallel ones without touching internal components (MoE, windowed attention, SSM layers, etc.).
- Any standard next‑token prediction language model can be adapted; pipelines remain untouched, enabling seamless production deployment.
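The input‑side change can be sketched as appending copies of the repurposed mask token so that a single forward pass covers several future positions (`MASK_ID` below is a hypothetical embedding slot, not the paper’s actual token id):

```python
MASK_ID = 999  # hypothetical: an unused slot in the embedding matrix

def build_mtp_input(prompt_ids, k):
    """Build the input for one parallel forward pass: the prompt followed
    by k mask tokens, one per future position to predict. No architectural
    change is needed - only this extra token in the vocabulary."""
    return prompt_ids + [MASK_ID] * k

inp = build_mtp_input([11, 42, 7], k=4)
print(inp)  # [11, 42, 7, 999, 999, 999, 999]
```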
Adaptive Decoding: ConfAdapt
Generating multiple tokens simultaneously can still hurt accuracy at inference time. To balance speed and quality, the authors introduce ConfAdapt, an adaptive decoding strategy:
- Confidence Threshold – e.g., 90%.
- Block Generation – The model produces a token block.
- Selective Acceptance – Only tokens meeting or exceeding the confidence threshold are kept; lower‑confidence tokens are discarded or regenerated.
ConfAdapt thus maximizes generation speed without sacrificing output quality.
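One plausible reading of the acceptance step is a sketch like the following, assuming prefix‑wise acceptance (the exact rule in ConfAdapt may differ):

```python
import numpy as np

def accept_block(probs, threshold=0.9):
    """Keep the longest prefix of block tokens whose top probability
    clears the threshold; remaining positions are left for later passes.
    `probs` is a (K, vocab) array of per-position token probabilities."""
    accepted = []
    for row in probs:
        tok = int(np.argmax(row))
        if row[tok] < threshold:
            break  # low confidence: stop here and regenerate
        accepted.append(tok)
    return accepted

probs = np.array([[0.97, 0.03],   # confident -> accept token 0
                  [0.95, 0.05],   # confident -> accept token 0
                  [0.60, 0.40]])  # uncertain -> rejected
print(accept_block(probs))  # [0, 0]
```

On predictable text most of the block clears the threshold and many tokens are emitted per pass; on uncertain text the method degrades gracefully toward one token per pass.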
Takeaways
- MTP with self‑distillation provides a simple, infrastructure‑light path to 3× throughput gains.
- By adding a single “mask” token, existing models can be transformed to predict multiple tokens in parallel.
- ConfAdapt ensures that the speed boost does not come at the cost of degraded responses.
This approach promises lower latency for single‑user queries while preserving the overall efficiency of large‑scale language‑model serving.
Putting Multi‑Token Prediction to the Test
To see how the training paradigm performed in practice, the researchers applied their method to popular open‑weight instruction‑tuned models. They tested:
- Llama‑3.1‑8B‑Magpie – a strong general‑purpose model.
- Qwen3‑4B‑Instruct‑2507 – a smaller, efficient model often chosen for cost‑sensitive enterprise deployments.
Both models were fine‑tuned on MetaMathQA, a dataset of synthetic grade‑school math problems that rely heavily on reasoning traces.
Results
The fine‑tuned Llama‑3.1‑8B‑Magpie model achieved roughly a 3× generation speed‑up.
“As the ConfAdapt approach naturally tailors the acceleration to the inherent entropy in the domain, when the model knows exactly what comes next it can emit it in a single pass,” says Kirchenbauer.
- Predictable tasks (low entropy) → massive acceleration because the model can output many tokens at once.
- Uncertain outputs → the model falls back to more single‑token passes, preserving quality.
Transfer Across Domains
The speedups transferred to domains not seen during multi‑token prediction training, including:
- Math and reasoning (same domain as training data)
- Open‑ended tasks such as creative writing and summarization
Recommendations for Enterprises
- Do not rely solely on transfer learning.
- “Our recommendation would be to tune/adapt the model for MTP using samples from the specific industrial domain,” says Kirchenbauer.
- Best performance is achieved when MTP adaptation uses prompts from the deployment domain.
Serving Compatibility and the Road Ahead
- The research team has released the trained models on Hugging Face and will soon open‑source the MTP framework code.
- Integration with inference stacks like vLLM or SGLang will require adjustments to batching and KV‑caching logic—a one‑time engineering effort, not an ongoing burden.
- Kirchenbauer notes “no clear barriers to integration” and mentions ongoing collaboration with systems experts to find the shortest path to deployment.
Practical Advice for Teams
- Start with toy prompts (e.g., counting or repeating a phrase) to observe ConfAdapt’s gains.
- Adapt the model using samples from your specific deployment domain for optimal results.
“Overall we do expect that a production‑ready implementation of our approach could simplify the lifecycle of building and deploying low‑latency agentic models,” concludes Kirchenbauer. “While existing acceleration techniques for NTP models focus almost solely on inference harnesses and logic, our approach bakes some of the complexity into the model itself, making it largely complementary to existing work.”