[Paper] AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs

Published: January 9, 2026 at 01:58 PM EST
4 min read

Source: arXiv - 2601.06022v1

Overview

AdaFuse tackles a practical pain point for anyone deploying large language models (LLMs): how to get the best out of multiple models without costly retraining. By dynamically deciding when and how to fuse the outputs of several LLMs during inference, AdaFuse boosts answer quality across tasks such as QA, reasoning, and translation while keeping the inference pipeline lightweight.

Key Contributions

  • Adaptive fusion granularity – instead of a static token‑level or sentence‑level merge, AdaFuse decides at each decoding step whether to fuse, based on the model’s confidence.
  • Uncertainty‑driven decision rule – introduces a simple, compute‑friendly metric that flags “uncertain” decoding states and triggers ensemble processing only when needed (see the sketch after this list).
  • Test‑time scaling with diversity awareness – when uncertainty is high, the framework expands the candidate pool (via temperature scaling or top‑k sampling) to explore diverse continuations before fusing.
  • Synergistic loop – the diversity generated by scaling feeds back into better ensemble decisions, creating a virtuous cycle that improves final outputs.
  • Empirical gains – consistent average improvement of ~6.9 % over strong baselines on open‑domain QA, arithmetic reasoning, and machine translation benchmarks.
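
The paper's exact gating metric isn't reproduced in this summary, so the snippet below is only a minimal sketch of the two signals mentioned here and in the methodology (entropy of the next‑token distribution and the margin between the top probabilities); the helper names and the threshold value are illustrative, not from the paper.

```python
import numpy as np

def entropy_uncertainty(probs: np.ndarray) -> float:
    """Shannon entropy of a next-token distribution; higher means more uncertain."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def margin_uncertainty(probs: np.ndarray) -> float:
    """One minus the gap between the two most likely tokens; higher means more uncertain."""
    top2 = np.sort(probs)[-2:]           # [second-largest, largest]
    return float(1.0 - (top2[1] - top2[0]))

def should_fuse(probs: np.ndarray, threshold: float = 2.0) -> bool:
    """Trigger ensemble fusion only when the decoding state looks uncertain."""
    return entropy_uncertainty(probs) > threshold
```

A sharply peaked distribution (one dominant token) falls below the threshold and skips fusion; a flat distribution over many plausible continuations exceeds it and routes the step through the ensemble path.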

Methodology

  1. Input & Model Pool – A set of pre‑trained LLMs (different architectures, data, or checkpoints) is kept unchanged.
  2. Step‑wise Decoding – For each token position, each model proposes its next‑token distribution.
  3. Confidence Estimation – AdaFuse computes an uncertainty score (e.g., entropy or margin between top‑k probabilities).
  4. Decision Branch
    • Low uncertainty → pick the token from the most confident model and continue without extra work.
    • High uncertainty → invoke test‑time scaling: increase temperature or sample a larger top‑k set to generate a richer candidate list.
  5. Adaptive Fusion – The candidate lists from all models are aligned at the word level, then combined using a weighted voting scheme that respects the diversity introduced in step 4.
  6. Iterate – The process repeats for the next token, allowing the fusion granularity to shift dynamically throughout the whole generation.

The whole pipeline is implemented as a thin wrapper around existing generation APIs, so it can be dropped into production with minimal code changes.
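
No reference code accompanies this summary, so the following is a rough reconstruction of steps 2–6 above, assuming a pool of Hugging Face causal LMs that share a tokenizer (`gpt2` and `distilgpt2` as stand‑ins), entropy as the uncertainty score, and plain probability averaging in place of the paper's diversity‑aware weighted vote.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model pool; AdaFuse leaves the checkpoints themselves unchanged.
MODEL_NAMES = ["gpt2", "distilgpt2"]          # stand-ins that share the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAMES[0])
models = [AutoModelForCausalLM.from_pretrained(n).eval() for n in MODEL_NAMES]

def entropy(p: torch.Tensor) -> float:
    return float(-(p * p.clamp_min(1e-12).log()).sum())

@torch.no_grad()
def adafuse_generate(prompt: str, max_new_tokens: int = 32,
                     threshold: float = 2.0, hot_temperature: float = 1.5) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Step 2: every model proposes a next-token distribution.
        dists = [torch.softmax(m(ids).logits[0, -1], dim=-1) for m in models]

        # Step 3: confidence estimation via the most confident (lowest-entropy) model.
        best = min(dists, key=entropy)
        if entropy(best) <= threshold:
            # Step 4a: low uncertainty -> take that model's token, no fusion.
            next_id = int(best.argmax())
        else:
            # Step 4b: high uncertainty -> test-time scaling (flatten with a higher
            # temperature to widen the candidate pool), then Step 5: fuse by averaging.
            scaled = [torch.softmax(torch.log(d.clamp_min(1e-12)) / hot_temperature, dim=-1)
                      for d in dists]
            fused = torch.stack(scaled).mean(dim=0)
            next_id = int(torch.multinomial(fused, num_samples=1))

        # Step 6: append the chosen token and repeat.
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
        if next_id == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

A real implementation would also skip redundant forward passes on confident steps and align multi‑token candidates before voting; the skeleton only shows where the uncertainty gate and the test‑time scaling slot into an ordinary decoding loop.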

Results & Findings

| Task | Baseline (static ensemble) | AdaFuse | Relative Gain |
|---|---|---|---|
| Open‑domain QA (TriviaQA) | 78.4 % EM | 84.2 % EM | +7.4 % |
| Arithmetic Reasoning (GSM‑8K) | 62.1 % Acc | 68.5 % Acc | +6.3 % |
| Machine Translation (WMT‑En‑De) | 29.8 BLEU | 31.9 BLEU | +7.0 % |

Key takeaways

  • Selective ensembling saves compute (≈30 % fewer forward passes) because many tokens are generated without fusion.
  • Diversity‑aware scaling prevents the ensemble from collapsing to the same dominant hypothesis, especially on ambiguous or multi‑step problems.
  • The approach works across very different downstream tasks, indicating that the uncertainty signal is robust.

Practical Implications

  • Cost‑effective performance boost – Developers can improve LLM outputs without training a larger model or fine‑tuning ensembles; the extra inference overhead is only incurred on “hard” tokens.
  • Plug‑and‑play integration – Since AdaFuse works at the decoding level, it can be added to existing inference services (e.g., OpenAI API wrappers, Hugging Face pipelines) with a few lines of code.
  • Dynamic resource allocation – In latency‑sensitive environments, the uncertainty threshold can be tuned to trade off speed vs. quality, enabling adaptive throttling based on SLA requirements (see the sketch after this list).
  • Better handling of edge cases – Tasks that involve multi‑step reasoning or rare vocabularies benefit from the extra exploration, reducing hallucinations and improving factuality.
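
The numbers and the `fusion_rate` helper below are illustrative, not from the paper; the point is that the threshold is a single scalar knob that trades the fraction of expensive fused steps against quality, which is what makes SLA‑driven throttling straightforward.

```python
import numpy as np

def fusion_rate(step_entropies: np.ndarray, threshold: float) -> float:
    """Fraction of decoding steps whose uncertainty would trigger ensemble fusion."""
    return float((step_entropies > threshold).mean())

# Per-step entropies logged on a small dev set (made-up values for illustration).
step_entropies = np.array([0.3, 2.8, 1.1, 3.5, 0.7, 2.2])
for threshold in (1.0, 2.0, 3.0):
    print(f"threshold={threshold}: fuse on {fusion_rate(step_entropies, threshold):.0%} of steps")
```

Raising the threshold lowers the fusion rate (faster, cheaper) at the cost of skipping fusion on some genuinely hard tokens; sweeping it on held‑out data is a cheap way to pick the operating point.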

Limitations & Future Work

  • Threshold sensitivity – The uncertainty cutoff needs empirical tuning per task; a suboptimal setting can either waste compute or miss improvements.
  • Scalability with many models – While AdaFuse reduces unnecessary fusion, the worst‑case scenario still requires running all models in parallel for highly uncertain tokens, which may strain GPU memory.
  • Diversity metric simplicity – Current scaling relies on temperature/top‑k; more sophisticated diversity‑promoting samplers (e.g., nucleus sampling with entropy regularization) could further enhance performance.
  • Broader evaluation – Future work could explore code generation, dialog systems, and multimodal LLMs, as well as automated methods for learning the uncertainty threshold.

AdaFuse opens a pragmatic path for developers to squeeze extra performance out of their existing LLM fleets, turning inference-time uncertainty into a lever for smarter, cheaper ensembling.

Authors

  • Chengming Cui
  • Tianxin Wei
  • Ziyi Chen
  • Ruizhong Qiu
  • Zhichen Zeng
  • Zhining Liu
  • Xuying Ning
  • Duo Zhou
  • Jingrui He

Paper Information

  • arXiv ID: 2601.06022v1
  • Categories: cs.CL, cs.AI
  • Published: January 9, 2026