[Paper] AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs

Published: January 9, 2026 at 01:58 PM EST
4 min read

Source: arXiv - 2601.06022v1

Overview

AdaFuse tackles a practical pain point for anyone deploying large language models (LLMs): how to get the best out of multiple models without costly retraining. By dynamically deciding when and how to fuse the outputs of several LLMs during inference, AdaFuse boosts answer quality across tasks such as QA, reasoning, and translation while keeping the inference pipeline lightweight.

Key Contributions

  • Adaptive fusion granularity – instead of a static token‑level or sentence‑level merge, AdaFuse decides at each decoding step whether to fuse, based on the model’s confidence.
  • Uncertainty‑driven decision rule – introduces a simple, compute‑friendly metric that flags “uncertain” decoding states and triggers ensemble processing only when needed (see the sketch after this list).
  • Test‑time scaling with diversity awareness – when uncertainty is high, the framework expands the candidate pool (via temperature scaling or top‑k sampling) to explore diverse continuations before fusing.
  • Synergistic loop – the diversity generated by scaling feeds back into better ensemble decisions, creating a virtuous cycle that improves final outputs.
  • Empirical gains – consistent average improvement of ~6.9 % over strong baselines on open‑domain QA, arithmetic reasoning, and machine translation benchmarks.
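
The paper's exact gating metric isn't reproduced in this summary, so the snippet below is only a minimal sketch of the two signals mentioned here and in the methodology (entropy of the next‑token distribution and the margin between the top probabilities); the helper names and the threshold value are illustrative, not from the paper.

```python
import numpy as np

def entropy_uncertainty(probs: np.ndarray) -> float:
    """Shannon entropy of a next-token distribution; higher means more uncertain."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def margin_uncertainty(probs: np.ndarray) -> float:
    """One minus the gap between the two most likely tokens; higher means more uncertain."""
    top2 = np.sort(probs)[-2:]           # [second-largest, largest]
    return float(1.0 - (top2[1] - top2[0]))

def should_fuse(probs: np.ndarray, threshold: float = 2.0) -> bool:
    """Trigger ensemble fusion only when the decoding state looks uncertain."""
    return entropy_uncertainty(probs) > threshold
```

A sharply peaked distribution (one dominant token) falls below the threshold and skips fusion; a flat distribution over many plausible continuations exceeds it and routes the step through the ensemble path.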

Methodology

  1. Input & Model Pool – A set of pre‑trained LLMs (different architectures, data, or checkpoints) is kept unchanged.
  2. Step‑wise Decoding – For each token position, each model proposes its next‑token distribution.
  3. Confidence Estimation – AdaFuse computes an uncertainty score (e.g., entropy or margin between top‑k probabilities).
  4. Decision Branch
    • Low uncertainty → pick the token from the most confident model and continue without extra work.
    • High uncertainty → invoke test‑time scaling: increase temperature or sample a larger top‑k set to generate a richer candidate list.
  5. Adaptive Fusion – The candidate lists from all models are aligned at the word level, then combined using a weighted voting scheme that respects the diversity introduced in step 4.
  6. Iterate – The process repeats for the next token, allowing the fusion granularity to shift dynamically throughout the whole generation.

The whole pipeline is implemented as a thin wrapper around existing generation APIs, so it can be dropped into production with minimal code changes.
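
No reference code accompanies this summary, so the following is a rough reconstruction of steps 2–6 above, assuming a pool of Hugging Face causal LMs that share a tokenizer (`gpt2` and `distilgpt2` as stand‑ins), entropy as the uncertainty score, and plain probability averaging in place of the paper's diversity‑aware weighted vote.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model pool; AdaFuse leaves the checkpoints themselves unchanged.
MODEL_NAMES = ["gpt2", "distilgpt2"]          # stand-ins that share the GPT-2 tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAMES[0])
models = [AutoModelForCausalLM.from_pretrained(n).eval() for n in MODEL_NAMES]

def entropy(p: torch.Tensor) -> float:
    return float(-(p * p.clamp_min(1e-12).log()).sum())

@torch.no_grad()
def adafuse_generate(prompt: str, max_new_tokens: int = 32,
                     threshold: float = 2.0, hot_temperature: float = 1.5) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Step 2: every model proposes a next-token distribution.
        dists = [torch.softmax(m(ids).logits[0, -1], dim=-1) for m in models]

        # Step 3: confidence estimation via the most confident (lowest-entropy) model.
        best = min(dists, key=entropy)
        if entropy(best) <= threshold:
            # Step 4a: low uncertainty -> take that model's token, no fusion.
            next_id = int(best.argmax())
        else:
            # Step 4b: high uncertainty -> test-time scaling (flatten with a higher
            # temperature to widen the candidate pool), then Step 5: fuse by averaging.
            scaled = [torch.softmax(torch.log(d.clamp_min(1e-12)) / hot_temperature, dim=-1)
                      for d in dists]
            fused = torch.stack(scaled).mean(dim=0)
            next_id = int(torch.multinomial(fused, num_samples=1))

        # Step 6: append the chosen token and repeat.
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
        if next_id == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

A real implementation would also skip redundant forward passes on confident steps and align multi‑token candidates before voting; the skeleton only shows where the uncertainty gate and the test‑time scaling slot into an ordinary decoding loop.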

Results & Findings

| Task | Baseline (static ensemble) | AdaFuse | Relative Gain |
|---|---|---|---|
| Open‑domain QA (TriviaQA) | 78.4 % EM | 84.2 % EM | +7.4 % |
| Arithmetic Reasoning (GSM‑8K) | 62.1 % Acc | 68.5 % Acc | +6.3 % |
| Machine Translation (WMT‑En‑De) | 29.8 BLEU | 31.9 BLEU | +7.0 % |

Key takeaways

  • Selective ensembling saves compute (≈30 % fewer forward passes) because many tokens are generated without fusion.
  • Diversity‑aware scaling prevents the ensemble from collapsing to the same dominant hypothesis, especially on ambiguous or multi‑step problems.
  • The approach works across very different downstream tasks, indicating that the uncertainty signal is robust.

Practical Implications

  • Cost‑effective performance boost – Developers can improve LLM outputs without training a larger model or fine‑tuning ensembles; the extra inference overhead is only incurred on “hard” tokens.
  • Plug‑and‑play integration – Since AdaFuse works at the decoding level, it can be added to existing inference services (e.g., OpenAI API wrappers, Hugging Face pipelines) with a few lines of code.
  • Dynamic resource allocation – In latency‑sensitive environments, the uncertainty threshold can be tuned to trade off speed vs. quality, enabling adaptive throttling based on SLA requirements (see the sketch after this list).
  • Better handling of edge cases – Tasks that involve multi‑step reasoning or rare vocabularies benefit from the extra exploration, reducing hallucinations and improving factuality.
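
The numbers and the `fusion_rate` helper below are illustrative, not from the paper; the point is that the threshold is a single scalar knob that trades the fraction of expensive fused steps against quality, which is what makes SLA‑driven throttling straightforward.

```python
import numpy as np

def fusion_rate(step_entropies: np.ndarray, threshold: float) -> float:
    """Fraction of decoding steps whose uncertainty would trigger ensemble fusion."""
    return float((step_entropies > threshold).mean())

# Per-step entropies logged on a small dev set (made-up values for illustration).
step_entropies = np.array([0.3, 2.8, 1.1, 3.5, 0.7, 2.2])
for threshold in (1.0, 2.0, 3.0):
    print(f"threshold={threshold}: fuse on {fusion_rate(step_entropies, threshold):.0%} of steps")
```

Raising the threshold lowers the fusion rate (faster, cheaper) at the cost of skipping fusion on some genuinely hard tokens; sweeping it on held‑out data is a cheap way to pick the operating point.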

Limitations & Future Work

  • Threshold sensitivity – The uncertainty cutoff needs empirical tuning per task; a suboptimal setting can either waste compute or miss improvements.
  • Scalability with many models – While AdaFuse reduces unnecessary fusion, the worst‑case scenario still requires running all models in parallel for highly uncertain tokens, which may strain GPU memory.
  • Diversity metric simplicity – Current scaling relies on temperature/top‑k; more sophisticated diversity‑promoting samplers (e.g., nucleus sampling with entropy regularization) could further enhance performance.
  • Broader evaluation – Future work could explore code generation, dialog systems, and multimodal LLMs, as well as automated methods for learning the uncertainty threshold.

AdaFuse opens a pragmatic path for developers to squeeze extra performance out of their existing LLM fleets, turning inference-time uncertainty into a lever for smarter, cheaper ensembling.

Authors

  • Chengming Cui
  • Tianxin Wei
  • Ziyi Chen
  • Ruizhong Qiu
  • Zhichen Zeng
  • Zhining Liu
  • Xuying Ning
  • Duo Zhou
  • Jingrui He

Paper Information

  • arXiv ID: 2601.06022v1
  • Categories: cs.CL, cs.AI
  • Published: January 9, 2026