[Paper] Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning

Published: December 2, 2025 at 10:35 AM EST
3 min read
Source: arXiv - 2512.02874v1

Overview

The paper “Think in Parallel, Answer as One: Logit Averaging for Open‑Ended Reasoning” proposes ThinkMerge, a simple yet powerful decoding technique that lets large language models (LLMs) run multiple reasoning paths in parallel and merge their predictions on‑the‑fly. By averaging the next‑token logits at strategic synchronization points, ThinkMerge produces a single, coherent answer without the need for post‑hoc majority voting—making it especially useful for open‑ended tasks like code generation and web‑based research agents.

Key Contributions

  • ThinkMerge algorithm: a training‑free, plug‑and‑play method that averages logits from K parallel decoding streams at synchronization steps.
  • Compatibility: works with popular inference engines (vLLM, SGLang) and standard sampling strategies (Top‑p, Top‑k).
  • Empirical gains: matches or exceeds traditional majority voting on closed‑ended benchmarks (AIME, GPQA) and delivers +7‑8 % absolute improvements in pass@1 on hard coding benchmarks (LiveCodeBench) for models such as DeepCoder‑14B‑Preview and Qwen3‑8B.
  • Broader impact: boosts performance of web‑search/research agents (WebSailor‑7B/32B) across GAIA, BrowseComp‑en/zh, and XbenchDeepSearch datasets.
  • No extra training: the method can be applied at test time, requiring only modest additional compute (running K parallel traces).

Methodology

  1. Parallel Decoding: The model generates K independent token streams in parallel, each following the same decoding hyper‑parameters (e.g., temperature, top‑p).
  2. Synchronization Points: At predefined intervals (e.g., after every token, after a sentence, or after a logical sub‑step), the K streams pause.
  3. Logit Averaging: The next‑token logits from each stream are summed and averaged, producing a single probability distribution.
  4. Unified Sampling: A token is sampled once from this merged distribution, and the same token is injected back into all K streams, keeping them synchronized.
  5. Iterate: Steps 2‑4 repeat until the generation finishes.

Because the merging happens before a token is emitted, the final output is a single coherent sequence rather than a set of competing answers that need to be voted on later.
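
A minimal sketch of this loop, assuming a Hugging Face causal LM, could look as follows. The function name `thinkmerge_generate`, the fixed `sync_interval`, and the choice to return the first stream's text are illustrative simplifications, not details from the paper.

```python
# Toy ThinkMerge-style decoding loop (a simplification of the description
# above, not the authors' released code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def thinkmerge_generate(prompt, k=4, max_new_tokens=64, sync_interval=8,
                        temperature=0.8, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    # Step 1: K parallel streams, all starting from the same prompt.
    ids = tok(prompt, return_tensors="pt").input_ids          # (1, prompt_len)
    streams = ids.repeat(k, 1)                                # (k, prompt_len)

    with torch.no_grad():
        for step in range(max_new_tokens):
            # Next-token logits for every stream (full re-encode for clarity;
            # a real implementation would keep per-stream KV caches).
            logits = model(streams).logits[:, -1, :]          # (k, vocab)

            if (step + 1) % sync_interval == 0:
                # Steps 2-4: synchronization point. Average the k logit
                # vectors, sample ONE token, and inject it into every stream.
                merged = logits.mean(dim=0, keepdim=True)     # (1, vocab)
                probs = torch.softmax(merged / temperature, dim=-1)
                next_ids = torch.multinomial(probs, 1).repeat(k, 1)
            else:
                # Between sync points each stream samples independently,
                # so the k reasoning paths can diverge.
                probs = torch.softmax(logits / temperature, dim=-1)
                next_ids = torch.multinomial(probs, 1)        # (k, 1)

            streams = torch.cat([streams, next_ids], dim=1)   # step 5: iterate

    # Simplification: return one stream's text; the paper may assemble the
    # single final answer differently.
    return tok.decode(streams[0], skip_special_tokens=True)
```

In this sketch, `sync_interval=1` means every token is a merged token and the streams never diverge, while an interval larger than `max_new_tokens` degenerates to ordinary independent sampling; the synchronization granularity trade-off discussed later sits between these extremes.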

Results & Findings

| Task | Model | Baseline (single-trace) | Majority Voting | ThinkMerge |
|------|-------|-------------------------|-----------------|------------|
| AIME (closed-ended) | GPT-4 | 78.4 % | 80.1 % | 80.3 % |
| GPQA (closed-ended) | LLaMA-2-13B | 62.7 % | 64.0 % | 64.2 % |
| LiveCodeBench (hard) | DeepCoder-14B-Preview | 31.5 % (pass@1) | 38.2 % | 39.8 % |
| LiveCodeBench (hard) | Qwen3-8B | 28.9 % | 35.6 % | 36.5 % |
| GAIA | WebSailor-7B | 45.1 % | 48.3 % | 49.0 % |

  • ThinkMerge consistently matches or slightly outperforms majority voting on closed‑ended QA.
  • The biggest wins appear on open‑ended generation (coding, web‑search), where voting over whole solutions is ill‑defined.
  • The method scales linearly with K (e.g., 4‑way parallelism ≈ 4× inference cost) but the performance boost often justifies the extra compute for high‑stakes applications.

Practical Implications

  • Developer tooling: IDE plugins or CI pipelines that rely on LLM‑generated code can adopt ThinkMerge to reduce flaky completions without retraining models.
  • Enterprise agents: Customer‑support bots, knowledge‑base search agents, and autonomous web‑scraping assistants can achieve higher reliability by running a few parallel traces and merging them on‑the‑fly.
  • Cost‑effective scaling: For teams that already provision GPU clusters for inference, ThinkMerge leverages existing hardware (parallel streams on the same GPU) and integrates with existing serving stacks (vLLM, SGLang).
  • Safety & consistency: Averaging logits tends to dampen extreme token probabilities, which can reduce hallucinations and toxic outputs—a useful side‑effect for safety‑critical deployments.
  • Plug‑and‑play: No model fine‑tuning or data‑centric changes are required; a single configuration flag can enable ThinkMerge in production services (a toy usage sketch follows this list).
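
As a rough illustration of the last point, enabling merged decoding in a tool could be as simple as calling the hypothetical `thinkmerge_generate` sketch from the Methodology section; the prompt and parameter values below are made up for illustration.

```python
# Hypothetical usage of the thinkmerge_generate sketch above; this is not a
# real vLLM/SGLang flag or API, just the toy function defined earlier.
completion = thinkmerge_generate(
    "Write a Python function that parses an ISO-8601 date string:",
    k=4,                  # number of parallel reasoning streams
    max_new_tokens=128,
    sync_interval=16,     # merge logits every 16 tokens
)
print(completion)
```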

Limitations & Future Work

  • Compute overhead: Running K parallel traces multiplies inference cost, which may be prohibitive for latency‑sensitive applications.
  • Synchronization granularity: Choosing optimal sync points is non‑trivial; too frequent merging can diminish diversity, while too sparse merging may miss the benefits.
  • Model‑specific behavior: The gains vary across model families; some smaller models show marginal improvement, suggesting a ceiling effect.
  • Future directions: The authors propose adaptive K (dynamically adjusting the number of parallel streams), smarter sync heuristics based on token entropy (a toy version is sketched below), and exploring logit‑averaging during fine‑tuning to reduce runtime overhead.
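
One of these directions, entropy-based synchronization, could plausibly be prototyped as a small check that replaces the fixed `sync_interval` test in the earlier sketch. The function below is an illustration of that idea, not the authors' method: merge only when the streams' next-token distributions are, on average, uncertain.

```python
import torch

def should_sync(logits, threshold=3.0):
    """Hypothetical entropy-based sync test (illustrative, not from the paper).

    logits: (k, vocab) next-token logits, one row per parallel stream.
    Returns True when the streams' average next-token entropy (in nats)
    exceeds the threshold, i.e., when the model is uncertain enough that
    merging the k distributions is likely to help.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)  # (k,)
    return entropy.mean().item() > threshold
```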

ThinkMerge demonstrates that a modest amount of parallelism at inference time can unlock noticeable performance lifts for open‑ended reasoning tasks—offering a practical, low‑friction upgrade path for developers building next‑generation AI assistants and code‑generation tools.

Authors

  • Haonan Wang
  • Chao Du
  • Kenji Kawaguchi
  • Tianyu Pang

Paper Information

  • arXiv ID: 2512.02874v1
  • Categories: cs.CL
  • Published: December 2, 2025