[Paper] Think in Parallel, Answer as One: Logit Averaging for Open-Ended Reasoning

Published: December 2, 2025 at 10:35 AM EST
3 min read
Source: arXiv - 2512.02874v1

Overview

The paper “Think in Parallel, Answer as One: Logit Averaging for Open‑Ended Reasoning” proposes ThinkMerge, a simple yet powerful decoding technique that lets large language models (LLMs) run multiple reasoning paths in parallel and merge their predictions on‑the‑fly. By averaging the next‑token logits at strategic synchronization points, ThinkMerge produces a single, coherent answer without the need for post‑hoc majority voting—making it especially useful for open‑ended tasks like code generation and web‑based research agents.

Key Contributions

  • ThinkMerge algorithm: a training‑free, plug‑and‑play method that averages logits from K parallel decoding streams at synchronization steps.
  • Compatibility: works with popular inference engines (vLLM, SGLang) and standard sampling strategies (Top‑p, Top‑k).
  • Empirical gains: matches or exceeds traditional majority voting on closed‑ended benchmarks (AIME, GPQA) and delivers +7‑8 % absolute improvements in pass@1 on hard coding benchmarks (LiveCodeBench) for models such as DeepCoder‑14B‑Preview and Qwen3‑8B.
  • Broader impact: boosts performance of web‑search/research agents (WebSailor‑7B/32B) across GAIA, BrowseComp‑en/zh, and XbenchDeepSearch datasets.
  • No extra training: the method can be applied at test time, requiring only modest additional compute (running K parallel traces).

Methodology

  1. Parallel Decoding: The model generates K independent token streams in parallel, each following the same decoding hyper‑parameters (e.g., temperature, top‑p).
  2. Synchronization Points: At predefined intervals (e.g., after every token, after a sentence, or after a logical sub‑step), the K streams pause.
  3. Logit Averaging: The next‑token logits from each stream are summed and averaged, producing a single probability distribution.
  4. Unified Sampling: A token is sampled once from this merged distribution, and the same token is injected back into all K streams, keeping them synchronized.
  5. Iterate: Steps 2‑4 repeat until the generation finishes.

Because the merging happens before a token is emitted, the final output is a single coherent sequence rather than a set of competing answers that need to be voted on later.
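
A minimal sketch of this loop, assuming a Hugging Face causal LM, could look as follows. The function name `thinkmerge_generate`, the fixed `sync_interval`, and the choice to return the first stream's text are illustrative simplifications, not details from the paper.

```python
# Toy ThinkMerge-style decoding loop (a simplification of the description
# above, not the authors' released code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def thinkmerge_generate(prompt, k=4, max_new_tokens=64, sync_interval=8,
                        temperature=0.8, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    # Step 1: K parallel streams, all starting from the same prompt.
    ids = tok(prompt, return_tensors="pt").input_ids          # (1, prompt_len)
    streams = ids.repeat(k, 1)                                # (k, prompt_len)

    with torch.no_grad():
        for step in range(max_new_tokens):
            # Next-token logits for every stream (full re-encode for clarity;
            # a real implementation would keep per-stream KV caches).
            logits = model(streams).logits[:, -1, :]          # (k, vocab)

            if (step + 1) % sync_interval == 0:
                # Steps 2-4: synchronization point. Average the k logit
                # vectors, sample ONE token, and inject it into every stream.
                merged = logits.mean(dim=0, keepdim=True)     # (1, vocab)
                probs = torch.softmax(merged / temperature, dim=-1)
                next_ids = torch.multinomial(probs, 1).repeat(k, 1)
            else:
                # Between sync points each stream samples independently,
                # so the k reasoning paths can diverge.
                probs = torch.softmax(logits / temperature, dim=-1)
                next_ids = torch.multinomial(probs, 1)        # (k, 1)

            streams = torch.cat([streams, next_ids], dim=1)   # step 5: iterate

    # Simplification: return one stream's text; the paper may assemble the
    # single final answer differently.
    return tok.decode(streams[0], skip_special_tokens=True)
```

In this sketch, `sync_interval=1` means every token is a merged token and the streams never diverge, while an interval larger than `max_new_tokens` degenerates to ordinary independent sampling; the synchronization granularity trade-off discussed later sits between these extremes.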

Results & Findings

| Task | Model | Baseline (single-trace) | Majority Voting | ThinkMerge |
|------|-------|-------------------------|-----------------|------------|
| AIME (closed-ended) | GPT-4 | 78.4 % | 80.1 % | 80.3 % |
| GPQA (closed-ended) | LLaMA-2-13B | 62.7 % | 64.0 % | 64.2 % |
| LiveCodeBench (hard) | DeepCoder-14B-Preview | 31.5 % (pass@1) | 38.2 % | 39.8 % |
| LiveCodeBench (hard) | Qwen3-8B | 28.9 % | 35.6 % | 36.5 % |
| GAIA | WebSailor-7B | 45.1 % | 48.3 % | 49.0 % |

  • ThinkMerge consistently matches or slightly outperforms majority voting on closed‑ended QA.
  • The biggest wins appear on open‑ended generation (coding, web‑search), where voting over whole solutions is ill‑defined.
  • The method scales linearly with K (e.g., 4‑way parallelism ≈ 4× inference cost) but the performance boost often justifies the extra compute for high‑stakes applications.

Practical Implications

  • Developer tooling: IDE plugins or CI pipelines that rely on LLM‑generated code can adopt ThinkMerge to reduce flaky completions without retraining models.
  • Enterprise agents: Customer‑support bots, knowledge‑base search agents, and autonomous web‑scraping assistants can achieve higher reliability by running a few parallel traces and merging them on‑the‑fly.
  • Cost‑effective scaling: For teams that already provision GPU clusters for inference, ThinkMerge leverages existing hardware (parallel streams on the same GPU) and integrates with existing serving stacks (vLLM, SGLang).
  • Safety & consistency: Averaging logits tends to dampen extreme token probabilities, which can reduce hallucinations and toxic outputs—a useful side‑effect for safety‑critical deployments.
  • Plug‑and‑play: No model fine‑tuning or data‑centric changes are required; a single configuration flag can enable ThinkMerge in production services (a toy usage sketch follows this list).
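
As a rough illustration of the last point, enabling merged decoding in a tool could be as simple as calling the hypothetical `thinkmerge_generate` sketch from the Methodology section; the prompt and parameter values below are made up for illustration.

```python
# Hypothetical usage of the thinkmerge_generate sketch above; this is not a
# real vLLM/SGLang flag or API, just the toy function defined earlier.
completion = thinkmerge_generate(
    "Write a Python function that parses an ISO-8601 date string:",
    k=4,                  # number of parallel reasoning streams
    max_new_tokens=128,
    sync_interval=16,     # merge logits every 16 tokens
)
print(completion)
```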

Limitations & Future Work

  • Compute overhead: Running K parallel traces multiplies inference cost, which may be prohibitive for latency‑sensitive applications.
  • Synchronization granularity: Choosing optimal sync points is non‑trivial; too frequent merging can diminish diversity, while too sparse merging may miss the benefits.
  • Model‑specific behavior: The gains vary across model families; some smaller models show marginal improvement, suggesting a ceiling effect.
  • Future directions: The authors propose adaptive K (dynamically adjusting the number of parallel streams), smarter sync heuristics based on token entropy (a toy version is sketched below), and exploring logit‑averaging during fine‑tuning to reduce runtime overhead.
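
One of these directions, entropy-based synchronization, could plausibly be prototyped as a small check that replaces the fixed `sync_interval` test in the earlier sketch. The function below is an illustration of that idea, not the authors' method: merge only when the streams' next-token distributions are, on average, uncertain.

```python
import torch

def should_sync(logits, threshold=3.0):
    """Hypothetical entropy-based sync test (illustrative, not from the paper).

    logits: (k, vocab) next-token logits, one row per parallel stream.
    Returns True when the streams' average next-token entropy (in nats)
    exceeds the threshold, i.e., when the model is uncertain enough that
    merging the k distributions is likely to help.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)  # (k,)
    return entropy.mean().item() > threshold
```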

ThinkMerge demonstrates that a modest amount of parallelism at inference time can unlock noticeable performance lifts for open‑ended reasoning tasks—offering a practical, low‑friction upgrade path for developers building next‑generation AI assistants and code‑generation tools.

Authors

  • Haonan Wang
  • Chao Du
  • Kenji Kawaguchi
  • Tianyu Pang

Paper Information

  • arXiv ID: 2512.02874v1
  • Categories: cs.CL
  • Published: December 2, 2025