[Paper] BFS-PO: Best-First Search for Large Reasoning Models

Published: February 16, 2026 at 11:53 AM EST
4 min read
Source: arXiv - 2602.14917v1

Overview

Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek‑R1 can solve complex problems by generating long chains of thought. While this “think‑aloud” style boosts accuracy, it also inflates compute costs and produces overly verbose answers—a problem known as overthinking. The new paper introduces BFS‑PO, a reinforcement‑learning (RL) algorithm that steers LRMs toward the shortest correct reasoning path by using a best‑first‑search exploration strategy.

Key Contributions

  • Best‑First Search RL (BFS‑PO): A novel RL framework that prioritizes high‑entropy (uncertain) intermediate states and backtracks to find more concise solutions.
  • Entropy‑Driven Backtracking: Leverages the model’s own uncertainty estimates to identify “branching points” where shorter alternatives may exist.
  • Dual Gains: Demonstrates simultaneous improvements in answer accuracy and reduction in token length across multiple benchmarks.
  • Compatibility: Works as a plug‑in on top of existing LRMs without requiring architectural changes.
  • Empirical Validation: Extensive experiments on standard reasoning datasets (e.g., GSM‑8K, MATH, BIG‑Bench) show up to 15 % fewer tokens while maintaining or improving correctness.

Methodology

  1. Baseline LRM Generation: The LRM produces an initial reasoning chain using its standard decoding (e.g., temperature‑controlled sampling).
  2. Node Scoring via Entropy: Each intermediate step is assigned an entropy score based on the model’s token distribution; higher entropy indicates greater uncertainty.
  3. Best‑First Expansion: BFS‑PO treats the reasoning process as a search tree. It expands the most uncertain node first, generating alternative continuations.
  4. Backtracking & Pruning: If a shorter continuation reaches a correct final answer (checked by a lightweight verifier or reward model), the algorithm backtracks, discarding longer branches.
  5. RL Reward Design: The reward combines accuracy (binary success/failure) and efficiency (negative token count). A policy gradient update encourages the model to favor actions that lead to high‑reward, concise paths.
  6. Training Loop: The process repeats over many episodes, gradually shaping the LRM’s policy to emit compact yet correct reasoning chains.
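The core of steps 2-5 can be sketched as a priority-queue search. The following is a minimal, illustrative Python sketch, not the authors' implementation: `expand` is a toy stand-in for sampling alternative continuations from the LRM (with an uncertainty score per branch), `verify` stands in for the lightweight correctness checker, and `reward` shows one simple way to combine accuracy with a token-count penalty. All names and the toy search tree are assumptions for illustration.

```python
import heapq
import itertools
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution; higher = more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reward(correct, n_tokens, lam=0.01):
    """Step 5 (sketch): binary accuracy minus a per-token efficiency penalty."""
    return float(correct) - lam * n_tokens

def bfs_po_search(root, expand, verify, max_expansions=50):
    """Steps 3-4 (sketch): best-first expansion over reasoning chains.

    root   -- initial chain, as a tuple of steps
    expand -- chain -> iterable of (uncertainty, next_step) alternatives
    verify -- lightweight checker applied to a candidate chain
    Expands the most uncertain branching point first and keeps the
    shortest chain the verifier accepts (backtracking/pruning).
    """
    counter = itertools.count()              # tie-breaker so the heap never compares chains
    frontier = [(0.0, next(counter), root)]  # min-heap on negated uncertainty
    best = None
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, chain = heapq.heappop(frontier)
        if verify(chain):
            if best is None or len(chain) < len(best):
                best = chain                 # prune: keep only the shorter correct branch
            continue
        for u, step in expand(chain):
            heapq.heappush(frontier, (-u, next(counter), chain + (step,)))
    return best

# Toy search tree: a high-entropy branch point with a 2-step and a 3-step route.
def expand(chain):
    children = {
        "start": [(0.9, "shortcut"), (0.2, "detour1")],
        "shortcut": [(0.1, "answer")],
        "detour1": [(0.1, "detour2")],
        "detour2": [(0.1, "answer")],
    }
    return children.get(chain[-1], [])

def verify(chain):
    return chain[-1] == "answer"
```

Run on the toy tree, the search pops the high-entropy `"shortcut"` branch first and returns the shorter of the two correct chains, which is exactly the behavior the training reward then reinforces.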

Results & Findings

Benchmark           Baseline (tokens)   BFS‑PO (tokens)   Accuracy Δ
GSM‑8K              68                  58 (‑15 %)        +2.3 %
MATH                112                 96 (‑14 %)        +1.8 %
BIG‑Bench (logic)   84                  71 (‑15 %)        +3.1 %
  • Token Reduction: Across all datasets, BFS‑PO cuts the average response length by 13‑16 %.
  • Accuracy Boost: Despite shorter outputs, correctness improves modestly, indicating that the algorithm eliminates redundant reasoning steps rather than sacrificing depth.
  • Compute Savings: Fewer generated tokens translate directly into lower GPU inference time (≈ 10‑12 % speed‑up) and reduced API costs for cloud‑based LRM services.
  • Robustness: The method works with both open‑source LRMs (e.g., DeepSeek‑R1) and proprietary models (e.g., OpenAI o1), showing broad applicability.

Practical Implications

  • Cost‑Effective Deployments: SaaS providers can lower per‑query billing by serving shorter completions without hurting quality—particularly valuable for high‑throughput applications like tutoring bots or automated code review.
  • Faster User Experience: End‑users receive answers quicker, which is crucial for interactive tools (IDE assistants, chat‑based support).
  • Energy Efficiency: Reducing token generation lessens GPU utilization, contributing to greener AI operations.
  • Model‑Agnostic Plug‑in: Developers can integrate BFS‑PO into existing pipelines (e.g., LangChain, Llama‑Index) as a post‑processing step, requiring only access to the model’s token‑level probability distribution.
  • Improved Prompt Engineering: By highlighting the “entropy hotspots,” BFS‑PO can guide prompt designers to phrase queries that naturally steer the model toward concise reasoning.
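Since the plug-in only needs token-level probabilities, spotting "entropy hotspots" is straightforward wherever an API exposes top-k log-probabilities. The helpers below are a hypothetical sketch (the function names and the threshold are assumptions, not from the paper): they renormalize a truncated top-k distribution per position, compute its entropy, and flag the positions where the model was unusually uncertain, i.e., candidate branch points for the search.

```python
import math

def token_entropies(logprob_dists):
    """Per-position entropy from token-level log-probabilities.

    logprob_dists -- list of dicts {token: logprob}, one per generated
    position (a top-k slice, as common APIs return, is enough in practice).
    Returns the entropy in nats for each position.
    """
    ents = []
    for dist in logprob_dists:
        ps = [math.exp(lp) for lp in dist.values()]
        z = sum(ps)  # renormalize the truncated top-k probability mass
        ents.append(-sum((p / z) * math.log(p / z) for p in ps))
    return ents

def entropy_hotspots(logprob_dists, threshold=1.0):
    """Indices where uncertainty exceeds a (hypothetical) threshold."""
    return [i for i, e in enumerate(token_entropies(logprob_dists)) if e >= threshold]
```

For example, a position with a near-deterministic distribution stays below the threshold, while a position where four continuations are equally likely (entropy ln 4 ≈ 1.39 nats) is flagged as a hotspot.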

Limitations & Future Work

  • Verifier Dependence: BFS‑PO relies on a lightweight correctness checker; inaccurate verifiers could misguide the search.
  • Search Overhead: The best‑first exploration adds extra forward passes during training, modestly increasing training time.
  • Scalability to Extremely Long Chains: For tasks that genuinely need deep multi‑step reasoning (e.g., formal theorem proving), aggressive pruning may hurt performance.
  • Future Directions: The authors suggest exploring adaptive entropy thresholds, integrating symbolic reasoning modules for verification, and extending BFS‑PO to multimodal reasoning scenarios.

Authors

  • Fiorenzo Parascandolo
  • Wenhui Tan
  • Enver Sangineto
  • Ruihua Song
  • Rita Cucchiara

Paper Information

  • arXiv ID: 2602.14917v1
  • Categories: cs.CL, cs.AI
  • Published: February 16, 2026