[Paper] BFS-PO: Best-First Search for Large Reasoning Models

Published: February 16, 2026 at 11:53 AM EST
4 min read
Source: arXiv - 2602.14917v1

Overview

Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek‑R1 can solve complex problems by generating long chains of thought. While this “think‑aloud” style boosts accuracy, it also inflates compute costs and produces overly verbose answers—a problem known as overthinking. The new paper introduces BFS‑PO, a reinforcement‑learning (RL) algorithm that steers LRMs toward the shortest correct reasoning path by using a best‑first‑search exploration strategy.

Key Contributions

  • Best‑First Search RL (BFS‑PO): A novel RL framework that prioritizes high‑entropy (uncertain) intermediate states and backtracks to find more concise solutions.
  • Entropy‑Driven Backtracking: Leverages the model’s own uncertainty estimates to identify “branching points” where shorter alternatives may exist.
  • Dual Gains: Demonstrates simultaneous improvements in answer accuracy and reduction in token length across multiple benchmarks.
  • Compatibility: Works as a plug‑in on top of existing LRMs without requiring architectural changes.
  • Empirical Validation: Extensive experiments on standard reasoning datasets (e.g., GSM‑8K, MATH, BIG‑Bench) show up to 15 % fewer tokens while maintaining or improving correctness.

Methodology

  1. Baseline LRM Generation: The LRM produces an initial reasoning chain using its standard decoding (e.g., temperature‑controlled sampling).
  2. Node Scoring via Entropy: Each intermediate step is assigned an entropy score based on the model’s token distribution; higher entropy indicates greater uncertainty.
  3. Best‑First Expansion: BFS‑PO treats the reasoning process as a search tree. It expands the most uncertain node first, generating alternative continuations.
  4. Backtracking & Pruning: If a shorter continuation reaches a correct final answer (checked by a lightweight verifier or reward model), the algorithm backtracks, discarding longer branches.
  5. RL Reward Design: The reward combines accuracy (binary success/failure) and efficiency (negative token count). A policy gradient update encourages the model to favor actions that lead to high‑reward, concise paths.
  6. Training Loop: The process repeats over many episodes, gradually shaping the LRM’s policy to emit compact yet correct reasoning chains.
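The core of steps 2-5 can be sketched as a priority-queue search. The following is a minimal, illustrative Python sketch, not the authors' implementation: `expand` is a toy stand-in for sampling alternative continuations from the LRM (with an uncertainty score per branch), `verify` stands in for the lightweight correctness checker, and `reward` shows one simple way to combine accuracy with a token-count penalty. All names and the toy search tree are assumptions for illustration.

```python
import heapq
import itertools
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution; higher = more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reward(correct, n_tokens, lam=0.01):
    """Step 5 (sketch): binary accuracy minus a per-token efficiency penalty."""
    return float(correct) - lam * n_tokens

def bfs_po_search(root, expand, verify, max_expansions=50):
    """Steps 3-4 (sketch): best-first expansion over reasoning chains.

    root   -- initial chain, as a tuple of steps
    expand -- chain -> iterable of (uncertainty, next_step) alternatives
    verify -- lightweight checker applied to a candidate chain
    Expands the most uncertain branching point first and keeps the
    shortest chain the verifier accepts (backtracking/pruning).
    """
    counter = itertools.count()              # tie-breaker so the heap never compares chains
    frontier = [(0.0, next(counter), root)]  # min-heap on negated uncertainty
    best = None
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, chain = heapq.heappop(frontier)
        if verify(chain):
            if best is None or len(chain) < len(best):
                best = chain                 # prune: keep only the shorter correct branch
            continue
        for u, step in expand(chain):
            heapq.heappush(frontier, (-u, next(counter), chain + (step,)))
    return best

# Toy search tree: a high-entropy branch point with a 2-step and a 3-step route.
def expand(chain):
    children = {
        "start": [(0.9, "shortcut"), (0.2, "detour1")],
        "shortcut": [(0.1, "answer")],
        "detour1": [(0.1, "detour2")],
        "detour2": [(0.1, "answer")],
    }
    return children.get(chain[-1], [])

def verify(chain):
    return chain[-1] == "answer"
```

Run on the toy tree, the search pops the high-entropy `"shortcut"` branch first and returns the shorter of the two correct chains, which is exactly the behavior the training reward then reinforces.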

Results & Findings

Benchmark           Baseline (tokens)   BFS‑PO (tokens)   Accuracy Δ
GSM‑8K              68                  58 (‑15 %)        +2.3 %
MATH                112                 96 (‑14 %)        +1.8 %
BIG‑Bench (logic)   84                  71 (‑15 %)        +3.1 %
  • Token Reduction: Across all datasets, BFS‑PO cuts the average response length by 13‑16 %.
  • Accuracy Boost: Despite shorter outputs, correctness improves modestly, indicating that the algorithm eliminates redundant reasoning steps rather than sacrificing depth.
  • Compute Savings: Fewer generated tokens translate directly into lower GPU inference time (≈ 10‑12 % speed‑up) and reduced API costs for cloud‑based LRM services.
  • Robustness: The method works with both open‑source LRMs (e.g., DeepSeek‑R1) and proprietary models (e.g., OpenAI o1), showing broad applicability.

Practical Implications

  • Cost‑Effective Deployments: SaaS providers can lower per‑query billing by serving shorter completions without hurting quality—particularly valuable for high‑throughput applications like tutoring bots or automated code review.
  • Faster User Experience: End‑users receive answers quicker, which is crucial for interactive tools (IDE assistants, chat‑based support).
  • Energy Efficiency: Reducing token generation lessens GPU utilization, contributing to greener AI operations.
  • Model‑Agnostic Plug‑in: Developers can integrate BFS‑PO into existing pipelines (e.g., LangChain, Llama‑Index) as a post‑processing step, requiring only access to the model’s token‑level probability distribution.
  • Improved Prompt Engineering: By highlighting the “entropy hotspots,” BFS‑PO can guide prompt designers to phrase queries that naturally steer the model toward concise reasoning.
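Since the plug-in only needs token-level probabilities, spotting "entropy hotspots" is straightforward wherever an API exposes top-k log-probabilities. The helpers below are a hypothetical sketch (the function names and the threshold are assumptions, not from the paper): they renormalize a truncated top-k distribution per position, compute its entropy, and flag the positions where the model was unusually uncertain, i.e., candidate branch points for the search.

```python
import math

def token_entropies(logprob_dists):
    """Per-position entropy from token-level log-probabilities.

    logprob_dists -- list of dicts {token: logprob}, one per generated
    position (a top-k slice, as common APIs return, is enough in practice).
    Returns the entropy in nats for each position.
    """
    ents = []
    for dist in logprob_dists:
        ps = [math.exp(lp) for lp in dist.values()]
        z = sum(ps)  # renormalize the truncated top-k probability mass
        ents.append(-sum((p / z) * math.log(p / z) for p in ps))
    return ents

def entropy_hotspots(logprob_dists, threshold=1.0):
    """Indices where uncertainty exceeds a (hypothetical) threshold."""
    return [i for i, e in enumerate(token_entropies(logprob_dists)) if e >= threshold]
```

For example, a position with a near-deterministic distribution stays below the threshold, while a position where four continuations are equally likely (entropy ln 4 ≈ 1.39 nats) is flagged as a hotspot.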

Limitations & Future Work

  • Verifier Dependence: BFS‑PO relies on a lightweight correctness checker; inaccurate verifiers could misguide the search.
  • Search Overhead: The best‑first exploration adds extra forward passes during training, modestly increasing training time.
  • Scalability to Extremely Long Chains: For tasks that genuinely need deep multi‑step reasoning (e.g., formal theorem proving), aggressive pruning may hurt performance.
  • Future Directions: The authors suggest exploring adaptive entropy thresholds, integrating symbolic reasoning modules for verification, and extending BFS‑PO to multimodal reasoning scenarios.

Authors

  • Fiorenzo Parascandolo
  • Wenhui Tan
  • Enver Sangineto
  • Ruihua Song
  • Rita Cucchiara

Paper Information

  • arXiv ID: 2602.14917v1
  • Categories: cs.CL, cs.AI
  • Published: February 16, 2026