[Paper] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Published: January 29, 2026 at 01:56 PM EST
4 min read
Source: arXiv - 2601.22139v1

Overview

The paper introduces Proactive Interactive Reasoning (PIR), a new paradigm that turns reasoning‑focused large language models (LLMs) from passive “think‑alone” systems into active inquirers that ask clarification questions when they hit ambiguous or missing information. By blending reasoning with user interaction, PIR tackles uncertainty at the premise and intent level—something traditional chain‑of‑thought (CoT) or tool‑augmented approaches don’t address.

Key Contributions

  • Proactive Interaction Paradigm: Shifts LLMs from blind self‑thinking to an interactive loop that interleaves reasoning steps with clarification queries.
  • Uncertainty‑Aware Fine‑Tuning: A supervised fine‑tuning stage that teaches the model to recognize when it lacks sufficient information and to formulate useful questions.
  • Policy Optimization with User Simulator: Uses a simulated user to train a policy that balances asking questions, solving the task, and respecting user intent, guided by a composite reward (accuracy, efficiency, user satisfaction).
  • Broad Empirical Validation: Demonstrates consistent gains across three domains—mathematical problem solving, code generation, and document editing—outperforming strong baselines by up to 32.7 % accuracy, 22.9 % pass rate, and 41.36 BLEU points.
  • Efficiency Gains: Cuts nearly 50 % of reasoning compute and reduces unnecessary interaction turns, making the system faster and cheaper to run.
  • Robust Generalization: Shows strong performance on out‑of‑distribution tasks such as factual QA, missing‑premise reasoning, and knowledge‑uncertainty scenarios.

Methodology

  1. Uncertainty Detection

    • The model is first fine‑tuned on a curated dataset where each reasoning step is labeled with an “uncertainty flag” indicating whether the model should continue reasoning or ask a question.
    • Features such as low confidence scores, contradictory evidence, or missing variables trigger the flag.
  2. Interactive Reasoning Loop

    • Step 1 – Reason: The LLM generates a partial reasoning trace.
    • Step 2 – Evaluate: A lightweight classifier checks the uncertainty flag.
    • Step 3 – Query (if needed): The model produces a concise clarification question aimed at the user (or simulated user).
    • Step 4 – Incorporate Answer: The user’s response is appended to the context, and the model resumes reasoning.
  3. Policy Optimization

    • A user simulator mimics realistic answers (including occasional misunderstandings), enabling large‑scale training without human annotators.
    • A composite reward combines task accuracy, number of interaction turns, and a “user‑intent alignment” score.
    • Reinforcement learning (e.g., PPO) updates the model’s policy to ask the right questions at the right time.
  4. Evaluation Suite

    • Benchmarks span MATH (symbolic math), HumanEval (code generation), and DocEdit (document editing).
    • Additional reliability tests probe factual correctness and handling of missing premises.
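The interactive reasoning loop described in steps 1–4 above can be sketched as a small driver function. This is a minimal illustration, not the paper's implementation: `generate_step`, `is_uncertain`, `make_query`, and `ask_user` are hypothetical stand-ins for the model's reasoning step, the lightweight uncertainty classifier, the question generator, and the (simulated) user.

```python
def pir_loop(question, generate_step, is_uncertain, make_query, ask_user,
             max_steps=16, max_turns=4):
    """Interleave reasoning with clarification questions (PIR steps 1-4).

    Returns the final answer string (or None if no answer was reached)
    and the number of clarification turns used.
    """
    context = [question]
    turns = 0
    for _ in range(max_steps):
        step = generate_step(context)              # Step 1: reason
        if step.startswith("ANSWER:"):             # terminal reasoning step
            return step, turns
        context.append(step)
        if turns < max_turns and is_uncertain(context):  # Step 2: evaluate
            query = make_query(context)            # Step 3: ask a question
            answer = ask_user(query)               # Step 4: incorporate reply
            context.append(f"USER: {answer}")
            turns += 1
    return None, turns
```

Capping `max_turns` mirrors the paper's efficiency objective: the policy is rewarded for asking fewer, more informative questions rather than querying at every step.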
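The composite reward used during policy optimization can likewise be sketched as a weighted sum. The weights and the linear form here are illustrative assumptions, not the paper's actual reward specification.

```python
def composite_reward(correct, turns, intent_alignment,
                     w_acc=1.0, w_turn=0.1, w_intent=0.5):
    """Illustrative composite reward for the PIR policy.

    correct          -- whether the final answer was right (task accuracy)
    turns            -- number of clarification turns (penalized for efficiency)
    intent_alignment -- user-intent alignment score in [0, 1]
    Weights are hypothetical; the paper does not publish exact values here.
    """
    return w_acc * float(correct) - w_turn * turns + w_intent * intent_alignment
```

A scalar reward of this shape is what an RL algorithm such as PPO would maximize, trading off solving the task against interaction cost and respecting user intent.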

Results & Findings

| Domain | Baseline (CoT) | PIR | Accuracy ↑ | Pass Rate ↑ | BLEU ↑ | Reasoning Compute ↓ |
|---|---|---|---|---|---|---|
| Math (MATH) | 58.1 % | 77.6 % | +32.7 % | – | – | ~−48 % |
| Code (HumanEval) | 45.3 % | 58.9 % | +13.6 % | +22.9 % | – | ~−45 % |
| Document Editing | 61.2 % | 73.8 % | +12.6 % | – | +41.36 | ~−50 % |

  • Interaction Efficiency: Average number of clarification turns dropped from 3.8 (baseline) to 2.1, showing the model learns to ask fewer, more informative questions.
  • Generalization: On unseen factual QA sets, PIR maintained a +9 % accuracy lift over CoT, indicating the uncertainty‑aware policy transfers beyond the training domains.
  • Ablation: Removing the uncertainty‑aware fine‑tuning or the RL‑based policy each caused a 10‑15 % drop, confirming both components are essential.

Practical Implications

  • Developer Assistants: IDE plugins can embed PIR‑enabled LLMs that ask developers for missing specifications (e.g., “What should the function return when input is empty?”) before generating code, reducing bugs and re‑writes.
  • Customer‑Facing Bots: Support chatbots can proactively clarify ambiguous user requests, leading to higher resolution rates without escalating to human agents.
  • Data‑Cleaning & ETL Pipelines: Automated scripts can query data owners when encountering missing fields, making pipelines more resilient to incomplete datasets.
  • Education Tech: Tutoring systems can detect when a student’s answer lacks a key premise and ask targeted hints, improving learning outcomes.
  • Cost Savings: Halving the reasoning compute translates directly into lower cloud inference costs, especially for large models (e.g., 70B‑parameter LLMs) used at scale.

Limitations & Future Work

  • User Simulator Fidelity: The current simulator may not capture the full variability of real‑world user responses, potentially over‑optimizing for ideal interactions.
  • Latency Overhead: While fewer reasoning steps are needed, each interaction introduces a round‑trip latency that could affect real‑time applications.
  • Domain‑Specific Prompting: The uncertainty detection fine‑tuning was performed on a limited set of tasks; extending to highly specialized domains (e.g., legal reasoning) may require additional data.

Future Directions

  • Incorporate human‑in‑the‑loop reinforcement learning to refine the policy with real user feedback.
  • Explore multi‑turn negotiation strategies where the model can refine its own questions based on partial answers.
  • Combine PIR with external tool use (e.g., calculators, code interpreters) to handle both knowledge gaps and premise uncertainties simultaneously.

Authors

  • Xin Chen
  • Feng Jiang
  • Yiqian Zhang
  • Hardy Chen
  • Shuo Yan
  • Wenya Xie
  • Min Yang
  • Shujian Huang

Paper Information

  • arXiv ID: 2601.22139v1
  • Categories: cs.CL, cs.AI
  • Published: January 29, 2026
  • PDF: Download PDF