[Paper] Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Source: arXiv - 2512.15662v1
Overview
The paper introduces Stepwise Think‑Critique (STC), a new training framework that teaches a single large language model (LLM) to reason and self‑evaluate simultaneously, step by step. By weaving a “critique” phase into every reasoning turn, STC mimics how humans iteratively check their own thoughts, leading to more reliable and transparent problem‑solving—especially on tough math and logic tasks.
Key Contributions
- Unified reasoning‑and‑critique loop: STC interleaves a “think” step (generating a reasoning fragment) with a “critique” step (self‑checking that fragment) inside the same model, eliminating the need for separate verifier modules.
- Hybrid reinforcement‑learning objective: The authors combine a standard reasoning reward (correctness of the final answer) with a critique‑consistency reward that encourages the model’s self‑critiques to align with the eventual outcome.
- Interpretability boost: The alternating think/critique trace is human‑readable, making it easier to debug why a model succeeded or failed.
- Strong empirical gains: On several mathematical reasoning benchmarks (e.g., GSM8K, MATH), STC outperforms strong baselines that use either pure chain‑of‑thought prompting or post‑hoc verification.
- Proof‑of‑concept for “critical thinking” LLMs: Demonstrates that a single model can learn to evaluate its own reasoning without external tools, a step toward more autonomous AI assistants.
Methodology
- Prompt design: Each inference round is split into two sub‑prompts (the full loop is sketched in code after this list):
  - Think: “Generate the next reasoning step to solve the problem.”
  - Critique: “Check the just‑generated step for logical errors, missing pieces, or contradictions.”
  The model receives both the problem statement and the previous think/critique pairs as context.
- Training data: The authors construct a synthetic dataset where each solution is annotated with both the correct reasoning steps and a human‑written critique for each step.
- Hybrid RL fine‑tuning (a small reward sketch also follows the list):
  - Reasoning reward (R₁): positive when the final answer matches the ground truth.
  - Critique‑consistency reward (R₂): positive when the model’s critique correctly predicts whether the current step will lead to a correct final answer.
  - The total reward is the weighted sum R = λ·R₁ + (1‑λ)·R₂; Proximal Policy Optimization (PPO) is used to update the model.
- Inference: At test time the model alternates between think and critique until it emits a STOP token, then produces the final answer. No external verifier is needed.
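To make the prompt design and inference procedure concrete, here is a minimal Python sketch of the alternating think/critique loop, assuming a `generate(prompt)` callable that wraps the fine‑tuned model. The two prompt strings follow the paper’s description, but the `STOP` token string, the round cap, and the helper names are illustrative assumptions rather than the authors’ exact decoding setup.

```python
# Minimal sketch of the alternating think/critique inference loop (assumptions noted above).

THINK_PROMPT = "Generate the next reasoning step to solve the problem."
CRITIQUE_PROMPT = ("Check the just-generated step for logical errors, "
                   "missing pieces, or contradictions.")
STOP_TOKEN = "STOP"   # token name assumed; the paper only says a stop token is emitted
MAX_ROUNDS = 16       # illustrative cap on the number of think/critique rounds


def solve(problem: str, generate) -> str:
    """Alternate think and critique turns until the stop token appears,
    then ask the same model for the final answer (no external verifier)."""
    trace = [f"Problem: {problem}"]
    for _ in range(MAX_ROUNDS):
        context = "\n".join(trace)
        step = generate(f"{context}\n{THINK_PROMPT}")          # think turn
        trace.append(f"Think: {step}")
        critique = generate(f"{context}\nThink: {step}\n{CRITIQUE_PROMPT}")  # critique turn
        trace.append(f"Critique: {critique}")
        if STOP_TOKEN in step or STOP_TOKEN in critique:
            break
    return generate("\n".join(trace) + "\nFinal answer:")
```

Any instruction‑following checkpoint could sit behind `generate`; the point of the sketch is the control flow, namely that each step’s critique is produced by the same model before the next think turn.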
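Likewise, here is a minimal sketch of how the hybrid reward R = λ·R₁ + (1‑λ)·R₂ could be computed for one trajectory and then fed to PPO. The exact‑match logic, the per‑step critique labels, and the default λ = 0.5 are assumptions for illustration; the paper’s implementation details are not reproduced here.

```python
# Minimal sketch of the hybrid reward R = lam * R1 + (1 - lam) * R2 (assumptions noted above).

def reasoning_reward(final_answer: str, ground_truth: str) -> float:
    """R1: 1.0 if the final answer matches the ground truth, else 0.0."""
    return 1.0 if final_answer.strip() == ground_truth.strip() else 0.0


def critique_consistency_reward(critique_says_ok: list[bool],
                                step_leads_to_correct: list[bool]) -> float:
    """R2: fraction of steps where the self-critique correctly predicts whether
    that step leads to a correct final answer (per-step labels are assumed given)."""
    if not critique_says_ok:
        return 0.0
    hits = sum(c == s for c, s in zip(critique_says_ok, step_leads_to_correct))
    return hits / len(critique_says_ok)


def hybrid_reward(r1: float, r2: float, lam: float = 0.5) -> float:
    """Total reward R = lam * R1 + (1 - lam) * R2; lam = 0.5 is a placeholder value."""
    return lam * r1 + (1.0 - lam) * r2
```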
Results & Findings
| Benchmark | Baseline (Chain‑of‑Thought) | Baseline + Post‑hoc Verifier | STC |
|---|---|---|---|
| GSM8K | 71.2 % | 73.8 % | 78.5 % |
| MATH (level‑1) | 38.4 % | 41.1 % | 46.9 % |
| MATH (level‑2) | 21.7 % | 24.3 % | 30.2 % |
- Higher accuracy: STC consistently beats both pure reasoning and reasoning‑plus‑verifier pipelines, especially on harder problems where step‑wise self‑checking matters most.
- More interpretable traces: Human evaluators rated STC’s reasoning logs as clearer and easier to follow than those from standard chain‑of‑thought models.
- Robustness to prompt variations: Because critique is learned jointly, the model is less sensitive to minor wording changes in the prompt.
Practical Implications
- Simpler AI stacks: Developers can replace a two‑model architecture (reasoner + external verifier) with a single STC‑enabled model, reducing latency, memory footprint, and engineering overhead.
- Debuggable assistants: The think‑critique transcript acts like a built‑in audit log, helping engineers pinpoint where a model went wrong without needing separate tracing tools.
- Safer code generation & data analysis: Critical‑thinking loops can be applied to any domain where stepwise correctness matters—e.g., generating SQL queries, composing API calls, or performing symbolic math in scientific notebooks.
- Better user experience: End‑users can be shown the model’s self‑critiques, increasing trust (e.g., “I think this step might be off because …”).
- Foundation for autonomous agents: Future agents that plan and act in the world could embed STC‑style self‑evaluation to catch planning errors before execution, lowering the risk of costly mistakes.
Limitations & Future Work
- Training data bottleneck: The current approach relies on manually annotated critiques for each reasoning step, which is expensive to scale to broader domains.
- Computation cost: Alternating think/critique doubles the number of forward passes per inference step, increasing latency compared with a single chain‑of‑thought pass.
- Domain transfer: Experiments focus on math; it remains open how well STC generalizes to non‑numeric reasoning (e.g., legal reasoning, code synthesis).
- Future directions: The authors suggest exploring self‑generated critiques (bootstrapping without human labels), curriculum learning for longer reasoning chains, and integrating STC with tool‑use APIs (e.g., calculators, code interpreters) to further boost robustness.
Authors
- Jiaqi Xu
- Cuiling Lan
- Xuejin Chen
- Yan LU
Paper Information
- arXiv ID: 2512.15662v1
- Categories: cs.AI
- Published: December 17, 2025