[Paper] Rectifying LLM Thought from Lens of Optimization

Published: December 1, 2025 at 12:41 PM EST
4 min read
Source: arXiv - 2512.01925v1

Overview

The paper “Rectifying LLM Thought from Lens of Optimization” re‑examines chain‑of‑thought (CoT) prompting as an optimization problem rather than a purely linguistic one. By treating each reasoning step as a gradient‑descent update, the authors devise a post‑training technique—RePro (Rectifying Process‑level Reward)—that rewards LLMs for producing concise, stable, and goal‑directed reasoning traces. Experiments show that RePro consistently sharpens performance on math, science, and coding benchmarks while curbing “overthinking” behaviors.

Key Contributions

  • Optimization framing of CoT: Formalizes reasoning chains as iterative updates toward a solution, analogous to gradient descent.
  • Process‑level reward design: Introduces two complementary scores—intensity (how much each step reduces the residual error) and stability (variance of updates across steps)—combined into a single reward signal.
  • RePro integration with RLVR: Seamlessly plugs the process‑level reward into existing Reinforcement Learning with Verifiable Rewards pipelines, enabling fine‑tuning without altering the underlying model architecture.
  • Broad empirical validation: Demonstrates gains across multiple RL algorithms (PPO, DPO, RLAIF) and a variety of LLM sizes (7B‑70B) on benchmark suites such as MATH, GSM‑8K, ScienceQA, and HumanEval.
  • Mitigation of overthinking: Shows that RePro reduces excessively long reasoning chains while preserving or improving answer correctness.

Methodology

  1. Viewing CoT as Gradient Descent

    • Each token or reasoning step is interpreted as an update \( \theta_{t+1} = \theta_t - \eta \nabla L_t \) that moves the model’s internal “state” closer to the correct answer.
    • The authors define a surrogate loss \( \tilde{L}_t \) based on the distance between the current partial answer and the ground‑truth solution.
  2. Scoring the Optimization Process

    • Intensity Score: Measures the reduction in surrogate loss between consecutive steps (larger reduction → higher intensity).
    • Stability Score: Computes the variance of intensity across the chain; low variance indicates a steady, purposeful reasoning trajectory.
  3. Composite Process‑level Reward
    \[ R_{\text{process}} = \lambda_{\text{int}} \cdot \text{Intensity} + \lambda_{\text{stab}} \cdot \text{Stability} \]
    Hyper‑parameters \( \lambda_{\text{int}} \) and \( \lambda_{\text{stab}} \) are tuned to balance brevity and thoroughness.

  4. Integration with RLVR
    \[ \max_{\pi} \; \mathbb{E}_{\pi}\big[ R_{\text{task}} + R_{\text{process}} - \beta \, \text{KL}(\pi \,\|\, \pi_{\text{ref}}) \big] \]
    This encourages the policy to generate reasoning traces that are both correct and optimization‑efficient.

  5. Training Pipeline

    • Start from a pre‑trained LLM, collect CoT demonstrations, compute process scores on‑the‑fly, and fine‑tune with PPO (or another RL algorithm) using the augmented reward; minimal sketches of the scoring and the augmented reward follow below.
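
To make the scoring above concrete, the following minimal Python sketch (not the authors’ implementation) turns per‑step surrogate losses into intensity and stability scores and the composite process‑level reward. The `distance` callback, the sign convention for stability (negative variance, so steadier traces score higher), and the default weights are illustrative assumptions.

```python
import statistics

def surrogate_losses(partial_answers, target, distance):
    """Surrogate loss L~_t for each step: distance between the partial answer
    after step t and the ground-truth solution. `distance` is a task-specific
    metric supplied by the caller (a placeholder here, not from the paper)."""
    return [distance(ans, target) for ans in partial_answers]

def process_reward(losses, lambda_int=1.0, lambda_stab=0.5):
    """Composite process-level reward in the spirit of RePro:
    intensity = mean per-step reduction in surrogate loss,
    stability = negative variance of those reductions (steadier is better).
    Weights and the stability sign convention are illustrative assumptions."""
    deltas = [prev - curr for prev, curr in zip(losses, losses[1:])]
    if not deltas:
        return 0.0
    intensity = sum(deltas) / len(deltas)
    stability = -statistics.pvariance(deltas) if len(deltas) > 1 else 0.0
    return lambda_int * intensity + lambda_stab * stability

# Toy usage: partial answers steadily approaching the target value 1.0.
losses = surrogate_losses([0.0, 0.3, 0.55, 0.8, 1.0], 1.0,
                          distance=lambda a, b: abs(a - b))
print(process_reward(losses))  # ≈ 0.25: steady, efficient progress is rewarded
```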
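
The augmented objective of step 4 can then be assembled per sampled trace. The sketch below is a simplified, assumed rendering: verifiable task reward plus process‑level reward, minus a β‑weighted per‑sample KL estimate against a reference policy (the log‑prob approximation of the KL term and the β value are assumptions, not details from the paper).

```python
def augmented_reward(r_task, r_process, logprob_pi, logprob_ref, beta=0.05):
    """Scalar reward for one sampled reasoning trace, in the spirit of the
    RLVR objective above: task reward + process reward, minus a KL-style
    penalty that keeps the policy close to a reference model. The log-prob
    difference is a crude per-sample KL estimate, used here for illustration."""
    kl_estimate = logprob_pi - logprob_ref
    return r_task + r_process - beta * kl_estimate

# Example: correct final answer (r_task = 1.0), an efficient trace
# (r_process = 0.25), and a small drift from the reference policy.
print(augmented_reward(1.0, 0.25, logprob_pi=-12.1, logprob_ref=-12.6))  # 1.225
```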

Results & Findings

| Model / RL Alg. | Baseline (Task‑only) | + RePro | Δ Accuracy | Avg. CoT Length ↓ |
| --- | --- | --- | --- | --- |
| LLaMA‑13B + PPO | 68.2 % (MATH) | 71.9 % | +3.7 % | –12 % |
| GPT‑Neo‑6B + DPO | 61.5 % (GSM‑8K) | 64.8 % | +3.3 % | –15 % |
| CodeLlama‑34B + RLAIF | 78.4 % (HumanEval) | 81.2 % | +2.8 % | –9 % |
  • Consistent gains across domains: mathematics (MATH, GSM‑8K), science (ScienceQA), and programming (HumanEval).
  • Reduced overthinking: Average chain‑of‑thought length shrank by 9‑15 % without sacrificing answer quality.
  • Stability: Variance of intensity scores dropped, indicating smoother optimization trajectories.
  • Ablation: Removing either intensity or stability component degraded performance, confirming the need for both aspects.

Practical Implications

  • Sharper AI assistants: Developers can embed RePro‑fine‑tuned models in chatbots or coding assistants that give concise, well‑structured explanations, improving user trust and reducing latency.
  • Cost‑effective inference: Shorter reasoning chains translate to fewer token generations, lowering API usage costs and speeding up response times.
  • Better debugging tools: The process‑level scores can be exposed as diagnostics, helping engineers pinpoint where a model “gets stuck” during reasoning (a brief example follows this list).
  • Cross‑task adaptability: Since RePro works as a plug‑in reward, it can be applied to any downstream RL‑based fine‑tuning pipeline (e.g., instruction following, tool use) without re‑architecting the model.
  • Safety & alignment: By discouraging endless speculation, RePro may reduce the risk of hallucinations that arise from overly long, unfocused CoT generations.
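
To illustrate the debugging point above, per‑step loss reductions could be surfaced as a simple diagnostic report. The format below is hypothetical and only meant to show the idea; it reuses the notion of a per‑step surrogate loss from the methodology section.

```python
def reasoning_diagnostics(step_texts, losses):
    """Per-step report: how much each reasoning step reduced the surrogate
    loss. Near-zero or negative drops flag where the model may be stuck.
    Hypothetical diagnostic format, not an interface described in the paper."""
    for step, prev, curr in zip(step_texts, losses, losses[1:]):
        drop = prev - curr
        flag = "  <-- possible stall" if drop <= 0 else ""
        print(f"{step:<22} loss {prev:.2f} -> {curr:.2f} (drop {drop:+.2f}){flag}")

# Toy trace with one unproductive step.
reasoning_diagnostics(
    ["set up equation", "restate the problem", "simplify", "solve for x"],
    [1.0, 0.6, 0.6, 0.1, 0.0],
)
```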

Limitations & Future Work

  • Surrogate loss design: The current proxy for reasoning progress relies on handcrafted distance metrics; more principled, task‑agnostic measures could improve robustness.
  • Scalability to extremely large models: Experiments were limited to models of at most 70 B parameters; it remains to be seen how RePro behaves with frontier‑scale systems under limited fine‑tuning budgets.
  • Generalization to non‑CoT tasks: The method assumes an explicit reasoning trace; applying a similar optimization lens to single‑shot or retrieval‑augmented generation is an open question.
  • Human evaluation: While automatic metrics improved, user studies on perceived explanation quality and trustworthiness are needed to validate real‑world impact.

Bottom line: RePro offers a practical, optimization‑driven recipe for making LLMs think more efficiently—an advance that can directly benefit developers building smarter, faster, and more reliable AI‑powered applications.

Authors

  • Junnan Liu
  • Hongwei Liu
  • Songyang Zhang
  • Kai Chen

Paper Information

  • arXiv ID: 2512.01925v1
  • Categories: cs.CL, cs.AI
  • Published: December 1, 2025