[Paper] Escaping the Verifier: Learning to Reason via Demonstrations

Published: November 26, 2025 at 01:42 PM EST
4 min read
Source: arXiv - 2511.21667v1

Overview

The paper “Escaping the Verifier: Learning to Reason via Demonstrations” tackles a core bottleneck in training large language models (LLMs) for complex reasoning: most existing pipelines rely on a task‑specific verifier that can automatically judge whether a model’s answer is correct. In many real‑world settings such verifiers simply don’t exist, even though we often have plenty of high‑quality expert solutions (e.g., solved math problems, code reviews, or poetry drafts). The authors propose a new framework, RARO (Relativistic Adversarial Reasoning Optimization), that learns to reason directly from these demonstrations using inverse reinforcement learning, without any external verifier.

Key Contributions

  • Verifier‑free reasoning training: Introduces a method that eliminates the need for handcrafted reward models or automatic correctness checkers.
  • Adversarial relativistic critic: Designs a discriminator that learns to compare policy outputs against expert demonstrations rather than assigning absolute scores, stabilizing training.
  • Joint policy‑critic RL loop: Simultaneously updates the reasoning policy (generator) and the relativistic critic using reinforcement learning, enabling continuous improvement.
  • Stabilization toolkit: Identifies and empirically validates a set of tricks (e.g., reward clipping, curriculum pacing, entropy regularization) that make the adversarial RL loop robust.
  • Strong empirical results: Demonstrates consistent gains over verifier‑free baselines on three diverse benchmarks—Countdown (numeric reasoning), DeepMath (formal theorem proving), and Poetry Writing (creative generation).
  • Scalable performance: Shows that RARO’s scaling behavior mirrors that of verifier‑based RL, suggesting it can benefit from larger models and data.

Methodology

  1. Data assumption: Only a set of expert demonstrations (input → high‑quality answer) is required; no ground‑truth labels or automatic checkers are needed.
  2. Policy (generator): An LLM fine‑tuned to produce answers given a prompt. It is treated as an RL agent whose actions are token selections.
  3. Relativistic critic (discriminator): Instead of outputting a scalar “correctness” score, the critic receives pairs of answers, one from the policy and one from an expert, and learns to assign higher probability to the expert answer. This relativistic formulation encourages the policy to close the gap with experts (a minimal pairwise‑loss sketch follows this list).
  4. Adversarial RL loop (sketched in code after this list, together with the stabilization tricks):
    • The policy samples an answer for a given prompt.
    • The critic evaluates the (policy, expert) pair and returns a reward signal based on its confidence that the expert answer is better.
    • The policy updates via a policy‑gradient method (e.g., PPO) using this reward.
    • Simultaneously, the critic updates its parameters to better discriminate future pairs.
  5. Stabilization tricks:
    • Reward normalization to keep gradients in a sane range.
    • Curriculum sampling that gradually increases prompt difficulty.
    • Entropy bonuses to avoid premature mode collapse.
    • Replay buffer of past policy outputs to diversify critic training.
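
To make the relativistic comparison concrete, here is a minimal PyTorch sketch of a pairwise critic, assuming a Bradley‑Terry‑style logistic objective over (expert, policy) answer embeddings. The names (RelativisticCritic, critic_loss, policy_reward) and the embedding interface are illustrative assumptions, not the paper’s implementation.

```python
# Minimal sketch of a relativistic (pairwise) critic.
# Assumes answers have already been embedded into fixed-size vectors;
# all names here are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativisticCritic(nn.Module):
    """Scores a (prompt, answer) embedding; only score differences matter."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, pair_emb: torch.Tensor) -> torch.Tensor:
        return self.score(pair_emb).squeeze(-1)  # shape: (batch,)

def critic_loss(critic, expert_emb, policy_emb):
    # Train the critic to rank the expert answer above the policy answer:
    # minimize -log sigmoid(s_expert - s_policy)  (Bradley-Terry / logistic loss).
    margin = critic(expert_emb) - critic(policy_emb)
    return F.softplus(-margin).mean()  # softplus(-m) == -log sigmoid(m)

def policy_reward(critic, expert_emb, policy_emb):
    # Reward the policy for closing the gap with the expert: the harder the
    # critic finds it to prefer the expert, the higher the reward.
    with torch.no_grad():
        return torch.sigmoid(critic(policy_emb) - critic(expert_emb))
```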
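
Below is a toy, end‑to‑end runnable skeleton of the adversarial loop with the stabilization tricks (reward normalization, an entropy bonus, and a replay buffer of past policy outputs). A tiny categorical policy and random vectors stand in for the LLM and its answer embeddings, a plain REINFORCE update stands in for PPO, and the critic comes from the sketch above; treat it as an illustration of the structure, not the paper’s implementation.

```python
# Toy skeleton of the adversarial RL loop with the stabilization tricks.
# Stand-ins: a tiny categorical policy for the LLM, random vectors for
# embeddings, REINFORCE for PPO. Reuses RelativisticCritic / critic_loss /
# policy_reward from the previous sketch.
import collections
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, VOCAB, BATCH = 32, 16, 8

class ToyPolicy(nn.Module):
    """Stand-in for the LLM generator: maps a prompt embedding to token logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(DIM, VOCAB)

    def forward(self, prompt_emb):
        return self.net(prompt_emb)  # (batch, VOCAB) logits for a single "action"

policy = ToyPolicy()
critic = RelativisticCritic(DIM)
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
replay = collections.deque(maxlen=512)  # past policy outputs for critic training

for step in range(100):
    prompts = torch.randn(BATCH, DIM)     # stand-in prompt embeddings
    expert_emb = torch.randn(BATCH, DIM)  # stand-in expert-answer embeddings

    # Policy rollout (one "token" per prompt in this toy version).
    logits = policy(prompts)
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    policy_emb = F.one_hot(action, VOCAB).float() @ torch.randn(VOCAB, DIM)

    # Reward from the relativistic critic, then normalization.
    reward = policy_reward(critic, expert_emb, policy_emb)
    reward = (reward - reward.mean()) / (reward.std() + 1e-6)

    # Policy update: REINFORCE with an entropy bonus (the paper uses PPO).
    pg_loss = -(reward * dist.log_prob(action)).mean()
    loss_pi = pg_loss - 0.01 * dist.entropy().mean()
    opt_pi.zero_grad()
    loss_pi.backward()
    opt_pi.step()

    # Critic update on fresh pairs plus randomly replayed policy outputs.
    replay.extend(policy_emb)
    old = torch.stack(random.choices(list(replay), k=BATCH))
    loss_c = critic_loss(critic, expert_emb, policy_emb) + critic_loss(critic, expert_emb, old)
    opt_c.zero_grad()
    loss_c.backward()
    opt_c.step()
```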

Results & Findings

| Benchmark | Metric | Baseline (verifier‑free) | RARO | Relative improvement |
| --- | --- | --- | --- | --- |
| Countdown (numeric) | Exact match | 62.4 % | 78.9 % | +26 % |
| DeepMath (theorem proving) | Problems solved | 41.1 % | 57.3 % | +39 % |
| Poetry Writing | BLEU‑4 | 21.7 | 34.5 | +59 % |
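
For reference, the relative‑improvement column is simply the gain over the verifier‑free baseline expressed as a fraction of the baseline:

```latex
\text{relative improvement} = \frac{\text{RARO} - \text{baseline}}{\text{baseline}},
\qquad \text{e.g.}\quad \frac{78.9 - 62.4}{62.4} \approx 0.26 \;(+26\%).
```
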
  • Consistent scaling: When model size was roughly doubled (e.g., from 7B to 13B parameters), RARO’s performance gains also roughly doubled, mirroring trends seen in verifier‑based RL.
  • Ablation studies: Removing the relativistic component or the reward clipping caused training to diverge in under 10 % of runs, underscoring the value of the proposed stabilization techniques.
  • Qualitative analysis: Generated solutions on Countdown exhibit step‑by‑step reasoning comparable to textbook solutions, and poetry samples show richer metaphorical structure than baseline generations.

Practical Implications

  • Deployable reasoning agents: Companies can now fine‑tune LLMs for domains where correctness is hard to auto‑verify (e.g., legal reasoning, scientific hypothesis generation) using only curated expert examples.
  • Reduced engineering overhead: No need to build and maintain task‑specific verifiers, which often require domain experts and continuous updates.
  • Data efficiency: Existing corpora of solved problems, code reviews, or editorial drafts become directly usable for RL‑style reasoning training.
  • Integration with existing pipelines: RARO can be wrapped around any decoder‑only LLM and combined with standard fine‑tuning, making it a drop‑in upgrade for teams already using RLHF.
  • Safety & alignment: By learning from human‑approved demonstrations rather than a proxy verifier, the model’s reasoning aligns more closely with expert intent, potentially reducing hallucinations in high‑stakes applications.

Limitations & Future Work

  • Demonstration quality dependence: The approach assumes that the demonstration set is both high‑quality and representative; noisy or biased demonstrations can mislead the critic.
  • Computational cost: Joint adversarial training still incurs the overhead of RL (multiple rollouts, critic updates), which may be prohibitive for very large models without specialized hardware.
  • Generalization to unseen domains: While scaling trends are promising, the paper notes a drop in performance when the test prompts differ substantially from the demo distribution, suggesting a need for domain‑adaptation strategies.
  • Future directions: The authors propose extending RARO to multi‑modal reasoning (e.g., code + diagram), exploring curriculum learning that automatically adapts to domain shifts, and investigating more efficient critic architectures.