[Paper] Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning

Published: December 18, 2025 at 01:59 PM EST
3 min read
Source: arXiv - 2512.16917v1

Overview

The paper introduces Generative Adversarial Reasoner (GAR), a novel training framework that pairs a large language model (LLM) acting as a “reasoner” with another LLM acting as a “discriminator.” By letting the two models compete and cooperate through adversarial reinforcement learning, GAR supplies dense, step‑level feedback that dramatically improves the logical consistency and accuracy of LLM‑generated mathematical reasoning.

Key Contributions

  • Joint adversarial training of a reasoning LLM and a discriminator LLM, providing on‑policy, fine‑grained rewards for each reasoning step.
  • Compute‑efficient review schedule that splits a reasoning chain into equally sized, logically complete slices, enabling the discriminator to evaluate each slice with concise, structured justifications.
  • Dense reward signal that complements the usual sparse exact‑match reward, improving credit assignment and sample efficiency during RL fine‑tuning (see the reward‑shaping sketch after this list).
  • Empirical gains on hard math benchmarks (e.g., AIME‑24) that surpass strong baselines by up to +10 absolute points.
  • Modular discriminator design that can be re‑used for other objectives such as teacher distillation, preference alignment, or proof‑style reasoning.
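
To make the dense-plus-sparse reward idea concrete, here is a minimal reward-shaping sketch in Python. The equal weighting, the `shaped_reward` name, and the boolean slice judgments are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal reward-shaping sketch: blend the sparse exact-match reward with
# dense, per-slice discriminator judgments. Weighting and names are
# illustrative assumptions, not the paper's exact recipe.
from typing import List


def shaped_reward(slice_judgments: List[bool],
                  final_answer_correct: bool,
                  dense_weight: float = 0.5) -> float:
    """Combine a sparse outcome reward with dense slice-level feedback."""
    sparse = 1.0 if final_answer_correct else 0.0                 # exact-match signal
    dense = sum(slice_judgments) / max(len(slice_judgments), 1)   # fraction of valid slices
    return (1.0 - dense_weight) * sparse + dense_weight * dense


# Example: three of four slices judged valid and the final answer is correct.
print(shaped_reward([True, True, False, True], True))  # 0.875
```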

Methodology

  1. Reasoner LLM generates a multi‑step solution to a problem (e.g., a math question).
  2. The solution is segmented into “slices” of comparable length (e.g., 2–3 inference steps) using a deterministic schedule that ensures each slice forms a self‑contained logical unit.
  3. Discriminator LLM receives each slice and produces a short justification plus a binary judgment: valid vs. invalid.
  4. Adversarial RL loop:
    • The reasoner receives a reward for each slice that the discriminator marks as valid and that ultimately leads to the correct final answer.
    • The discriminator receives a reward for correctly spotting errors or confirming correct slices.
  5. The two models are updated on‑policy (i.e., using the current policy’s own outputs), which yields dense, step‑level feedback rather than waiting for the final answer.
  6. Standard RL techniques (e.g., PPO) are applied, but the reward shaping is now much richer thanks to the discriminator’s judgments; a minimal sketch of the loop follows.
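
To make steps 1–6 concrete, the sketch below shows a single GAR-style training step. The `reasoner`/`discriminator` interfaces, the fixed three-step slicing heuristic, the 0/1 rewards, and the use of final-answer correctness as a proxy label for slice validity are all assumptions made for readability, not the authors' exact implementation.

```python
# Illustrative skeleton of the GAR training loop (steps 1-6 above).
# Model interfaces, slicing heuristic, and reward values are assumptions.
from typing import List


def segment_into_slices(steps: List[str], steps_per_slice: int = 3) -> List[List[str]]:
    """Deterministic schedule: group consecutive reasoning steps into slices."""
    return [steps[i:i + steps_per_slice] for i in range(0, len(steps), steps_per_slice)]


def gar_training_step(problem, answer_key, reasoner, discriminator):
    # 1-2. Reasoner produces a multi-step solution, which is split into slices.
    steps = reasoner.generate_steps(problem)           # hypothetical interface
    slices = segment_into_slices(steps)

    reasoner_rewards, discriminator_rewards = [], []
    final_correct = reasoner.extract_answer(steps) == answer_key

    for sl in slices:
        # 3. Discriminator returns a binary judgment plus a short justification.
        judgment, justification = discriminator.judge(problem, sl)

        # 4. Adversarial, slice-level credit assignment.
        reasoner_rewards.append(1.0 if (judgment and final_correct) else 0.0)
        proxy_valid = final_correct                    # simplification: proxy validity label
        discriminator_rewards.append(1.0 if judgment == proxy_valid else 0.0)

    # 5-6. Both policies are updated on-policy with a PPO-style objective.
    reasoner.ppo_update(slices, reasoner_rewards)
    discriminator.ppo_update(slices, discriminator_rewards)
```

A real implementation would replace the proxy validity label with whatever step-level supervision the adversarial setup provides, and the `ppo_update` calls with the actual policy-gradient machinery.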

Results & Findings

Model (baseline)              | AIME‑24 Score | GAR‑enhanced Score | Δ
DeepSeek‑R1‑Distill‑Qwen‑7B   | 54.0          | 61.3               | +7.3
DeepSeek‑R1‑Distill‑Llama‑8B  | 43.7          | 53.7               | +10.0
  • Across several other math datasets (e.g., GSM‑8K, MATH), GAR consistently outperformed strong RL‑fine‑tuned baselines.
  • Ablation studies showed that slice‑level rewards contributed the most to performance gains, confirming the importance of dense feedback.
  • The discriminator remained lightweight (≈0.5 B parameters) yet achieved high detection accuracy, indicating that a full‑scale LLM is not required for the adversarial role.

Practical Implications

  • Better debugging tools: The discriminator’s structured justifications can be exposed to developers as a “reasoning audit” that pinpoints exactly where a model went wrong.
  • Higher‑quality code generation: By treating each line or block of generated code as a slice, GAR can be adapted to catch logical bugs early, improving the reliability of LLM‑assisted programming assistants (see the sketch after this list).
  • Efficient fine‑tuning: Dense rewards reduce the number of samples needed to achieve a target accuracy, cutting compute costs for organizations that fine‑tune proprietary LLMs.
  • Customizable reward shaping: Because the discriminator is modular, teams can plug in domain‑specific criteria (e.g., security constraints, style guides) without retraining the whole reasoner from scratch.
  • Teacher‑student distillation: A high‑performing discriminator can serve as a “teacher” that guides smaller student models toward more sound reasoning, enabling lightweight deployment.
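
As a concrete illustration of the code-generation adaptation above, the sketch below treats fixed-size blocks of generated code as slices and collects a per-block verdict plus justification. The block size and the `judge_code_slice` callable (e.g., an LLM-backed checker) are hypothetical.

```python
# Hypothetical adaptation of GAR-style slicing to generated code: each
# contiguous block of lines becomes a slice the discriminator can audit.
from typing import Callable, List, Tuple


def audit_generated_code(source: str,
                         judge_code_slice: Callable[[str], Tuple[bool, str]],
                         lines_per_slice: int = 5) -> List[Tuple[str, bool, str]]:
    """Split code into fixed-size line blocks and collect per-block verdicts."""
    lines = source.splitlines()
    report = []
    for i in range(0, len(lines), lines_per_slice):
        block = "\n".join(lines[i:i + lines_per_slice])
        verdict, justification = judge_code_slice(block)   # e.g., an LLM-backed check
        report.append((block, verdict, justification))
    return report
```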

Limitations & Future Work

  • Dependence on slice quality: The current schedule assumes that logical steps can be neatly partitioned; highly interdependent reasoning may suffer from fragmented evaluation.
  • Discriminator capacity: While lightweight, the discriminator can still misclassify subtle errors, propagating noisy rewards to the reasoner.
  • Domain transfer: Experiments focus on mathematical reasoning; applying GAR to natural‑language tasks (e.g., commonsense reasoning) may require redesigning slice definitions and justification formats.
  • Scalability to very large models: Training two LLMs jointly roughly doubles the memory footprint; future work could explore parameter sharing or knowledge‑distillation tricks to mitigate this.

The authors suggest exploring adaptive slice lengths, multi‑modal discriminators (e.g., code + execution traces), and integrating human‑in‑the‑loop feedback to further tighten the adversarial loop.

Authors

  • Qihao Liu
  • Luoxin Ye
  • Wufei Ma
  • Yu-Cheng Chou
  • Alan Yuille

Paper Information

  • arXiv ID: 2512.16917v1
  • Categories: cs.AI, cs.CL, cs.LG
  • Published: December 18, 2025