[Paper] Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Source: arXiv - 2512.16917v1
Overview
The paper introduces Generative Adversarial Reasoner (GAR), a novel training framework that pairs a large language model (LLM) acting as a “reasoner” with another LLM acting as a “discriminator.” By letting the two models compete and cooperate through adversarial reinforcement learning, GAR supplies dense, step‑level feedback that dramatically improves the logical consistency and accuracy of LLM‑generated mathematical reasoning.
Key Contributions
- Joint adversarial training of a reasoning LLM and a discriminator LLM, providing on‑policy, fine‑grained rewards for each reasoning step.
- Compute‑efficient review schedule that splits a reasoning chain into equally sized, logically complete slices, enabling the discriminator to evaluate each slice with concise, structured justifications.
- Dense reward signal that complements the usual sparse exact‑match reward, improving credit assignment and sample efficiency during RL fine‑tuning (one illustrative composition is written out after this list).
- Empirical gains on hard math benchmarks (e.g., AIME‑24) that surpass strong baselines by up to +10 absolute points.
- Modular discriminator design that can be re‑used for other objectives such as teacher distillation, preference alignment, or proof‑style reasoning.
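One illustrative way to write the combined training signal for a solution y split into slices s_1, …, s_K is sketched below; the weight λ, the 1/K scaling, and the exact gating are assumptions for exposition, not the paper's stated formula.

```latex
% Illustrative composition only: \lambda, the 1/K scaling, and the gating on
% final-answer correctness are assumptions, not the paper's stated formula.
r_k \;=\; \mathbb{1}\!\left[\hat{v}_k = \mathrm{valid}\right] \cdot r_{\mathrm{final}}(y),
\qquad
R(y) \;=\; r_{\mathrm{final}}(y) \;+\; \frac{\lambda}{K} \sum_{k=1}^{K} r_k
```

Here \hat{v}_k is the discriminator's judgment on slice s_k and r_final(y) is the 0/1 exact‑match check on the final answer, so a slice earns reward only when it is judged valid and the chain ends in a correct answer, consistent with the reward description in the Methodology section below.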
Methodology
- Reasoner LLM generates a multi‑step solution to a problem (e.g., a math question).
- The solution is segmented into “slices” of comparable length (e.g., 2–3 inference steps) using a deterministic schedule that ensures each slice forms a self‑contained logical unit (see the slicing sketch after this list).
- Discriminator LLM receives each slice and produces a short justification plus a binary judgment: valid vs. invalid.
- Adversarial RL loop (a simplified reward-assignment sketch follows this list):
  - The reasoner receives a reward for each slice that the discriminator marks as valid and that ultimately leads to the correct final answer.
  - The discriminator receives a reward for correctly spotting erroneous slices or confirming correct ones.
  - The two models are updated on‑policy (i.e., using the current policy’s own outputs), which yields dense, step‑level feedback rather than a single signal delivered only at the final answer.
- Standard RL techniques (e.g., PPO) are applied, but the reward shaping is now much richer thanks to the discriminator’s judgments.
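A minimal sketch of a deterministic slicing schedule of this kind is shown below; the blank-line step delimiter, the slice size, the prompt wording, and the function names are illustrative assumptions, and the paper's schedule additionally guarantees that each slice is a logically complete unit.

```python
from typing import List


def split_into_slices(solution: str, steps_per_slice: int = 3) -> List[str]:
    """Split a multi-step solution into contiguous slices of comparable size.

    Illustrative assumption: reasoning steps are separated by blank lines.
    A real schedule would also respect logical boundaries between steps.
    """
    steps = [s.strip() for s in solution.split("\n\n") if s.strip()]
    return [
        "\n\n".join(steps[i : i + steps_per_slice])
        for i in range(0, len(steps), steps_per_slice)
    ]


def build_review_prompt(problem: str, slice_text: str) -> str:
    """Format one slice for the discriminator: ask for a short justification
    followed by a binary valid/invalid judgment (wording is illustrative)."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Candidate reasoning slice:\n{slice_text}\n\n"
        "Briefly justify whether this slice is logically valid, "
        "then answer with exactly one word: VALID or INVALID."
    )
```

Because each slice contains a fixed number of steps, the discriminator's per-call cost stays roughly constant, which is what makes the review schedule compute-efficient.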
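The reward assignment referenced in the adversarial RL loop above can be sketched as follows; the 0/1 reward values, the use of reference slice labels for the discriminator, and all names here are assumptions rather than the paper's exact design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SliceJudgment:
    valid: bool          # discriminator's binary judgment for this slice
    justification: str   # its short, structured justification


def reasoner_slice_rewards(
    judgments: List[SliceJudgment], final_answer_correct: bool
) -> List[float]:
    """Dense per-slice rewards for the reasoner: a slice is rewarded only if the
    discriminator marks it valid AND the chain reaches the correct final answer."""
    return [1.0 if (j.valid and final_answer_correct) else 0.0 for j in judgments]


def discriminator_rewards(
    judgments: List[SliceJudgment], reference_labels: List[bool]
) -> List[float]:
    """Rewards for the discriminator: +1 when its judgment agrees with a reference
    label for the slice. How such labels are obtained (e.g., from the final-answer
    outcome or an external verifier) is an assumption here, not specified above."""
    return [1.0 if j.valid == ref else 0.0 for j, ref in zip(judgments, reference_labels)]


# Example: three slices, the second flagged as invalid by the discriminator.
judgments = [
    SliceJudgment(True, "Algebra in steps 1-2 checks out."),
    SliceJudgment(False, "Sign error when expanding the square."),
    SliceJudgment(True, "Final substitution is consistent."),
]
print(reasoner_slice_rewards(judgments, final_answer_correct=False))
# -> [0.0, 0.0, 0.0]  (no slice reward without a correct final answer)
print(discriminator_rewards(judgments, reference_labels=[True, False, True]))
# -> [1.0, 1.0, 1.0]
```

In an on-policy PPO loop, these per-slice rewards would be attached to the tokens of the corresponding slice, giving the reasoner step-level credit assignment instead of a single end-of-sequence signal.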
Results & Findings
| Model (baseline) | AIME‑24 Score | GAR‑enhanced Score | Δ |
|---|---|---|---|
| DeepSeek‑R1‑Distill‑Qwen‑7B | 54.0 | 61.3 | +7.3 |
| DeepSeek‑R1‑Distill‑Llama‑8B | 43.7 | 53.7 | +10.0 |
- Across several other math datasets (e.g., GSM8K, MATH), GAR consistently outperformed strong RL‑fine‑tuned baselines.
- Ablation studies showed that slice‑level rewards contributed the most to performance gains, confirming the importance of dense feedback.
- The discriminator remained lightweight (≈0.5 B parameters) yet achieved high detection accuracy, indicating that a full‑scale LLM is not required for the adversarial role.
Practical Implications
- Better debugging tools: The discriminator’s structured justifications can be exposed to developers as a “reasoning audit” that pinpoints exactly where a model went wrong.
- Higher‑quality code generation: By treating each line or block of generated code as a slice, GAR can be adapted to catch logical bugs early, improving the reliability of LLM‑assisted programming assistants (a hypothetical sketch follows this list).
- Efficient fine‑tuning: Dense rewards reduce the number of samples needed to achieve a target accuracy, cutting compute costs for organizations that fine‑tune proprietary LLMs.
- Customizable reward shaping: Because the discriminator is modular, teams can plug in domain‑specific criteria (e.g., security constraints, style guides) without retraining the whole reasoner from scratch.
- Teacher‑student distillation: A high‑performing discriminator can serve as a “teacher” that guides smaller student models toward more sound reasoning, enabling lightweight deployment.
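As a hypothetical illustration of the code-generation adaptation mentioned above, the block boundaries, prompt wording, and function names below are all assumptions, not from the paper.

```python
from typing import List


def split_code_into_blocks(code: str) -> List[str]:
    """Treat blank-line-separated blocks of generated code as 'slices' to review."""
    return [block for block in code.split("\n\n") if block.strip()]


def build_code_review_prompt(task: str, block: str) -> str:
    """Ask a discriminator-style reviewer for a short justification and a binary
    verdict on a single code block (wording is illustrative)."""
    return (
        f"Task:\n{task}\n\n"
        f"Generated code block:\n{block}\n\n"
        "Briefly explain whether this block is logically correct for the task, "
        "then answer with exactly one word: VALID or INVALID."
    )
```

Flagging an invalid block as soon as it is generated, rather than after the whole program is assembled and run, is what catching logical bugs early would look like in this setting.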
Limitations & Future Work
- Dependence on slice quality: The current schedule assumes that logical steps can be neatly partitioned; highly interdependent reasoning may suffer from fragmented evaluation.
- Discriminator capacity: While lightweight, the discriminator can still misclassify subtle errors, propagating noisy rewards to the reasoner.
- Domain transfer: Experiments focus on mathematical reasoning; applying GAR to natural‑language tasks (e.g., commonsense reasoning) may require redesigning slice definitions and justification formats.
- Scalability to very large models: Training a discriminator alongside the reasoner increases the memory footprint; future work could explore parameter‑sharing or knowledge‑distillation tricks to mitigate this.
The authors suggest exploring adaptive slice lengths, multi‑modal discriminators (e.g., code + execution traces), and integrating human‑in‑the‑loop feedback to further tighten the adversarial loop.
Authors
- Qihao Liu
- Luoxin Ye
- Wufei Ma
- Yu-Cheng Chou
- Alan Yuille
Paper Information
- arXiv ID: 2512.16917v1
- Categories: cs.AI, cs.CL, cs.LG
- Published: December 18, 2025
- PDF: https://arxiv.org/pdf/2512.16917v1