[Paper] iGRPO: Self-Feedback-Driven LLM Reasoning
Source: arXiv - 2602.09000v1
Overview
The paper introduces iGRPO, a new reinforcement‑learning (RL) recipe that lets large language models (LLMs) use their own outputs as feedback while solving math problems. By generating a handful of candidate drafts, selecting the most promising one, and then refining it in a second pass, iGRPO consistently pushes the model beyond its best prior attempt and sets new state‑of‑the‑art scores on challenging AIME benchmarks.
Key Contributions
- Iterative GRPO (iGRPO): Extends Group Relative Policy Optimization (GRPO) with a two‑stage self‑feedback loop that conditions refinements on the model’s own highest‑reward draft.
- Self‑conditioning via drafts: Uses the same scalar reward (e.g., correctness of the final answer) both to select the best draft and to guide the refinement policy, eliminating the need for a separate value function.
- Empirical gains across models: Demonstrates consistent improvements over vanilla GRPO on Nemotron‑H‑8B‑Base‑8K, DeepSeek‑R1 Distilled, and other 7‑10 B‑scale LLMs under identical rollout budgets.
- State‑of‑the‑art math reasoning: Achieves 85.62 % on AIME‑24 and 79.64 % on AIME‑25 with OpenReasoning‑Nemotron‑7B, surpassing all previously reported results.
- Ablation insights: Shows that the refinement wrapper works with other GRPO variants, benefits from a generative “judge” model, and mitigates premature entropy collapse during training.
Methodology
Stage 1 – Draft Generation
- The policy samples N exploratory completions (drafts) for a given math prompt.
- Each draft is scored with a single scalar reward (e.g., 1 for a correct final answer, 0 otherwise).
- The draft with the highest reward is selected as the best draft.
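A minimal sketch of Stage 1 under the assumptions above (a binary exact‑match reward and N sampled drafts); `sample_completion` is a hypothetical stand‑in for decoding from the policy, not an API named in the paper:

```python
import random
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    reward: float

def sample_completion(prompt: str) -> str:
    """Stand-in for a sampled decoding from the current policy LLM (hypothetical)."""
    return f"... reasoning ... Final answer: {random.choice(['41', '42', '43'])}"

def binary_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the final answer matches the reference, 0.0 otherwise."""
    return 1.0 if completion.strip().endswith(gold_answer) else 0.0

def generate_and_select(prompt: str, gold_answer: str, n_drafts: int = 8) -> tuple[list[Draft], Draft]:
    """Stage 1: sample N drafts, score each one, keep the highest-reward draft."""
    drafts = [Draft(text=sample_completion(prompt), reward=0.0) for _ in range(n_drafts)]
    for d in drafts:
        d.reward = binary_reward(d.text, gold_answer)
    best = max(drafts, key=lambda d: d.reward)  # ties broken arbitrarily
    return drafts, best

drafts, best_draft = generate_and_select("What is 6 * 7?", gold_answer="42")
print(best_draft.reward, best_draft.text)
```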
Stage 2 – Draft‑Conditioned Refinement
- The original prompt is concatenated with the best draft, forming a draft‑conditioned context.
- The policy generates a refined answer conditioned on this context.
- A GRPO‑style update is applied: the probability ratio between the updated policy and the sampling (old) policy on each refined answer is weighted by the same reward signal, normalized relative to the group of sampled completions (hence “group‑relative”); see the sketch after this list.
- No value function is learned; the algorithm relies purely on reward‑based policy gradients.
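A sketch of what such an update could look like under common GRPO conventions (rewards standardized within the group, a PPO‑style clipped ratio, no critic); the clipping constant and exact normalization are assumptions for illustration, not values taken from the paper:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize rewards within the group of completions sampled for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_surrogate_loss(new_logprobs: torch.Tensor,
                        old_logprobs: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient objective; one sequence-level log-prob per completion."""
    ratio = torch.exp(new_logprobs - old_logprobs)            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # maximize reward -> minimize negative

# Toy example: 4 refinements sampled from one draft-conditioned context.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
adv = group_relative_advantages(rewards)
old_lp = torch.tensor([-12.0, -15.0, -11.5, -14.0])
new_lp = old_lp + torch.tensor([0.1, -0.2, 0.3, 0.0])        # pretend the policy shifted slightly
print(grpo_surrogate_loss(new_lp, old_lp, adv).item())
```

Note that only rewards and log-probabilities enter the loss, which is why no value network is needed.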
Training Loop
- The two stages are repeated iteratively, allowing the model to continuously improve beyond its own strongest prior attempts.
- Entropy regularization is delayed, preventing the policy from collapsing to a deterministic output too early.
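A sketch of how the delayed entropy term and the outer loop might fit together; the switch-on step and weight below are illustrative placeholders, not the paper's settings:

```python
def entropy_coefficient(step: int, start_step: int = 500, coef: float = 0.01) -> float:
    """Delayed entropy bonus: off for the first `start_step` updates, then a small
    constant weight. Both numbers are illustrative assumptions, not paper values."""
    return coef if step >= start_step else 0.0

# Sketch of the outer loop (reusing generate_and_select from the Stage 1 sketch):
#   for step, (prompt, answer) in enumerate(batches):
#       drafts, best = generate_and_select(prompt, answer)               # Stage 1
#       refine_ctx = prompt + "\n\nBest draft so far:\n" + best.text     # draft-conditioned context
#       ... sample refinements from refine_ctx, then apply the GRPO-style
#           loss plus entropy_coefficient(step) * entropy ...

print(entropy_coefficient(0), entropy_coefficient(1000))   # -> 0.0 0.01
```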
Results & Findings
| Model (base) | Benchmark | iGRPO Score | GRPO Baseline | Δ (pp) |
|---|---|---|---|---|
| Nemotron‑H‑8B‑Base‑8K | AIME‑24 | 85.62 % | 78.1 % | +7.5 |
| DeepSeek‑R1 Distilled | AIME‑25 | 79.64 % | 71.3 % | +8.3 |
| OpenReasoning‑Nemotron‑7B (trained on AceReason‑Math) | AIME‑24/25 | 85.62 % / 79.64 % | — | — |
- Consistent outperformance: Across all tested models, iGRPO beats GRPO with the same number of rollouts, confirming that the self‑feedback loop adds real value.
- Entropy dynamics: Delaying entropy loss keeps exploration alive longer, which the authors link to higher final accuracy.
- Generative judge advantage: Replacing a hand‑crafted reward with a small generative model that judges draft quality yields modest but consistent gains.
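One plausible way to wire in such a generative judge, with a hypothetical `judge_model` stub; the paper's actual judge prompt and scoring scale are not reproduced here:

```python
import re

def judge_model(prompt: str) -> str:
    """Stand-in for a small generative judge LLM (hypothetical placeholder)."""
    return "Score: 7/10. The derivation is sound but the last step is unjustified."

def judge_reward(problem: str, draft: str) -> float:
    """Ask a judge model to grade a draft, then map its verdict to a scalar in [0, 1]."""
    verdict = judge_model(
        f"Problem:\n{problem}\n\nCandidate solution:\n{draft}\n\n"
        "Rate the solution from 0 to 10 as 'Score: X/10'."
    )
    match = re.search(r"Score:\s*(\d+)\s*/\s*10", verdict)
    return int(match.group(1)) / 10.0 if match else 0.0

print(judge_reward("What is 6 * 7?", "6 * 7 = 42"))   # -> 0.7 with the stub above
```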
Practical Implications
- Better math‑oriented assistants: Developers building tutoring bots, automated proof assistants, or scientific calculators can plug iGRPO into existing LLM pipelines to obtain more reliable step‑by‑step reasoning without redesigning the reward model.
- Cost‑effective RL: Because iGRPO avoids training a separate value network, it reduces memory and compute overhead, making RL‑fine‑tuning feasible on commodity GPUs for 7‑10 B‑scale models.
- Generalizable self‑feedback: The draft‑conditioning idea can be transplanted to other domains—code generation, data‑to‑text, or dialogue—where a model can first produce a rough draft and then be asked to improve it using its own output as context.
- Simplified pipeline: The same scalar reward used for evaluation (e.g., test‑case pass/fail) can drive both draft selection and policy updates, streamlining the engineering stack.
Limitations & Future Work
- Reward granularity: The current setup relies on a binary correctness signal; richer, graded rewards (partial credit) could further boost learning but were not explored.
- Scalability of drafts: Sampling many drafts improves selection quality but linearly increases inference cost; adaptive draft budgeting is an open question.
- Domain transfer: While math reasoning benefits from precise correctness checks, it remains unclear how well iGRPO transfers to tasks with more subjective rewards (e.g., creative writing).
- Long‑term consistency: The method focuses on single‑prompt refinement; extending the iterative loop across multi‑turn conversations or multi‑step problem chains is a promising direction.
Bottom line: iGRPO shows that a simple “self‑feedback” loop—generate, pick the best, refine—can turn existing LLMs into stronger reasoners with modest RL machinery, opening the door for more dependable AI assistants in math, code, and beyond.
Authors
- Ali Hatamizadeh
- Shrimai Prabhumoye
- Igor Gitman
- Ximing Lu
- Seungju Han
- Wei Ping
- Yejin Choi
- Jan Kautz
Paper Information
- arXiv ID: 2602.09000v1
- Categories: cs.AI
- Published: February 9, 2026