[Paper] Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

Published: May 7, 2026 at 01:55 PM EDT
Source: arXiv - 2605.06650v1

Overview

The paper introduces Positive‑Only Policy Optimization (POPO), a new reinforcement‑learning‑with‑verifiable‑rewards (RLVR) technique for fine‑tuning large language models (LLMs) on reasoning tasks. By discarding negative rollouts entirely and relying only on “good” samples, POPO simplifies the training loop while still delivering performance that matches or exceeds the current state‑of‑the‑art Group Relative Policy Optimization (GRPO).

Key Contributions

  • Positive‑only learning framework – eliminates the need for explicit negative rollouts, using bounded importance sampling over the set of successful trajectories.
  • Implicit negative gradients – demonstrates that penalizing bad behavior can emerge naturally from reinforcing positive probabilities, removing the need for a separate loss term.
  • Siamese policy network with momentum adaptation – stabilizes policy updates by keeping a slowly‑moving copy of the policy and aligning them in a shared representation space.
  • Bounded similarity penalty – replaces the conventional KL‑divergence with a tractable similarity term that works directly on the siamese embeddings.
  • Empirical validation on math benchmarks – POPO attains 36.67 % on AIME 2025 with Qwen‑Math‑7B versus 30.00 % for GRPO, and matches or slightly exceeds GRPO on AIME 2024 and AIME 2023.
  • Extensive ablations – confirm that each component (importance‑sampling bound, siamese architecture, momentum update) contributes to robustness and final accuracy.

Methodology

  1. Rollout collection – During each training iteration the policy generates a batch of completions. Only those that satisfy a deterministic verifier (e.g., a correct answer to a math problem) are kept as positive rollouts.
  2. Bounded importance sampling – The probability of each positive rollout under the current policy is re‑weighted by a capped importance‑sampling ratio, preventing extreme variance while still correcting for distribution shift.
  3. Siamese architecture – Two copies of the policy network are maintained: the online policy that is being updated, and a target policy that evolves slowly via a momentum rule (θ_target ← τ·θ_target + (1‑τ)·θ_online). Both share the same encoder but have separate heads.
  4. Similarity penalty – Instead of KL‑divergence, a bounded dissimilarity between the online and target embeddings (e.g., one based on cosine similarity and clipped to a maximum value) is added to the loss, encouraging smooth policy changes.
  5. Optimization – The final loss combines the (positive‑only) policy‑gradient term with the similarity penalty. Gradient descent updates the online policy; the target policy follows automatically via momentum.
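
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the rollout fields (`answer`, `logp_online`, `logp_behavior`), the cap of 2.0, the weight `beta`, and the use of one minus cosine similarity as the bounded penalty are all hypothetical, and the paper's exact loss may differ.

```python
import math

def keep_positive(rollouts, verifier):
    """Step 1: keep only rollouts the verifier accepts."""
    return [r for r in rollouts if verifier(r)]

def capped_ratio(logp_online, logp_behavior, cap=2.0):
    """Step 2: importance ratio pi_online / pi_behavior, capped at `cap` (assumed value)."""
    return min(math.exp(logp_online - logp_behavior), cap)

def momentum_update(target, online, tau=0.99):
    """Step 3: theta_target <- tau * theta_target + (1 - tau) * theta_online."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target, online)]

def similarity_penalty(emb_online, emb_target, max_penalty=0.5):
    """Step 4 (assumed form): 1 - cosine similarity, clipped to [0, max_penalty]."""
    dot = sum(a * b for a, b in zip(emb_online, emb_target))
    norm = math.sqrt(sum(a * a for a in emb_online)) * math.sqrt(
        sum(b * b for b in emb_target)
    )
    cosine = dot / norm if norm > 0 else 0.0
    return min(max(1.0 - cosine, 0.0), max_penalty)

def popo_style_loss(positives, emb_online, emb_target, beta=0.1, cap=2.0):
    """Step 5: positive-only surrogate plus the weighted similarity penalty."""
    penalty = beta * similarity_penalty(emb_online, emb_target)
    if not positives:
        return penalty  # no verified rollouts: only the penalty remains
    surrogate = -sum(
        capped_ratio(r["logp_online"], r["logp_behavior"], cap) for r in positives
    ) / len(positives)
    return surrogate + penalty

# Toy batch: two completions for one problem, only the first is correct.
rollouts = [
    {"answer": "42", "logp_online": -1.0, "logp_behavior": -1.2},
    {"answer": "41", "logp_online": -0.8, "logp_behavior": -0.9},
]
positives = keep_positive(rollouts, lambda r: r["answer"] == "42")
loss = popo_style_loss(positives, emb_online=[1.0, 0.1], emb_target=[1.0, 0.0])
print(len(positives), loss)
```

In this sketch, once a positive rollout's ratio reaches the cap it no longer changes the loss, which is one plausible reading of how a "bounded" ratio limits variance. A real implementation would compute these quantities on tensors with automatic differentiation.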

The key insight is that by boosting the probability of successful actions, the algorithm indirectly pushes down the probability of unseen or unsuccessful actions, achieving an effect similar to explicit negative gradients without ever sampling them.
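
A small numerical check makes this effect concrete for the simplest case, a softmax policy over three actions. The sketch below relies on the standard softmax identity, d log p[c] / d logit[j] = 1[j = c] − p[j], and is not the paper's derivation; the function names, the three-action setup, and the learning rate are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ascend_log_prob(logits, chosen, lr=0.5):
    """One gradient-ascent step on log softmax(logits)[chosen].

    The gradient w.r.t. logit j is (1 if j == chosen else 0) - p[j],
    so every non-chosen logit receives a negative update of size p[j].
    """
    probs = softmax(logits)
    return [
        z + lr * ((1.0 if j == chosen else 0.0) - probs[j])
        for j, z in enumerate(logits)
    ]

logits = [0.0, 0.0, 0.0]          # uniform policy over three actions
before = softmax(logits)
after = softmax(ascend_log_prob(logits, chosen=0))

# Action 0 was the only one reinforced, yet actions 1 and 2 lose probability.
print("before:", before)
print("after: ", after)
```

Because the probabilities must sum to one, raising the verified action's probability necessarily lowers the others, even though no unsuccessful action was ever sampled or penalized.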

Results & Findings

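| Model (7B) | Benchmark | GRPO (%) | POPO (%) |
|------------|-----------|----------|----------|
| Qwen‑Math  | AIME 2025 | 30.00    | 36.67    |
| Qwen‑Math  | AIME 2024 | 28.4     | 29.1     |
| Qwen‑Math  | AIME 2023 | 27.9     | 27.9     |
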
  • Comparable or superior performance across all difficulty tiers, with the biggest gain on the hardest test (AIME 2025).
  • Stability – training curves show less oscillation and lower variance when using the siamese‑momentum + similarity penalty compared with vanilla PPO/GRPO.
  • Ablation – removing either the importance‑sampling bound or the similarity penalty lowers accuracy by roughly 4–5 percentage points, confirming that both are necessary.
  • Sample efficiency – POPO reaches peak performance with ~20 % fewer environment interactions than GRPO, thanks to the focused positive rollout set.

Practical Implications

  • Simpler pipelines – No need to design or tune advantage estimators for negative samples; developers can plug POPO into existing RLHF‑style fine‑tuning scripts with minimal changes.
  • Reduced compute waste – By discarding negative rollouts early, GPU cycles are spent only on trajectories that actually contribute to learning, lowering training cost for large LLMs.
  • Better handling of sparse binary rewards – Tasks where success is rare (e.g., formal proof generation, code synthesis) benefit from the positive‑only bias, which avoids diluting the learning signal with the many failed rollouts.
  • Safer policy updates – The similarity penalty in representation space offers a more interpretable and bounded notion of policy drift than KL, which can be useful for compliance‑oriented deployments.
  • Potential for other domains – The same idea can be applied to RL for robotics, game AI, or any setting where a deterministic verifier can label successes (e.g., unit tests for code generation).

Limitations & Future Work

  • Dependence on a perfect verifier – POPO assumes a deterministic, noise‑free reward signal; noisy or probabilistic verification could re‑introduce bias.
  • Limited exploration – By focusing solely on positives, the policy may miss novel strategies that initially appear sub‑optimal; hybrid schemes that occasionally sample negatives could mitigate this.
  • Scalability to multi‑modal tasks – The current experiments are confined to text‑based math reasoning; extending to vision‑language or interactive environments remains an open question.
  • Theoretical guarantees – While empirical results are strong, formal convergence proofs for the implicit negative gradient mechanism are not yet provided.

Future research directions include integrating uncertainty‑aware verifiers, combining POPO with curriculum learning to broaden exploration, and testing the framework on large‑scale instruction‑following models beyond the Qwen family.

Authors

  • Mingwei Xu
  • Hao Fang

Paper Information

  • arXiv ID: 2605.06650v1
  • Categories: cs.CL
  • Published: May 7, 2026