[Paper] Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Published: December 18, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.16912v1

Overview

The paper investigates why two seemingly counter‑intuitive tricks—spurious rewards that reward the wrong thing, and entropy minimization that makes language models output overly confident predictions—both improve the reasoning abilities of Large Language Models (LLMs) when they are fine‑tuned with Reinforcement Learning with Verifiable Rewards (RLVR). By dissecting the interaction between policy entropy, clipping bias, and reward mis‑alignment, the authors uncover the hidden dynamics that make these tricks work and propose a more principled way to train LLMs for math‑heavy or logic‑intensive tasks.

Key Contributions

  • Theoretical analysis linking clipping bias under spurious rewards to a systematic reduction in policy entropy.
  • Empirical evidence that entropy reduction by itself does not guarantee better reasoning; the benefit stems from the interaction with spurious rewards.
  • Reward‑misalignment model that explains how spurious rewards can act as a regularizer, preventing the model from over‑fitting to contaminated (incorrect) reward signals.
  • Guidelines for designing RLVR pipelines that deliberately control entropy and reward shaping to achieve more reliable LLM reasoning.
  • Open‑source code & reproducibility package (released alongside the paper) for reproducing the experiments on standard math‑reasoning benchmarks.

Methodology

  1. Setup – The authors use a standard RLVR loop: an LLM generates a solution, a verifier checks correctness, and a reward is assigned. Two reward variants are examined:

    • Ground‑truth reward: 1 for a correct answer, 0 otherwise.
    • Spurious reward: a noisy signal that sometimes rewards incorrect answers (e.g., based on superficial token patterns).
  2. Clipping & Entropy Control – During policy updates, the PPO-style surrogate objective is clipped to stabilize training. The authors vary the clipping threshold and explicitly add an entropy-regularization term to the loss; a simplified version of this objective is sketched after the list.

  3. Metrics

    • Policy entropy (average per‑token entropy across generated sequences).
    • Reasoning accuracy on benchmark datasets (MATH, GSM‑8K, etc.).
    • Clipping bias, measured as the average difference between unclipped and clipped gradient magnitudes (a simplified surrogate-level estimator is sketched after the list).
  4. Experiments – A grid of configurations (different clipping thresholds, entropy coefficients, and reward types) is evaluated on three LLM sizes (7B, 13B, 34B). Each run is repeated three times to account for stochasticity.

  5. Analysis – Correlation and causal inference techniques (e.g., mediation analysis) are used to isolate whether entropy reduction mediates the performance gains observed with spurious rewards.
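
To make the training loop concrete, here is a minimal sketch of the two reward variants and a PPO-style clipped objective with an entropy term, written against PyTorch. The `spurious_reward` heuristic, the advantage handling, and all constants are illustrative assumptions, not the authors' released implementation.

```python
import torch


def ground_truth_reward(answer: str, reference: str) -> float:
    """1.0 if the verifier accepts the answer, 0.0 otherwise."""
    return float(answer.strip() == reference.strip())


def spurious_reward(answer: str) -> float:
    """Hypothetical noisy signal: rewards a superficial token pattern
    (a boxed final answer) regardless of whether the answer is correct."""
    return float("\\boxed" in answer)


def clipped_policy_loss(logp_new, logp_old, advantages,
                        clip_eps: float = 0.2, ent_coef: float = 0.0):
    """PPO-style clipped surrogate plus an optional entropy term.

    logp_new, logp_old: per-token log-probs under the current / behavior policy.
    advantages:         per-token advantages (e.g., the sequence reward
                        broadcast to every token after normalization).
    ent_coef > 0 rewards entropy (exploration); ent_coef < 0 corresponds to
    the entropy-minimization regime discussed in the paper.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Monte-Carlo entropy estimate from the sampled tokens' log-probs.
    entropy = -logp_new.mean()
    return policy_loss - ent_coef * entropy
```

Tightening `clip_eps` corresponds to the "tighter clipping" rows in the results table below.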
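
The two diagnostics listed under Metrics can be computed from the same forward pass. The sketch below is a simplification: the paper measures clipping bias at the level of gradient magnitudes, whereas `clipping_gap` here compares the surrogate terms themselves, which is cheaper to log but only a proxy.

```python
import torch
import torch.nn.functional as F


def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy over a batch of generated sequences.

    logits: (batch, seq_len, vocab_size) output of the policy model.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    return token_entropy.mean()


def clipping_gap(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """Mean absolute gap between unclipped and clipped surrogate terms,
    used as a rough stand-in for the paper's gradient-level clipping bias."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return (unclipped - clipped).abs().mean()
```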

Results & Findings

| Condition | Avg. Entropy ↓ | Reasoning Accuracy ↑ |
| --- | --- | --- |
| Ground-truth reward, no entropy penalty | Baseline | 42% |
| Ground-truth reward + strong entropy regularization | -15% | 44% (no significant gain) |
| Spurious reward, default clipping | -22% | 48% |
| Spurious reward + tighter clipping (lower threshold) | -30% | 52% |
| Spurious reward + explicit entropy minimization | -35% | 53% |

  • Clipping bias grows when the clipping threshold is tightened, which automatically reduces policy entropy under spurious rewards.
  • Entropy regularization alone (without spurious rewards) yields only marginal improvements, confirming that entropy reduction is necessary but not sufficient.
  • The reward‑misalignment model predicts that spurious rewards act like a “soft label” that discourages the model from over‑trusting the verifier’s noisy signal, leading to more robust reasoning. Empirical curves match the model’s predictions.

Practical Implications

  • Fine-tuning pipelines: When applying RLVR to LLMs for math or code generation, deliberately introduce a modest amount of reward noise (e.g., a reward based on partial syntactic checks) and tighten the clipping threshold. This combination yields more deterministic outputs without sacrificing the model's ability to explore alternative solution paths.
  • Entropy regularization: Use entropy penalties sparingly. Aggressive entropy minimization can hurt performance unless paired with spurious rewards.
  • Safety & alignment: Spurious rewards can be viewed as a safety valve that prevents the model from over‑optimizing a potentially flawed verifier, a useful trick when the verification logic is still under development.
  • Tooling: The released code integrates with popular RL libraries (TRL, Hugging Face Transformers) and provides a plug-and-play “RLVR-Clipping-Scheduler” that automatically adjusts clipping thresholds based on observed policy entropy; a toy version of this idea is sketched below.
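
The real scheduler's interface is documented in the authors' release; the toy class below only illustrates the idea of coupling the clip range to measured policy entropy. The target value, step size, and bounds are assumed, not taken from the paper.

```python
class EntropyAwareClipScheduler:
    """Toy scheduler: tighten the PPO clip range while policy entropy is above
    a target (pushing entropy down), relax it when entropy falls below it."""

    def __init__(self, clip_eps: float = 0.2, target_entropy: float = 1.0,
                 step: float = 0.01, min_eps: float = 0.05, max_eps: float = 0.3):
        self.clip_eps = clip_eps
        self.target_entropy = target_entropy
        self.step = step
        self.min_eps = min_eps
        self.max_eps = max_eps

    def update(self, observed_entropy: float) -> float:
        """Call once per training iteration with the current mean token entropy."""
        if observed_entropy > self.target_entropy:
            self.clip_eps -= self.step  # tighter clipping drives entropy down
        else:
            self.clip_eps += self.step  # relax clipping to avoid entropy collapse
        self.clip_eps = min(self.max_eps, max(self.min_eps, self.clip_eps))
        return self.clip_eps
```

Each iteration, the returned `clip_eps` would be passed to the clipped objective used for policy updates.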

Limitations & Future Work

  • Experiments are limited to synthetic math benchmarks; real‑world tasks (e.g., legal reasoning, scientific literature synthesis) may exhibit different verifier noise characteristics.
  • The analysis assumes stationary reward distributions; in practice, verifiers evolve during deployment, which could change the optimal clipping/entropy balance.
  • The reward‑misalignment model currently treats spurious rewards as a simple additive noise term; richer models (e.g., contextual mis‑alignment) are left for future research.
  • Scaling to hundreds of billions of parameters remains untested; the authors hypothesize that the same dynamics hold but plan to validate on next‑generation LLMs.

Bottom line: By demystifying why “bad” rewards and “low‑entropy” policies can both boost LLM reasoning, this work gives developers concrete knobs to turn in RLVR pipelines—tightening clipping and allowing a controlled amount of reward noise—to get more reliable, deterministic, and mathematically capable language models.

Authors

  • Peter Chen
  • Xiaopeng Li
  • Ziniu Li
  • Wotao Yin
  • Xi Chen
  • Tianyi Lin

Paper Information

  • arXiv ID: 2512.16912v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: December 18, 2025
  • PDF: https://arxiv.org/pdf/2512.16912v1