[Paper] Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward

Published: December 18, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.16912v1

Overview

The paper investigates why two seemingly counter‑intuitive tricks—spurious rewards that reward the wrong thing, and entropy minimization that makes language models output overly confident predictions—both improve the reasoning abilities of Large Language Models (LLMs) when they are fine‑tuned with Reinforcement Learning with Verifiable Rewards (RLVR). By dissecting the interaction between policy entropy, clipping bias, and reward mis‑alignment, the authors uncover the hidden dynamics that make these tricks work and propose a more principled way to train LLMs for math‑heavy or logic‑intensive tasks.

Key Contributions

  • Theoretical analysis linking clipping bias under spurious rewards to a systematic reduction in policy entropy.
  • Empirical evidence that entropy reduction by itself does not guarantee better reasoning; the benefit stems from the interaction with spurious rewards.
  • Reward‑misalignment model that explains how spurious rewards can act as a regularizer, preventing the model from over‑fitting to contaminated (incorrect) reward signals.
  • Guidelines for designing RLVR pipelines that deliberately control entropy and reward shaping to achieve more reliable LLM reasoning.
  • Open‑source code & reproducibility package (released alongside the paper) for reproducing the experiments on standard math‑reasoning benchmarks.

Methodology

  1. Setup – The authors use a standard RLVR loop: an LLM generates a solution, a verifier checks correctness, and a reward is assigned. Two reward variants are examined:

    • Ground‑truth reward: 1 for a correct answer, 0 otherwise.
    • Spurious reward: a noisy signal that sometimes rewards incorrect answers (e.g., based on superficial token patterns).
  2. Clipping & Entropy Control – During policy updates, the PPO-style surrogate objective is clipped to stabilize training. The authors vary the clipping threshold and explicitly add an entropy-regularization term to the loss; a simplified version of this objective is sketched after the list.

  3. Metrics

    • Policy entropy (average per‑token entropy across generated sequences).
    • Reasoning accuracy on benchmark datasets (MATH, GSM‑8K, etc.).
    • Clipping bias, measured as the average difference between unclipped and clipped gradient magnitudes (a simplified surrogate-level estimator is sketched after the list).
  4. Experiments – A grid of configurations (different clipping thresholds, entropy coefficients, and reward types) is evaluated on three LLM sizes (7B, 13B, 34B). Each run is repeated three times to account for stochasticity.

  5. Analysis – Correlation and causal inference techniques (e.g., mediation analysis) are used to isolate whether entropy reduction mediates the performance gains observed with spurious rewards.
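
To make the training loop concrete, here is a minimal sketch of the two reward variants and a PPO-style clipped objective with an entropy term, written against PyTorch. The `spurious_reward` heuristic, the advantage handling, and all constants are illustrative assumptions, not the authors' released implementation.

```python
import torch


def ground_truth_reward(answer: str, reference: str) -> float:
    """1.0 if the verifier accepts the answer, 0.0 otherwise."""
    return float(answer.strip() == reference.strip())


def spurious_reward(answer: str) -> float:
    """Hypothetical noisy signal: rewards a superficial token pattern
    (a boxed final answer) regardless of whether the answer is correct."""
    return float("\\boxed" in answer)


def clipped_policy_loss(logp_new, logp_old, advantages,
                        clip_eps: float = 0.2, ent_coef: float = 0.0):
    """PPO-style clipped surrogate plus an optional entropy term.

    logp_new, logp_old: per-token log-probs under the current / behavior policy.
    advantages:         per-token advantages (e.g., the sequence reward
                        broadcast to every token after normalization).
    ent_coef > 0 rewards entropy (exploration); ent_coef < 0 corresponds to
    the entropy-minimization regime discussed in the paper.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Monte-Carlo entropy estimate from the sampled tokens' log-probs.
    entropy = -logp_new.mean()
    return policy_loss - ent_coef * entropy
```

Tightening `clip_eps` corresponds to the "tighter clipping" rows in the results table below.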
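
The two diagnostics listed under Metrics can be computed from the same forward pass. The sketch below is a simplification: the paper measures clipping bias at the level of gradient magnitudes, whereas `clipping_gap` here compares the surrogate terms themselves, which is cheaper to log but only a proxy.

```python
import torch
import torch.nn.functional as F


def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy over a batch of generated sequences.

    logits: (batch, seq_len, vocab_size) output of the policy model.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)
    return token_entropy.mean()


def clipping_gap(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """Mean absolute gap between unclipped and clipped surrogate terms,
    used as a rough stand-in for the paper's gradient-level clipping bias."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return (unclipped - clipped).abs().mean()
```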

Results & Findings

| Condition | Avg. Entropy ↓ | Reasoning Accuracy ↑ |
| --- | --- | --- |
| Ground-truth reward, no entropy penalty | Baseline | 42% |
| Ground-truth reward + strong entropy regularization | -15% | 44% (no significant gain) |
| Spurious reward, default clipping | -22% | 48% |
| Spurious reward + tighter clipping (lower threshold) | -30% | 52% |
| Spurious reward + explicit entropy minimization | -35% | 53% |

  • Clipping bias grows when the clipping threshold is tightened, which automatically reduces policy entropy under spurious rewards.
  • Entropy regularization alone (without spurious rewards) yields only marginal improvements, confirming that entropy reduction is necessary but not sufficient.
  • The reward‑misalignment model predicts that spurious rewards act like a “soft label” that discourages the model from over‑trusting the verifier’s noisy signal, leading to more robust reasoning. Empirical curves match the model’s predictions.

Practical Implications

  • Fine-tuning pipelines: When applying RLVR to LLMs for math or code generation, deliberately introduce a modest amount of reward noise (e.g., a reward based on partial syntactic checks) and tighten the clipping threshold. This combination yields more deterministic outputs without sacrificing the model's ability to explore alternative solution paths.
  • Entropy regularization: Use entropy penalties sparingly. Aggressive entropy minimization can hurt performance unless paired with spurious rewards.
  • Safety & alignment: Spurious rewards can be viewed as a safety valve that prevents the model from over‑optimizing a potentially flawed verifier, a useful trick when the verification logic is still under development.
  • Tooling: The released code integrates with popular RL libraries (TRL, Hugging Face Transformers) and provides a plug-and-play “RLVR-Clipping-Scheduler” that automatically adjusts clipping thresholds based on observed policy entropy; a toy version of this idea is sketched below.
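
The real scheduler's interface is documented in the authors' release; the toy class below only illustrates the idea of coupling the clip range to measured policy entropy. The target value, step size, and bounds are assumed, not taken from the paper.

```python
class EntropyAwareClipScheduler:
    """Toy scheduler: tighten the PPO clip range while policy entropy is above
    a target (pushing entropy down), relax it when entropy falls below it."""

    def __init__(self, clip_eps: float = 0.2, target_entropy: float = 1.0,
                 step: float = 0.01, min_eps: float = 0.05, max_eps: float = 0.3):
        self.clip_eps = clip_eps
        self.target_entropy = target_entropy
        self.step = step
        self.min_eps = min_eps
        self.max_eps = max_eps

    def update(self, observed_entropy: float) -> float:
        """Call once per training iteration with the current mean token entropy."""
        if observed_entropy > self.target_entropy:
            self.clip_eps -= self.step  # tighter clipping drives entropy down
        else:
            self.clip_eps += self.step  # relax clipping to avoid entropy collapse
        self.clip_eps = min(self.max_eps, max(self.min_eps, self.clip_eps))
        return self.clip_eps
```

Each iteration, the returned `clip_eps` would be passed to the clipped objective used for policy updates.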

Limitations & Future Work

  • Experiments are limited to synthetic math benchmarks; real‑world tasks (e.g., legal reasoning, scientific literature synthesis) may exhibit different verifier noise characteristics.
  • The analysis assumes stationary reward distributions; in practice, verifiers evolve during deployment, which could change the optimal clipping/entropy balance.
  • The reward‑misalignment model currently treats spurious rewards as a simple additive noise term; richer models (e.g., contextual mis‑alignment) are left for future research.
  • Scaling to hundreds of billions of parameters remains untested; the authors hypothesize that the same dynamics hold but plan to validate on next‑generation LLMs.

Bottom line: By demystifying why “bad” rewards and “low‑entropy” policies can both boost LLM reasoning, this work gives developers concrete knobs to turn in RLVR pipelines—tightening clipping and allowing a controlled amount of reward noise—to get more reliable, deterministic, and mathematically capable language models.

Authors

  • Peter Chen
  • Xiaopeng Li
  • Ziniu Li
  • Wotao Yin
  • Xi Chen
  • Tianyi Lin

Paper Information

  • arXiv ID: 2512.16912v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: December 18, 2025
  • PDF: https://arxiv.org/pdf/2512.16912v1