[Paper] Beyond Distribution Sharpening: The Importance of Task Rewards
Source: arXiv - 2604.16259v1
Overview
The paper investigates why reinforcement learning (RL) with task-specific rewards makes large language models (LLMs) noticeably better, while a simpler technique called distribution sharpening, which merely concentrates the model's output probability on its own highest-confidence predictions, fails to deliver the same gains. By directly comparing the two approaches on small (3–4 B parameter) instruction-tuned LLMs, the authors show that reward-driven learning can unlock new capabilities, whereas sharpening only amplifies an already-biased distribution and can even become unstable.
Key Contributions
- Formal comparison of distribution sharpening vs. task‑reward RL, highlighting fundamental differences in their optimization landscapes.
- Theoretical analysis showing that sharpening can converge to sub-optimal solutions or become unstable, while reward-based RL aligns gradients with the actual task objective.
- Empirical study on three open‑source instruction‑tuned models (Llama‑3.2‑3B‑Instruct, Qwen2.5‑3B‑Instruct, Qwen3‑4B‑Instruct‑2507) across several math benchmarks.
- Demonstration that reward‑driven fine‑tuning yields sizable, stable performance improvements, whereas sharpening provides only marginal gains.
- Practical guidelines for developers on when and how to incorporate RL‑based reward signals into their model pipelines.
Methodology
- Define the two objectives
  - Distribution sharpening: add a regularizer that lowers output entropy (e.g., a KL divergence pulling the distribution toward a delta on the model's own top prediction), so the model becomes more confident in predictions it already makes.
  - Task-reward RL: use a scalar reward (e.g., correctness on a math problem) and apply a policy-gradient algorithm (PPO) to maximize expected reward.
- Unified implementation
  - Both objectives are optimized with the same RL engine (PPO) to ensure a fair comparison; the only difference is the reward function (sharpening vs. task reward).
- Benchmarks
  - A suite of arithmetic and algebra problems (e.g., GSM8K-lite, MATH-mini) is used to evaluate the models before and after fine-tuning.
- Metrics
  - Accuracy / exact match, reward-weighted score, and training-stability indicators (gradient variance, KL-drift).
- Analysis tools
  - Gradient-alignment visualizations, loss-surface plots, and a simple first-principles derivation showing why sharpening can push the model toward "over-confident but wrong" modes.
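The "only the reward function differs" setup can be sketched in miniature: a toy REINFORCE loop over a single arithmetic question, run twice with the two reward definitions. This is an illustrative sketch under simplifying assumptions (a small categorical policy instead of an LLM, plain REINFORCE instead of PPO), not the paper's implementation:

```python
import math
import random

random.seed(0)

# Toy setup: a categorical "policy" over candidate answers to "2 + 3 = ?".
# Logits stand in for an LLM's output distribution; the correct answer is 5.
ANSWERS = [4, 5, 6]
CORRECT = 5

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def task_reward(answer, probs):
    # Task-reward RL: scalar reward for correctness, independent of confidence.
    return 1.0 if answer == CORRECT else 0.0

def sharpening_reward(answer, probs):
    # Sharpening-as-RL: reward the model's confidence in its own sample,
    # regardless of whether the sample is correct.
    return math.log(probs[ANSWERS.index(answer)])

def reinforce_step(logits, reward_fn, lr=0.5, n_samples=200):
    # One REINFORCE update: average of reward * grad log pi(a) over samples.
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for _ in range(n_samples):
        idx = random.choices(range(len(ANSWERS)), weights=probs)[0]
        r = reward_fn(ANSWERS[idx], probs)
        for j in range(len(logits)):
            indicator = 1.0 if j == idx else 0.0
            grads[j] += r * (indicator - probs[j]) / n_samples
    return [l + lr * g for l, g in zip(logits, grads)]

# Start from a distribution that is confidently wrong (most mass on 4).
task_logits, sharp_logits = [2.0, 0.0, 0.0], [2.0, 0.0, 0.0]
for _ in range(50):
    task_logits = reinforce_step(task_logits, task_reward)
    sharp_logits = reinforce_step(sharp_logits, sharpening_reward)

print("task-reward policy:", [round(p, 2) for p in softmax(task_logits)])
print("sharpening policy: ", [round(p, 2) for p in softmax(sharp_logits)])
```

Running this, the task-reward run moves probability mass onto the correct answer, while the sharpening run merely amplifies the initial (wrong) mode, mirroring the paper's "over-confident but wrong" failure mode.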
Results & Findings
| Model | Baseline Acc. | Sharpening Δ | RL‑Reward Δ |
|---|---|---|---|
| Llama‑3.2‑3B‑Instruct | 42.1 % | +2.3 % (unstable) | +9.8 % (stable) |
| Qwen2.5‑3B‑Instruct | 38.7 % | +1.7 % (high variance) | +11.2 % |
| Qwen3‑4B‑Instruct‑2507 | 45.5 % | +2.0 % (occasional collapse) | +12.5 % |
- Sharpening consistently produced only modest improvements and sometimes caused the KL‑divergence to explode, leading to degenerate outputs.
- Reward‑based RL delivered double‑digit accuracy lifts and maintained low KL‑drift, confirming stable learning.
- Gradient‑alignment plots showed that the reward signal points toward directions that correct systematic errors (e.g., off‑by‑one arithmetic), whereas sharpening gradients merely amplify existing confidence, regardless of correctness.
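The KL-drift stability indicator referenced above can be monitored with a simple per-token estimate. The paper's exact estimator is not specified; the standard single-sample form is a reasonable sketch:

```python
def kl_drift(policy_logprobs, ref_logprobs):
    """Mean per-token KL(policy || reference), estimated from the log-probs
    each model assigns to tokens actually sampled from the current policy.

    Uses the standard single-sample estimator E_policy[log p - log q];
    a growing value signals the fine-tuned policy drifting away from
    the frozen reference model.
    """
    assert len(policy_logprobs) == len(ref_logprobs)
    diffs = [p - q for p, q in zip(policy_logprobs, ref_logprobs)]
    return sum(diffs) / len(diffs)

# Example: the policy has become slightly more confident than the
# frozen reference on its own sampled tokens (values are illustrative).
policy_lp = [-0.10, -0.25, -0.05]
ref_lp = [-0.30, -0.40, -0.20]
print(round(kl_drift(policy_lp, ref_lp), 3))  # 0.167 nats per token
```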
Practical Implications
- For developers building agents (e.g., code assistants, tutoring bots), incorporating a task‑specific reward is far more effective than simply “making the model more confident.”
- Fine‑tuning pipelines can reuse existing PPO implementations; the only extra engineering effort is designing a reliable reward function (e.g., automated test harnesses for code, answer checkers for math).
- Safety & alignment: Reward‑driven RL provides a clearer interpretability hook—developers can audit the reward function—while sharpening offers no guarantee that higher confidence aligns with truthfulness.
- Resource budgeting: The experiments used 3‑B‑scale models and a few hundred GPU hours, suggesting that even small teams can reap RL benefits without needing massive compute.
- Tooling: Open-source libraries like trl or OpenRLHF can be adapted directly; the paper's code release includes a "sharpening-as-RL" baseline for quick side-by-side testing.
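As a concrete illustration of the "reliable reward function" point, a minimal answer checker for math tasks can be just a normalizer plus an exact comparison. The function names here are illustrative, not from the paper's release:

```python
from fractions import Fraction

def normalize_answer(text):
    """Reduce a final-answer string to a canonical rational number."""
    cleaned = text.strip().replace(",", "").rstrip(".")
    try:
        # Fraction parses "42", "-3", "7/2", "3.50", and "1e2" directly.
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None

def math_reward(model_answer, gold_answer):
    """Scalar task reward: 1.0 for a correct final answer, else 0.0."""
    pred = normalize_answer(model_answer)
    gold = normalize_answer(gold_answer)
    return 1.0 if pred is not None and pred == gold else 0.0

print(math_reward(" 3.50 ", "7/2"))  # 1.0
print(math_reward("42", "41"))       # 0.0
```

Comparing as exact rationals rather than raw strings makes the reward robust to surface-form differences ("3.50" vs. "7/2"), which is exactly the kind of reliability the reward signal needs before it can drive PPO updates.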
Limitations & Future Work
- The study focuses on math reasoning; other domains (e.g., dialog, code generation) may exhibit different dynamics.
- Only small‑to‑mid‑size models (≤4 B parameters) were evaluated; scaling behavior on 70 B+ models remains an open question.
- Reward design still requires task‑specific engineering; automating reward synthesis (e.g., via LLM‑generated evaluators) is a promising direction.
- The authors note that distribution sharpening could be combined with reward‑based RL as a regularizer to improve exploration—future work could explore hybrid objectives.
Bottom line: If you’re looking to push a language model from “good at answering” to “good at doing a task,” the paper makes a compelling case for investing in a well‑crafted reward signal and RL fine‑tuning, rather than relying on confidence‑boosting tricks alone.
Authors
- Sarthak Mittal
- Leo Gagnon
- Guillaume Lajoie
Paper Information
- arXiv ID: 2604.16259v1
- Categories: cs.LG, cs.AI
- Published: April 17, 2026