[Paper] Beyond Distribution Sharpening: The Importance of Task Rewards
Source: arXiv - 2604.16259v1
Overview
The paper investigates why reinforcement learning (RL) with task-specific rewards makes large language models (LLMs) noticeably better, while a simpler technique called distribution sharpening, which merely concentrates the model's output probability on its own highest-confidence predictions, fails to deliver the same gains. By directly comparing the two approaches on small (3–4 B parameter) instruction-tuned LLMs, the authors show that reward-driven learning can unlock new capabilities, whereas sharpening only amplifies an already-biased distribution and can even become unstable.
Key Contributions
- Formal comparison of distribution sharpening vs. task‑reward RL, highlighting fundamental differences in their optimization landscapes.
- Theoretical analysis showing that sharpening can converge to sub-optimal solutions or become unstable, while reward-based RL aligns gradients with the actual task objective.
- Empirical study on three open‑source instruction‑tuned models (Llama‑3.2‑3B‑Instruct, Qwen2.5‑3B‑Instruct, Qwen3‑4B‑Instruct‑2507) across several math benchmarks.
- Demonstration that reward‑driven fine‑tuning yields sizable, stable performance improvements, whereas sharpening provides only marginal gains.
- Practical guidelines for developers on when and how to incorporate RL‑based reward signals into their model pipelines.
Methodology
- Define the two objectives
  - Distribution sharpening: add a regularizer that lowers output entropy (e.g., a KL divergence pulling the distribution toward a delta on the model's own top prediction), so the model becomes more confident in predictions it already makes.
  - Task-reward RL: use a scalar reward (e.g., correctness on a math problem) and apply a policy-gradient algorithm (PPO) to maximize expected reward.
- Unified implementation
  - Both objectives are optimized with the same RL engine (PPO) to ensure a fair comparison; the only difference is the reward function (sharpening vs. task reward).
- Benchmarks
  - A suite of arithmetic and algebra problems (e.g., GSM8K-lite, MATH-mini) is used to evaluate the models before and after fine-tuning.
- Metrics
  - Accuracy / exact match, reward-weighted score, and training-stability indicators (gradient variance, KL-drift).
- Analysis tools
  - Gradient-alignment visualizations, loss-surface plots, and a simple first-principles derivation showing why sharpening can push the model toward "over-confident but wrong" modes.
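The "only the reward function differs" setup can be sketched in miniature: a toy REINFORCE loop over a single arithmetic question, run twice with the two reward definitions. This is an illustrative sketch under simplifying assumptions (a small categorical policy instead of an LLM, plain REINFORCE instead of PPO), not the paper's implementation:

```python
import math
import random

random.seed(0)

# Toy setup: a categorical "policy" over candidate answers to "2 + 3 = ?".
# Logits stand in for an LLM's output distribution; the correct answer is 5.
ANSWERS = [4, 5, 6]
CORRECT = 5

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def task_reward(answer, probs):
    # Task-reward RL: scalar reward for correctness, independent of confidence.
    return 1.0 if answer == CORRECT else 0.0

def sharpening_reward(answer, probs):
    # Sharpening-as-RL: reward the model's confidence in its own sample,
    # regardless of whether the sample is correct.
    return math.log(probs[ANSWERS.index(answer)])

def reinforce_step(logits, reward_fn, lr=0.5, n_samples=200):
    # One REINFORCE update: average of reward * grad log pi(a) over samples.
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for _ in range(n_samples):
        idx = random.choices(range(len(ANSWERS)), weights=probs)[0]
        r = reward_fn(ANSWERS[idx], probs)
        for j in range(len(logits)):
            indicator = 1.0 if j == idx else 0.0
            grads[j] += r * (indicator - probs[j]) / n_samples
    return [l + lr * g for l, g in zip(logits, grads)]

# Start from a distribution that is confidently wrong (most mass on 4).
task_logits, sharp_logits = [2.0, 0.0, 0.0], [2.0, 0.0, 0.0]
for _ in range(50):
    task_logits = reinforce_step(task_logits, task_reward)
    sharp_logits = reinforce_step(sharp_logits, sharpening_reward)

print("task-reward policy:", [round(p, 2) for p in softmax(task_logits)])
print("sharpening policy: ", [round(p, 2) for p in softmax(sharp_logits)])
```

Running this, the task-reward run moves probability mass onto the correct answer, while the sharpening run merely amplifies the initial (wrong) mode, mirroring the paper's "over-confident but wrong" failure mode.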
Results & Findings
| Model | Baseline Acc. | Sharpening Δ | RL‑Reward Δ |
|---|---|---|---|
| Llama‑3.2‑3B‑Instruct | 42.1 % | +2.3 % (unstable) | +9.8 % (stable) |
| Qwen2.5‑3B‑Instruct | 38.7 % | +1.7 % (high variance) | +11.2 % |
| Qwen3‑4B‑Instruct‑2507 | 45.5 % | +2.0 % (occasional collapse) | +12.5 % |
- Sharpening consistently produced only modest improvements and sometimes caused the KL‑divergence to explode, leading to degenerate outputs.
- Reward‑based RL delivered double‑digit accuracy lifts and maintained low KL‑drift, confirming stable learning.
- Gradient‑alignment plots showed that the reward signal points toward directions that correct systematic errors (e.g., off‑by‑one arithmetic), whereas sharpening gradients merely amplify existing confidence, regardless of correctness.
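The KL-drift stability indicator referenced above can be monitored with a simple per-token estimate. The paper's exact estimator is not specified; the standard single-sample form is a reasonable sketch:

```python
def kl_drift(policy_logprobs, ref_logprobs):
    """Mean per-token KL(policy || reference), estimated from the log-probs
    each model assigns to tokens actually sampled from the current policy.

    Uses the standard single-sample estimator E_policy[log p - log q];
    a growing value signals the fine-tuned policy drifting away from
    the frozen reference model.
    """
    assert len(policy_logprobs) == len(ref_logprobs)
    diffs = [p - q for p, q in zip(policy_logprobs, ref_logprobs)]
    return sum(diffs) / len(diffs)

# Example: the policy has become slightly more confident than the
# frozen reference on its own sampled tokens (values are illustrative).
policy_lp = [-0.10, -0.25, -0.05]
ref_lp = [-0.30, -0.40, -0.20]
print(round(kl_drift(policy_lp, ref_lp), 3))  # 0.167 nats per token
```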
Practical Implications
- For developers building agents (e.g., code assistants, tutoring bots), incorporating a task‑specific reward is far more effective than simply “making the model more confident.”
- Fine‑tuning pipelines can reuse existing PPO implementations; the only extra engineering effort is designing a reliable reward function (e.g., automated test harnesses for code, answer checkers for math).
- Safety & alignment: Reward‑driven RL provides a clearer interpretability hook—developers can audit the reward function—while sharpening offers no guarantee that higher confidence aligns with truthfulness.
- Resource budgeting: The experiments used 3‑B‑scale models and a few hundred GPU hours, suggesting that even small teams can reap RL benefits without needing massive compute.
- Tooling: Open-source libraries like trl or OpenRLHF can be adapted directly; the paper's code release includes a "sharpening-as-RL" baseline for quick side-by-side testing.
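As a concrete illustration of the "reliable reward function" point, a minimal answer checker for math tasks can be just a normalizer plus an exact comparison. The function names here are illustrative, not from the paper's release:

```python
from fractions import Fraction

def normalize_answer(text):
    """Reduce a final-answer string to a canonical rational number."""
    cleaned = text.strip().replace(",", "").rstrip(".")
    try:
        # Fraction parses "42", "-3", "7/2", "3.50", and "1e2" directly.
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None

def math_reward(model_answer, gold_answer):
    """Scalar task reward: 1.0 for a correct final answer, else 0.0."""
    pred = normalize_answer(model_answer)
    gold = normalize_answer(gold_answer)
    return 1.0 if pred is not None and pred == gold else 0.0

print(math_reward(" 3.50 ", "7/2"))  # 1.0
print(math_reward("42", "41"))       # 0.0
```

Comparing as exact rationals rather than raw strings makes the reward robust to surface-form differences ("3.50" vs. "7/2"), which is exactly the kind of reliability the reward signal needs before it can drive PPO updates.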
Limitations & Future Work
- The study focuses on math reasoning; other domains (e.g., dialog, code generation) may exhibit different dynamics.
- Only small‑to‑mid‑size models (≤4 B parameters) were evaluated; scaling behavior on 70 B+ models remains an open question.
- Reward design still requires task‑specific engineering; automating reward synthesis (e.g., via LLM‑generated evaluators) is a promising direction.
- The authors note that distribution sharpening could be combined with reward‑based RL as a regularizer to improve exploration—future work could explore hybrid objectives.
Bottom line: If you’re looking to push a language model from “good at answering” to “good at doing a task,” the paper makes a compelling case for investing in a well‑crafted reward signal and RL fine‑tuning, rather than relying on confidence‑boosting tricks alone.
Authors
- Sarthak Mittal
- Leo Gagnon
- Guillaume Lajoie
Paper Information
- arXiv ID: 2604.16259v1
- Categories: cs.LG, cs.AI
- Published: April 17, 2026