[Paper] LAD: Learning Advantage Distribution for Reasoning

Published: February 23, 2026 at 01:44 PM EST
5 min read
Source: arXiv - 2602.20132v1

Overview

The paper “LAD: Learning Advantage Distribution for Reasoning” proposes a new way to train large language models (LLMs) for complex reasoning tasks. Instead of the usual reinforcement‑learning (RL) objective that pushes the model to maximize a single expected reward, the authors introduce Learning Advantage Distribution (LAD), which teaches the model to match a distribution of advantages. This yields more diverse, reliable reasoning outputs while avoiding the “mode collapse” that often plagues RL‑fine‑tuned LLMs.

Key Contributions

  • Advantage‑distribution objective: Replaces classic advantage maximization with a distribution‑matching loss based on an f‑divergence between the policy’s output distribution and an advantage‑induced target distribution.
  • Theoretical equivalence proof: Shows that the optimal policy update in RL is mathematically equivalent to minimizing this divergence, grounding the method in solid theory.
  • Entropy‑free regularization: The LAD loss naturally discourages over‑confident probability spikes, eliminating the need for extra entropy bonuses common in other RL‑based fine‑tuning methods.
  • Zero extra compute: The algorithm adds no overhead compared with the state‑of‑the‑art GRPO (Group Relative Policy Optimization) and can be applied directly after standard LLM pre‑training.
  • Empirical validation: Demonstrates that LAD recovers multimodal advantage distributions in a synthetic bandit experiment and consistently improves both accuracy and output diversity on math‑ and code‑reasoning benchmarks across several LLM backbones.

Methodology

  1. Advantage‑induced distribution:

    • For each possible response (y) to a prompt, compute its advantage (A(y) = r(y) - V) (reward minus a baseline value).
    • Convert these advantages into a target probability distribution (p_A(y) \propto \exp(A(y))). High‑advantage responses get higher probability, but all advantageous alternatives retain some mass.
  2. Policy‑induced distribution:

    • The current LLM defines a probability distribution (p_\theta(y)) over responses via its softmax logits.
  3. LAD objective:

    • Minimize an f‑divergence (D_f(p_A \| p_\theta)). In practice the authors use the KL‑divergence, yielding the loss:

[ \mathcal{L}_{\text{LAD}} = \mathbb{E}_{y \sim p_A}\big[ \log p_A(y) - \log p_\theta(y) \big]. ]

  • Gradient descent on this loss pushes up the likelihood of high‑advantage answers while pulling down the probability of low‑advantage ones, without forcing the distribution to become overly peaked.
  4. Training pipeline:
    • Generate a set of candidate completions (e.g., via nucleus sampling).
    • Score each candidate with a task‑specific reward model (e.g., correctness of a math solution).
    • Compute advantages, form (p_A), and update the LLM using the LAD loss.

Because the loss only requires a forward pass to obtain rewards and a standard backward pass for the KL term, the method fits seamlessly into existing RL‑from‑human‑feedback or RL‑fine‑tuning loops.
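The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the mean‑reward baseline, the function names, and the plain‑list interface are all assumptions made for clarity.

```python
import math

def advantage_target(rewards):
    """Advantage-induced target distribution p_A(y) ∝ exp(A(y)).

    Uses the batch-mean reward as the baseline V (an assumption;
    the paper allows any baseline).
    """
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]
    m = max(advantages)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in advantages]
    z = sum(exps)
    return [e / z for e in exps]

def lad_loss(logp_theta, rewards):
    """KL(p_A || p_theta) up to a constant independent of theta.

    logp_theta: log p_theta(y_i) for each sampled completion y_i
    rewards:    scalar reward r(y_i) for each completion
    """
    p_A = advantage_target(rewards)
    # Only the cross-entropy term depends on theta; the entropy of p_A
    # is a constant, so minimizing this minimizes the KL divergence.
    return -sum(p * lp for p, lp in zip(p_A, logp_theta))
```

In an actual training loop, `logp_theta` would come from the model's log‑probabilities under autograd so the loss can be backpropagated; the sketch only shows the scalar computation.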

Results & Findings

| Experiment | Baseline | LAD | Δ Accuracy | Δ Diversity* |
| --- | --- | --- | --- | --- |
| Synthetic bandit (multimodal) | Collapsed to single arm | Recovered full multimodal advantage distribution | n/a | +0.42 (entropy) |
| GSM8K (math reasoning) – LLaMA‑2‑13B | 42.1 % | 45.8 % | +3.7 % | +0.18 |
| HumanEval (code generation) – CodeLlama‑7B | 31.4 % | 34.6 % | +3.2 % | +0.21 |
| Multi‑turn reasoning (MATH‑CoT) – GPT‑Neo‑2.7B | 27.9 % | 30.5 % | +2.6 % | +0.15 |

* Diversity measured by average token‑level entropy and the proportion of distinct valid solutions per prompt.
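The two diversity measures in the footnote are straightforward to approximate. The sketch below is illustrative only (the paper does not publish this code); the function names and the representation of per‑position token distributions as probability lists are assumptions.

```python
import math

def distinct_solution_ratio(completions):
    """Fraction of distinct completions among those sampled for one prompt."""
    return len(set(completions)) / len(completions)

def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-position token distributions.

    token_distributions: one probability list per generated token position.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)
```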

Key takeaways

  • Accuracy gains of 2–4 % across tasks, comparable to or better than entropy‑regularized RL methods.
  • Generative diversity improves noticeably, indicating that the model is less prone to outputting the same “safe” answer repeatedly.
  • In the controlled bandit setting, LAD perfectly matches the theoretical advantage distribution, confirming the correctness of the formulation.

Practical Implications

  • More robust LLM assistants: Developers building chatbots, tutoring systems, or code assistants can adopt LAD to obtain answers that are both correct and varied, reducing the risk of repetitive or overly conservative responses.
  • Zero‑cost fine‑tuning: Since LAD adds no extra forward passes beyond the usual reward evaluation, it can be slotted into existing RL‑HF pipelines without additional GPU budget.
  • Better exploration for safety‑critical domains: Preserving multiple high‑advantage reasoning paths can surface novel solutions that a single‑objective RL would miss (e.g., automated theorem proving, scientific discovery).
  • Simplified hyper‑parameter tuning: The method eliminates the need to balance an entropy coefficient, a common pain point when using PPO‑style RL for LLMs.

Limitations & Future Work

  • Reward model dependence: LAD’s performance hinges on the quality of the underlying reward estimator; biased or noisy rewards will directly shape the learned advantage distribution.
  • Scalability of candidate generation: The approach requires a modest set of sampled completions per prompt; extremely large models may need careful budgeting to keep this step tractable.
  • Theoretical focus on KL divergence: While the paper proves equivalence for a generic f‑divergence, experiments only explore KL. Exploring other divergences (e.g., reverse KL, α‑divergences) could yield different trade‑offs between exploration and exploitation.
  • Broader task coverage: Current evaluation centers on math and code reasoning; applying LAD to open‑ended generation (e.g., story writing) remains an open question.
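To make the divergence trade‑off concrete: forward KL (used in LAD) is mass‑covering and heavily penalizes a policy that drops a mode of the target, while reverse KL is mode‑seeking. The toy comparison below is illustrative only and not from the paper; the example distributions are made up.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A bimodal advantage-induced target and a policy collapsed onto one mode.
target = [0.45, 0.45, 0.10]
policy = [0.90, 0.05, 0.05]

forward = kl(target, policy)  # mass-covering: large penalty for the missed mode
reverse = kl(policy, target)  # mode-seeking: tolerates covering a single mode
```

Here `forward` exceeds `reverse`, reflecting why minimizing the forward KL pushes the policy to keep probability on all high‑advantage modes rather than collapsing.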

Overall, LAD offers a conceptually simple yet powerful tweak to RL‑based LLM fine‑tuning that can boost both correctness and creativity—an attractive proposition for developers looking to get more out of their language models without extra compute overhead.

Authors

  • Wendi Li
  • Sharon Li

Paper Information

  • arXiv ID: 2602.20132v1
  • Categories: cs.LG
  • Published: February 23, 2026