[Paper] Reinforcement Learning from Rich Feedback with Distributional DAgger

Published: (June 3, 2026 at 01:54 PM EDT)
Source: arXiv - 2606.05152v1

Overview

The paper “Reinforcement Learning from Rich Feedback with Distributional DAgger” tackles a practical bottleneck in today’s RL‑for‑reasoning pipelines: they usually rely on a single binary reward (right or wrong) for each generated answer. In many real‑world scenarios (code execution, tool use, expert corrections, or model self‑evaluations), much richer signals are available. The authors propose a new imitation‑learning‑style algorithm, DistIL, that ingests these detailed feedback streams and turns them into more efficient, monotonic policy improvements.

Key Contributions

  • Distributional DAgger: Extends the classic DAgger algorithm to work with expert distributions over future states instead of a single deterministic expert action.
  • Forward cross‑entropy objective: Derives a simple, black‑box‑compatible loss that propagates disagreement between the learner and the expert back through the decision sequence, enabling fine‑grained credit assignment.
  • Theoretical guarantees: Proves that the forward cross‑entropy loss ensures monotonic policy improvement and provides regret bounds, unlike prior self‑distillation approaches based on reverse KL or Jensen‑Shannon divergences.
  • Lower bound on teacher‑weighted success likelihood: Shows that the objective optimizes a bound that directly improves metrics such as Pass@N for code generation and reasoning tasks.
  • Empirical validation: Demonstrates consistent gains over standard RL‑with‑verifiable‑rewards (RLVR) and self‑distillation baselines across three challenging domains: scientific reasoning, programming (code synthesis), and hard mathematical problem solving.
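
To make the contrast between forward cross‑entropy and reverse KL concrete, the two objectives can be written per state as below. The notation is an assumption made for illustration: π_θ is the learner policy, π^E the expert distribution, and d^{π_θ} the distribution of states visited by the learner's own roll‑outs. The paper's exact formulation, which works with expert distributions over future states, may differ.

```latex
% Forward cross-entropy: expectation under the expert distribution
\mathcal{L}_{\mathrm{fwd}}(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta}}\Big[-\sum_{a} \pi^{E}(a \mid s)\,\log \pi_\theta(a \mid s)\Big]
  = \mathbb{E}_{s \sim d^{\pi_\theta}}\Big[H\big(\pi^{E}(\cdot \mid s)\big)
      + \mathrm{KL}\big(\pi^{E}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\Big]

% Reverse KL: expectation under the learner distribution
\mathcal{L}_{\mathrm{rev}}(\theta)
  = \mathbb{E}_{s \sim d^{\pi_\theta}}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi^{E}(\cdot \mid s)\big)\Big]
  = \mathbb{E}_{s \sim d^{\pi_\theta}}\Big[\sum_{a} \pi_\theta(a \mid s)\,
      \log\frac{\pi_\theta(a \mid s)}{\pi^{E}(a \mid s)}\Big]
```

If the visited states are treated as fixed data, the entropy term H(π^E) does not depend on θ, so minimizing the forward cross‑entropy is the same as minimizing the forward KL. The inner sum weights each action by the expert's probability, whereas reverse KL weights it by the learner's own probability.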

Methodology

  1. Rich feedback as an expert distribution – When the current policy generates a trajectory (e.g., a sequence of reasoning steps or code lines), the system collects auxiliary signals (execution traces, tool outputs, expert edits). These signals are used to construct a distribution over the “correct” next actions rather than a single ground‑truth label.

  2. Distributional DAgger loop

    • Roll‑out: Run the learner policy to collect a batch of trajectories.
    • Query expert: For each visited state, query the black‑box expert (which could be a human, a more powerful model, or a simulator) to obtain the distribution over desirable next actions.
    • Update: Minimize the forward cross‑entropy between the learner’s action distribution and the expert’s distribution. This is equivalent to maximizing the likelihood that the learner will follow the expert’s “good” choices in the future.

  3. Credit assignment – Because the loss is forward‑looking, the gradient of a disagreement at a later step flows back to earlier decisions, letting the learner understand how early choices affect downstream success.

  4. Monotonic improvement proof – By framing the update as a forward KL (cross‑entropy) minimization, the authors show that each iteration cannot degrade the expected reward under the expert’s distribution, unlike reverse‑KL‑based self‑distillation, which can inadvertently increase the probability of bad actions.
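
The roll‑out, query, and update steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the five‑state environment, the `expert_distribution` function (a stand‑in for the black‑box expert), the tabular softmax policy, and all hyperparameters are hypothetical. It also uses expert distributions over next actions only, which is simpler than the paper's distributions over future states.

```python
import math
import random

N_STATES, N_ACTIONS, HORIZON = 5, 3, 6  # hypothetical toy problem sizes

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expert_distribution(state):
    """Hypothetical black-box expert: 0.8 mass on one preferred action per state."""
    preferred = state % N_ACTIONS
    rest = 0.2 / (N_ACTIONS - 1)
    return [0.8 if a == preferred else rest for a in range(N_ACTIONS)]

def forward_cross_entropy(expert_probs, logits):
    """H(expert, learner) = -sum_a expert(a) * log learner(a)."""
    learner_probs = softmax(logits)
    return -sum(p * math.log(q) for p, q in zip(expert_probs, learner_probs))

def rollout(logits_table, rng):
    """Run the learner policy and return the states it visits."""
    state, visited = 0, []
    for _ in range(HORIZON):
        visited.append(state)
        weights = softmax(logits_table[state])
        action = rng.choices(range(N_ACTIONS), weights=weights)[0]
        state = (state + action + 1) % N_STATES  # toy deterministic transition
    return visited

def train(iterations=100, episodes=4, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    dataset = []  # aggregated (state, expert distribution) pairs, as in DAgger
    for _ in range(iterations):
        # Roll-out with the current learner, then query the expert on visited states.
        for _ in range(episodes):
            for s in rollout(logits_table, rng):
                dataset.append((s, expert_distribution(s)))
        # One gradient step on the mean forward cross-entropy over the dataset.
        # For softmax logits, d(loss)/d(logit_a) = learner(a) - expert(a).
        grads = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
        for s, expert_probs in dataset:
            learner_probs = softmax(logits_table[s])
            for a in range(N_ACTIONS):
                grads[s][a] += (learner_probs[a] - expert_probs[a]) / len(dataset)
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                logits_table[s][a] -= lr * grads[s][a]
    return logits_table, dataset

def mean_loss(logits_table, dataset):
    total = sum(forward_cross_entropy(p, logits_table[s]) for s, p in dataset)
    return total / len(dataset)
```

Calling `train()` returns the learned logits and the aggregated dataset; comparing `mean_loss` for the learned table against an all‑zero table on that dataset shows how far the learner has moved toward the expert on the states it actually visits.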

Results & Findings

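| Domain                              | Baseline (RLVR) | Self‑Distillation | DistIL (proposed) |
|-------------------------------------|-----------------|-------------------|-------------------|
| Scientific reasoning (benchmark QA) | 62.4 % Pass@1   | 64.1 %            |                   |
| Code synthesis (HumanEval)          | 48.7 % Pass@1   | 51.3 %            |                   |
| Hard math problems (MATH)           | 31.2 % Pass@1   | 33.0 %            |                   |

Pass@N (N=10): improvements of up to +9 % over RLVR.
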
  • Monotonicity: In ablation studies, forward cross‑entropy never decreased the expert‑weighted reward across 10k updates, whereas reverse‑KL updates caused occasional regressions.
  • Sample efficiency: DistIL reached comparable performance to RLVR with roughly 30 % fewer environment interactions, thanks to richer credit assignment.
  • Robustness to noisy experts: Even when the expert distribution was partially corrupted (simulating imperfect human feedback), DistIL maintained stable gains, while self‑distillation degraded sharply.
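
For readers unfamiliar with the Pass@1 and Pass@N metrics reported above, the sketch below shows the unbiased Pass@k estimator commonly used for code‑generation benchmarks: given n sampled solutions of which c are correct, it estimates the probability that at least one of k samples passes. Whether the paper uses exactly this estimator is an assumption; the summary does not describe the evaluation protocol.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 samples and c = 3 correct, `pass_at_k(10, 3, 1)` equals the plain success rate of 0.3, and the value grows toward 1 as k increases.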

Practical Implications

  • Developer tooling: For code‑generation assistants (e.g., Copilot‑style models), integrating execution traces or compiler diagnostics as feedback can be done via DistIL without redesigning the whole RL pipeline.
  • Tool‑augmented agents: Agents that call external APIs, databases, or calculators can feed the API responses back into the training loop, turning each tool call into a richer supervisory signal.
  • Reduced reliance on binary reward engineering: Teams can rely less on hand‑crafted, brittle reward functions, since any observable outcome (test pass/fail, human edit, confidence score) can in principle be transformed into an expert distribution.
  • Faster iteration cycles: Because DistIL extracts more learning signal per interaction, product teams can iterate on model improvements with fewer costly compute‑heavy roll‑outs.
  • Compatibility: The forward cross‑entropy loss works with any black‑box expert, meaning existing pipelines that already collect logs or human corrections can adopt DistIL with minimal code changes.
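
One way to picture how an observable outcome becomes an expert distribution is a softmax over per‑candidate scores, for instance the fraction of unit tests each candidate continuation passes. This is only a sketch under assumptions: the function name `outcomes_to_distribution`, the scoring scheme, and the temperature parameter are hypothetical, and the summary does not say how the paper constructs its expert distributions.

```python
import math

def outcomes_to_distribution(scores, temperature=1.0):
    """Map candidate -> score (e.g. fraction of tests passed) to a probability
    distribution via a temperature-scaled softmax. Hypothetical helper."""
    if not scores:
        raise ValueError("scores must not be empty")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    best = max(scores.values())
    # Subtract the best score before exponentiating for numerical stability.
    weights = {k: math.exp((v - best) / temperature) for k, v in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Usage: three candidate code edits scored by the fraction of tests they pass.
expert = outcomes_to_distribution({"edit_a": 1.0, "edit_b": 0.5, "edit_c": 0.0},
                                  temperature=0.25)
```

A lower temperature concentrates mass on the best‑scoring candidate, while a higher one keeps the distribution closer to uniform, which may be preferable when the scores themselves are noisy.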

Limitations & Future Work

  • Expert distribution quality: The method assumes the expert can provide a reasonably calibrated distribution. Poorly estimated distributions (e.g., noisy human annotations) can limit gains.
  • Scalability of expert queries: In very large state spaces, querying the expert for every visited state may become a bottleneck; the paper suggests sampling strategies but leaves systematic exploration to future work.
  • Extension to continuous action spaces: The current formulation focuses on discrete token‑level decisions; adapting the forward cross‑entropy to continuous control (e.g., robotics) remains an open challenge.
  • Long‑horizon credit assignment: While forward propagation improves credit assignment, extremely long reasoning chains may still suffer from vanishing gradients; hierarchical or memory‑augmented variants are proposed as next steps.

Overall, the paper offers a compelling, theoretically grounded recipe for turning the abundant “side‑channel” feedback that modern AI systems generate into concrete performance gains, an advance that should resonate with developers building next‑generation reasoning and code‑generation tools.

Authors

  • Rishabh Agrawal
  • Jacob Fein-Ashley
  • Paria Rashidinejad

Paper Information

  • arXiv ID: 2606.05152v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: June 3, 2026