[Paper] Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Published: June 17, 2026 at 01:54 PM EDT
Source: arXiv - 2606.19327v1

Overview

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved.

We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student’s own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. Evaluated on a diverse suite of science reasoning benchmarks, rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.
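The contrast between a scalar reward and token-level guidance can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not state the exact objective, so the KL direction, the toy logits, and the helper names (`grpo_advantages`, `per_token_kl`) are all assumptions made for illustration. The scalar side uses one common form of group-normalized advantage, and the token-level side uses a per-position KL divergence between a teacher distribution and a student distribution on the same student-sampled trajectory.

```python
import math
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-normalized scalar advantage: one number per sampled response.

    Every token of a response shares this single value, so the signal
    cannot say which part of the response was good or bad.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at each position of one trajectory.

    Assumption: forward KL is used here only as an example of a
    token-level distillation signal; the paper may use another objective.
    """
    kls = []
    for t_row, s_row in zip(teacher_logits, student_logits):
        lp, lq = log_softmax(t_row), log_softmax(s_row)
        kls.append(sum(math.exp(a) * (a - b) for a, b in zip(lp, lq)))
    return kls

# Toy data (hypothetical): 4 sampled responses with binary rewards.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Toy data (hypothetical): a 3-token trajectory over a 3-word vocabulary.
# The rubric-conditioned teacher disagrees with the student only at position 1.
teacher_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
student_logits = [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
kls = per_token_kl(teacher_logits, student_logits)

print("scalar advantages per response:", advantages)
print("per-token KL along one trajectory:", kls)
```

In this toy setup the per-token signal is concentrated at the one position where teacher and student disagree, whereas the scalar advantage assigns the same value to every token of a response.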

Key Contributions

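According to the abstract, the paper makes the following contributions:

  • Rubric-Conditioned Self-Distillation, a framework that uses rubrics as structured, fine-grained feedback for on-policy self-distillation.
  • A two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner.
  • An evaluation on a diverse suite of science reasoning benchmarks, with average gains of 1.0 points over GRPO and 0.9 points over OPSD.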

Methodology

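The method is instantiated as a two-stage pipeline: it first learns to generate task-specific rubrics, then trains a rubric-guided reasoner in which a rubric-conditioned teacher provides token-level guidance on the student's own sampled trajectories. Please refer to the full paper for the detailed methodology.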
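As a rough illustration of how the teacher in the two-stage pipeline from the overview might be conditioned on a rubric, the sketch below builds separate prompts for the student and the teacher. Everything here is hypothetical: the `Rubric` class, the prompt wording, the example criteria, and the choice to hide the rubric from the student are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rubric:
    """Hypothetical container for criterion-level rubric items (stage 1 output)."""
    criteria: List[str]

def build_student_prompt(question: str) -> str:
    """Assumption: the student samples its trajectory from the question alone."""
    return f"Question: {question}\nAnswer:"

def build_teacher_prompt(question: str, rubric: Rubric) -> str:
    """Assumption: the teacher sees the same question plus the rubric criteria."""
    items = "\n".join(f"{i}. {c}" for i, c in enumerate(rubric.criteria, start=1))
    return (
        f"Question: {question}\n"
        f"A strong response should satisfy:\n{items}\n"
        f"Answer:"
    )

# Hypothetical example question and rubric, for illustration only.
question = "Why does ice float on water?"
rubric = Rubric(criteria=[
    "States that ice is less dense than liquid water.",
    "Explains the role of hydrogen bonding in the crystal structure.",
])

student_prompt = build_student_prompt(question)
teacher_prompt = build_teacher_prompt(question, rubric)
print(teacher_prompt)
```

Under these assumptions, the teacher would score the student's sampled tokens using `teacher_prompt` as context, while the student is trained using only `student_prompt`.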

Practical Implications

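The work suggests that rubrics can supervise reasoning models without relying on a single, possibly noisy reference rationale, while offering finer-grained credit assignment than a scalar reward.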

Authors

  • Siyi Gu
  • Jialin Chen
  • Sophia Zhou
  • Arman Cohan
  • Rex Ying

Paper Information

  • arXiv ID: 2606.19327v1
  • Categories: cs.AI, cs.CL
  • Published: June 17, 2026