[Paper] Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Published: June 17, 2026 at 01:54 PM EDT

Source: arXiv - 2606.19327v1

Overview

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verifiable rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved.

We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student’s own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks. The results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.
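
As a rough illustration of the contrast drawn in the abstract, the Python sketch below compares a scalar reward, which gives every token the same credit, with a per-token signal obtained by comparing the student's next-token distribution against that of a rubric-conditioned teacher on the same student-sampled trajectory. The abstract does not state the exact training objective, so the reverse KL divergence used here, the toy logits, and the names `softmax`, `per_token_kl`, and `scalar_signal` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_token_kl(student_logits, teacher_logits):
    """Return KL(student || teacher) at each position of one trajectory.

    Assumption: both models score the SAME student-sampled tokens; the
    teacher additionally sees the rubric in its context. The choice of
    reverse KL is hypothetical and only one possible distillation loss.
    """
    kls = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        p = softmax(s_row)  # student distribution at this position
        q = softmax(t_row)  # rubric-conditioned teacher distribution
        kls.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)))
    return kls


def scalar_signal(reward, num_tokens):
    """Scalar feedback: one number copied to every token position."""
    return [reward] * num_tokens


# Toy example: 3 token positions, vocabulary of size 3 (made-up logits).
student_logits = [
    [2.0, 0.5, 0.1],  # position 0: student and teacher agree
    [0.2, 1.5, 0.3],  # position 1: teacher strongly prefers another token
    [1.0, 1.0, 1.0],  # position 2: mild disagreement
]
teacher_logits = [
    [2.0, 0.5, 0.1],
    [2.5, 0.1, 0.2],
    [1.0, 1.2, 0.9],
]

token_signal = per_token_kl(student_logits, teacher_logits)
flat_signal = scalar_signal(1.0, len(student_logits))

print("per-token signal:", token_signal)
print("scalar signal:   ", flat_signal)
```

In this toy setting the per-token values differ by position, with the largest value where the teacher disagrees most, while the scalar signal is identical everywhere. That difference is the kind of finer-grained credit assignment the abstract attributes to rubric-conditioned guidance.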

Key Contributions

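Based on the abstract, the paper's main contributions are:

Rubric-Conditioned Self-Distillation, a framework that uses rubrics as structured, fine-grained feedback for on-policy self-distillation
A teacher model conditioned on criterion-level rubrics that gives token-level guidance on the student's own sampled trajectories
A two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner
An evaluation on science reasoning benchmarks, with reported average gains of 1.0 points over GRPO and 0.9 points over OPSD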

Methodology

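The abstract describes a two-stage pipeline. The first stage learns to generate task-specific rubrics. The second stage trains a rubric-guided reasoner: a teacher conditioned on criterion-level rubrics provides token-level guidance on trajectories sampled by the student itself. Please refer to the full paper for the detailed methodology.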

Practical Implications

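For post-training of reasoning models, the abstract presents rubrics as a source of criterion-level feedback that does not depend on a single reference rationale or on a scalar reward alone. The reported gains over GRPO and OPSD are averages on science reasoning benchmarks.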

Authors

Siyi Gu
Jialin Chen
Sophia Zhou
Arman Cohan
Rex Ying

Paper Information

arXiv ID: 2606.19327v1
Categories: cs.AI, cs.CL
Published: June 17, 2026
