[Paper] On-Policy Self-Distillation for Reasoning Compression

Published: March 5, 2026 at 12:54 PM EST
4 min read
Source: arXiv

Overview

Large language models (LLMs) that solve math or logic problems often “think out loud,” generating long chains of reasoning that contain a lot of filler and even harmful noise. The paper On‑Policy Self‑Distillation for Reasoning Compression (OPSDC) proposes a surprisingly simple trick: let the model teach itself to be concise by distilling its own “be concise” outputs back into its normal generation process. The result is dramatically shorter reasoning traces—up to 60 % fewer tokens—while boosting accuracy on challenging math benchmarks.

Key Contributions

  • Self‑distillation pipeline that requires no external labels, token‑budget constraints, or difficulty estimators.
  • Dynamic compression: easy problems are aggressively shortened, hard problems retain necessary deliberation.
  • Empirical gains on state‑of‑the‑art LLMs (Qwen‑3 8B/14B):
    • 57–59 % token reduction on MATH‑500 with +9–16 percentage-point absolute accuracy gains.
    • 41 % token reduction on AIME 2024 with a +10 percentage-point accuracy gain for the 14B model.
  • Insight that much of the verbose reasoning generated by LLMs is not just redundant but actively harmful, propagating errors.

Methodology

  1. Two‑phase rollout
    • Teacher pass: Prompt the same model with an explicit “be concise” instruction and let it generate a short reasoning trace.
    • Student pass: Let the model generate its usual (potentially verbose) reasoning without the instruction.
  2. Reverse KL distillation
    • For each token in the student’s rollout, compute the reverse Kullback‑Leibler (KL) divergence between the student’s token distribution and the teacher’s distribution (the softmax of the logits from the concise pass).
    • Minimize this loss, effectively nudging the student to mimic the teacher’s concise style while still preserving the original answer.
  3. On‑policy learning
    • Both teacher and student are the same model, so the distillation stays “on‑policy” (no external teacher model needed).
  4. No extra supervision
    • The process does not require ground‑truth solutions, token‑budget hyper‑parameters, or a separate difficulty estimator; the model’s own concise output serves as the supervision signal.
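The core of steps 1–2 can be sketched as a per-token reverse KL loss. The snippet below is a minimal illustration under stated assumptions, not the paper’s implementation: it assumes both passes expose full next-token probability distributions over the vocabulary (the teacher’s obtained by rescoring the student’s rollout under the “be concise” prompt), and the function names are invented for this sketch.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """Reverse KL divergence D_KL(student || teacher) at one token
    position. Reverse (mode-seeking) KL strongly penalizes the student
    for placing probability mass where the concise teacher places
    little, which is what nudges the verbose rollout toward the
    teacher's shorter style."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def distillation_loss(student_dists, teacher_dists):
    """Mean reverse KL over the token positions of the student's
    (verbose) rollout, paired position-by-position with the teacher's
    distributions for the same rollout."""
    per_token = [reverse_kl(s, t)
                 for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token)

# Toy example: a 3-word vocabulary and two positions of a rollout.
student = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
loss = distillation_loss(student, teacher)
```

In practice this loss would be minimized by gradient descent on the student’s parameters; because teacher and student share the same weights, the extra compute is essentially the second forward pass, which matches the paper’s “on‑policy” framing.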

Results & Findings

Benchmark    Model        Token Reduction    Accuracy Δ (abs.)
MATH‑500     Qwen‑3‑8B    ~57 %              +9 pts
MATH‑500     Qwen‑3‑14B   ~59 %              +16 pts
AIME 2024    Qwen‑3‑14B   ~41 %              +10 pts
  • Compression is adaptive: easy questions see the biggest cuts, while harder ones keep more tokens, preserving the depth of reasoning where it matters.
  • Error propagation is mitigated: by removing unnecessary steps, the model has fewer opportunities to introduce mistakes, leading to higher final scores.
  • Training cost is modest: because the teacher and student are the same model, the extra compute is limited to a second forward pass per example.
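For a sense of scale, the “token reduction” column is just the relative drop in average trace length. The trace lengths below are hypothetical, chosen only to illustrate the arithmetic behind a ~57 % cut; they do not come from the paper.

```python
def token_reduction(baseline_tokens, compressed_tokens):
    """Percent reduction in average reasoning-trace length."""
    return 100.0 * (1.0 - compressed_tokens / baseline_tokens)

# Hypothetical averages: a 4000-token verbose trace compressed
# to 1720 tokens corresponds to a ~57 % reduction.
reduction = token_reduction(4000, 1720)
```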

Practical Implications

  • Faster inference & lower cost – Shorter token sequences mean reduced API usage, lower latency, and cheaper compute for any service that relies on LLM reasoning (e.g., tutoring bots, code‑assistants, scientific QA).
  • Cleaner logs for debugging – Concise reasoning traces are easier for engineers to inspect, audit, and fine‑tune.
  • Improved reliability in downstream pipelines – When LLM outputs feed into other systems (e.g., automated grading, symbolic solvers), fewer spurious tokens reduce the chance of downstream parsing errors.
  • Plug‑and‑play – OPSDC can be added as a post‑training fine‑tuning step to any existing decoder‑only model without needing a separate teacher model or curated datasets.

Limitations & Future Work

  • The approach still depends on the model’s ability to follow the “be concise” instruction; very small or poorly calibrated models may not generate useful teacher traces.
  • Reverse KL distillation focuses on matching teacher logits token‑wise, which may overlook higher‑level structural aspects of reasoning (e.g., logical flow).
  • The paper evaluates primarily on math benchmarks; extending to other reasoning domains (code generation, commonsense QA) remains an open question.
  • Future research could explore curriculum‑style self‑distillation, where the conciseness instruction is gradually relaxed, or combine OPSDC with external knowledge‑bases to further boost correctness.

Authors

  • Hejian Sang
  • Yuanda Xu
  • Zhengze Zhou
  • Ran He
  • Zhipeng Wang
  • Jiachen Sun

Paper Information

  • arXiv ID: 2603.05433v1
  • Categories: cs.LG
  • Published: March 5, 2026