[Paper] On-Policy Self-Distillation for Reasoning Compression
Source: arXiv - 2603.05433v1
Overview
Large language models (LLMs) that solve math or logic problems often “think out loud,” generating long chains of reasoning padded with filler and even actively harmful noise. The paper On‑Policy Self‑Distillation for Reasoning Compression (OPSDC) proposes a surprisingly simple trick: let the model teach itself to be concise by distilling its own “be concise” outputs back into its normal generation process. The result is dramatically shorter reasoning traces (up to roughly 60% fewer tokens) alongside higher accuracy on challenging math benchmarks.
Key Contributions
- Self‑distillation pipeline that requires no external labels, token‑budget constraints, or difficulty estimators.
- Dynamic compression: easy problems are aggressively shortened, hard problems retain necessary deliberation.
- Empirical gains on state‑of‑the‑art LLMs (Qwen‑3 8B/14B):
- 57‑59 % token reduction on MATH‑500 with +9‑16 % absolute accuracy.
- 41 % token reduction on AIME 2024 with +10 % accuracy for the 14B model.
- Insight that much of the verbose reasoning generated by LLMs is not just redundant but actively harmful, propagating errors.
Methodology
- Two‑phase rollout
- Teacher pass: Prompt the same model with an explicit “be concise” instruction and let it generate a short reasoning trace.
- Student pass: Let the model generate its usual (potentially verbose) reasoning without the instruction.
- Reverse KL distillation
- For each token in the student’s rollout, compute the reverse Kullback‑Leibler (KL) divergence between the student’s token distribution and the teacher’s token distribution (the softmax of the logits from the concise pass).
- Minimize this loss, effectively nudging the student to mimic the teacher’s concise style while still preserving the original answer.
- On‑policy learning
- Both teacher and student are the same model, so the distillation stays “on‑policy” (no external teacher model needed).
- No extra supervision
- The process does not require ground‑truth solutions, token‑budget hyper‑parameters, or a separate difficulty estimator; the model’s own concise output serves as the supervision signal.
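The per‑token reverse KL objective described above can be sketched in plain Python. This is a toy illustration, not the authors' implementation: function names are hypothetical, logits are hand‑written lists rather than model outputs, and the real method backpropagates through the student's logits during training.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(student_logits, teacher_logits, eps=1e-12):
    """Reverse KL, i.e. KL(student || teacher), for one token position.

    The expectation is taken under the student's own distribution,
    which is what makes the distillation 'on-policy'.
    """
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(ps * math.log((ps + eps) / (pt + eps))
               for ps, pt in zip(p_s, p_t))

def self_distill_loss(student_logit_seq, teacher_logit_seq):
    """Mean per-token reverse KL over a student rollout."""
    pairs = list(zip(student_logit_seq, teacher_logit_seq))
    return sum(reverse_kl(s, t) for s, t in pairs) / len(pairs)

# Identical distributions incur zero loss; diverging from the
# concise teacher incurs a positive penalty.
same = [[1.0, 2.0, 3.0]]
print(self_distill_loss(same, same))                    # → 0.0
print(self_distill_loss([[1.0, 2.0, 3.0]],
                        [[3.0, 2.0, 1.0]]) > 0.0)       # → True
```

Using the reverse direction KL(student ∥ teacher), rather than the forward direction, is mode‑seeking: it pushes the student to concentrate probability where the concise teacher does, instead of spreading mass over everything the teacher might say.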
Results & Findings
| Benchmark | Model | Token Reduction | Accuracy Δ (abs.) |
|---|---|---|---|
| MATH‑500 | Qwen‑3‑8B | ~57 % | +9 pts |
| MATH‑500 | Qwen‑3‑14B | ~59 % | +16 pts |
| AIME 2024 | Qwen‑3‑14B | ~41 % | +10 pts |
- Compression is adaptive: easy questions see the biggest cuts, while harder ones keep more tokens, preserving the depth of reasoning where it matters.
- Error propagation is mitigated: by removing unnecessary steps, the model has fewer opportunities to introduce mistakes, leading to higher final scores.
- Training cost is modest: because the teacher and student are the same model, the extra compute is limited to a second forward pass per example.
Practical Implications
- Faster inference & lower cost – Shorter token sequences mean reduced API usage, lower latency, and cheaper compute for any service that relies on LLM reasoning (e.g., tutoring bots, code‑assistants, scientific QA).
- Cleaner logs for debugging – Concise reasoning traces are easier for engineers to inspect, audit, and fine‑tune.
- Improved reliability in downstream pipelines – When LLM outputs feed into other systems (e.g., automated grading, symbolic solvers), fewer spurious tokens reduce the chance of downstream parsing errors.
- Plug‑and‑play – OPSDC can be added as a post‑training fine‑tuning step to any existing decoder‑only model without needing a separate teacher model or curated datasets.
Limitations & Future Work
- The approach still depends on the model’s ability to follow the “be concise” instruction; very small or poorly calibrated models may not generate useful teacher traces.
- Reverse KL distillation focuses on matching teacher logits token‑wise, which may overlook higher‑level structural aspects of reasoning (e.g., logical flow).
- The paper evaluates primarily on math benchmarks; extending to other reasoning domains (code generation, commonsense QA) remains an open question.
- Future research could explore curriculum‑style self‑distillation, in which the conciseness instruction is gradually relaxed, or combine OPSDC with external knowledge bases to further boost correctness.
Authors
- Hejian Sang
- Yuanda Xu
- Zhengze Zhou
- Ran He
- Zhipeng Wang
- Jiachen Sun
Paper Information
- arXiv ID: 2603.05433v1
- Categories: cs.LG
- Published: March 5, 2026