[Paper] On-Policy Self-Distillation for Reasoning Compression

Published: March 5, 2026 at 12:54 PM EST
4 min read
Source: arXiv

Overview

Large language models (LLMs) that solve math or logic problems often “think out loud,” generating long chains of reasoning that contain a lot of filler and even harmful noise. The paper On‑Policy Self‑Distillation for Reasoning Compression (OPSDC) proposes a surprisingly simple trick: let the model teach itself to be concise by distilling its own “be concise” outputs back into its normal generation process. The result is dramatically shorter reasoning traces—up to 60 % fewer tokens—while boosting accuracy on challenging math benchmarks.

Key Contributions

  • Self‑distillation pipeline that requires no external labels, token‑budget constraints, or difficulty estimators.
  • Dynamic compression: easy problems are aggressively shortened, hard problems retain necessary deliberation.
  • Empirical gains on state‑of‑the‑art LLMs (Qwen‑3 8B/14B):
    • 57–59 % token reduction on MATH‑500 with +9–16 percentage-point absolute accuracy gains.
    • 41 % token reduction on AIME 2024 with a +10 percentage-point accuracy gain for the 14B model.
  • Insight that much of the verbose reasoning generated by LLMs is not just redundant but actively harmful, propagating errors.

Methodology

  1. Two‑phase rollout
    • Teacher pass: Prompt the same model with an explicit “be concise” instruction and let it generate a short reasoning trace.
    • Student pass: Let the model generate its usual (potentially verbose) reasoning without the instruction.
  2. Reverse KL distillation
    • For each token in the student’s rollout, compute the reverse Kullback‑Leibler (KL) divergence between the student’s token distribution and the teacher’s distribution (the softmax of the logits from the concise pass).
    • Minimize this loss, effectively nudging the student to mimic the teacher’s concise style while still preserving the original answer.
  3. On‑policy learning
    • Both teacher and student are the same model, so the distillation stays “on‑policy” (no external teacher model needed).
  4. No extra supervision
    • The process does not require ground‑truth solutions, token‑budget hyper‑parameters, or a separate difficulty estimator; the model’s own concise output serves as the supervision signal.
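The core of steps 1–2 can be sketched as a per-token reverse KL loss. The snippet below is a minimal illustration under stated assumptions, not the paper’s implementation: it assumes both passes expose full next-token probability distributions over the vocabulary (the teacher’s obtained by rescoring the student’s rollout under the “be concise” prompt), and the function names are invented for this sketch.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """Reverse KL divergence D_KL(student || teacher) at one token
    position. Reverse (mode-seeking) KL strongly penalizes the student
    for placing probability mass where the concise teacher places
    little, which is what nudges the verbose rollout toward the
    teacher's shorter style."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def distillation_loss(student_dists, teacher_dists):
    """Mean reverse KL over the token positions of the student's
    (verbose) rollout, paired position-by-position with the teacher's
    distributions for the same rollout."""
    per_token = [reverse_kl(s, t)
                 for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token)

# Toy example: a 3-word vocabulary and two positions of a rollout.
student = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
teacher = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
loss = distillation_loss(student, teacher)
```

In practice this loss would be minimized by gradient descent on the student’s parameters; because teacher and student share the same weights, the extra compute is essentially the second forward pass, which matches the paper’s “on‑policy” framing.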

Results & Findings

Benchmark    Model        Token Reduction    Accuracy Δ (abs.)
MATH‑500     Qwen‑3‑8B    ~57 %              +9 pts
MATH‑500     Qwen‑3‑14B   ~59 %              +16 pts
AIME 2024    Qwen‑3‑14B   ~41 %              +10 pts
  • Compression is adaptive: easy questions see the biggest cuts, while harder ones keep more tokens, preserving the depth of reasoning where it matters.
  • Error propagation is mitigated: by removing unnecessary steps, the model has fewer opportunities to introduce mistakes, leading to higher final scores.
  • Training cost is modest: because the teacher and student are the same model, the extra compute is limited to a second forward pass per example.
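For a sense of scale, the “token reduction” column is just the relative drop in average trace length. The trace lengths below are hypothetical, chosen only to illustrate the arithmetic behind a ~57 % cut; they do not come from the paper.

```python
def token_reduction(baseline_tokens, compressed_tokens):
    """Percent reduction in average reasoning-trace length."""
    return 100.0 * (1.0 - compressed_tokens / baseline_tokens)

# Hypothetical averages: a 4000-token verbose trace compressed
# to 1720 tokens corresponds to a ~57 % reduction.
reduction = token_reduction(4000, 1720)
```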

Practical Implications

  • Faster inference & lower cost – Shorter token sequences mean reduced API usage, lower latency, and cheaper compute for any service that relies on LLM reasoning (e.g., tutoring bots, code‑assistants, scientific QA).
  • Cleaner logs for debugging – Concise reasoning traces are easier for engineers to inspect, audit, and fine‑tune.
  • Improved reliability in downstream pipelines – When LLM outputs feed into other systems (e.g., automated grading, symbolic solvers), fewer spurious tokens reduce the chance of downstream parsing errors.
  • Plug‑and‑play – OPSDC can be added as a post‑training fine‑tuning step to any existing decoder‑only model without needing a separate teacher model or curated datasets.

Limitations & Future Work

  • The approach still depends on the model’s ability to follow the “be concise” instruction; very small or poorly calibrated models may not generate useful teacher traces.
  • Reverse KL distillation focuses on matching teacher logits token‑wise, which may overlook higher‑level structural aspects of reasoning (e.g., logical flow).
  • The paper evaluates primarily on math benchmarks; extending to other reasoning domains (code generation, commonsense QA) remains an open question.
  • Future research could explore curriculum‑style self‑distillation, where the conciseness instruction is gradually relaxed, or combine OPSDC with external knowledge‑bases to further boost correctness.

Authors

  • Hejian Sang
  • Yuanda Xu
  • Zhengze Zhou
  • Ran He
  • Zhipeng Wang
  • Jiachen Sun

Paper Information

  • arXiv ID: 2603.05433v1
  • Categories: cs.LG
  • Published: March 5, 2026