[Paper] On-Policy Context Distillation for Language Models
Source: arXiv - 2602.12275v1
Overview
The paper introduces On‑Policy Context Distillation (OPCD), a new way to teach language models to “remember” useful knowledge that they normally only see in the prompt. By letting a model learn from its own generated outputs while being guided by a teacher that has access to richer context, OPCD lets smaller or less‑trained models internalize both factual and procedural know‑how without sacrificing their ability to handle novel inputs.
Key Contributions
- On‑Policy Distillation for LMs – Merges the classic on‑policy RL distillation idea with context‑based teaching, training the student on its own trajectories rather than static datasets.
- Reverse KL Objective – Uses a reverse Kullback‑Leibler loss to align the student’s distribution with a context‑conditioned teacher, encouraging the student to adopt the teacher’s “thought process.”
- Experiential Knowledge Distillation – Shows how models can extract and consolidate reusable knowledge from their own past solution traces (e.g., previous math steps, game moves).
- System Prompt Distillation – Demonstrates that optimized prompts (often hand‑crafted or discovered via prompt‑engineering) can be baked into the model weights, removing the need for external prompting at inference time.
- Cross‑Size Distillation – Validates that a compact student can inherit experiential knowledge from a much larger teacher, enabling efficient model deployment.
- Broad Empirical Coverage – Benchmarks across mathematical reasoning, text‑based games, and domain‑specific tasks, consistently beating strong baselines while preserving out‑of‑distribution (OOD) performance.
Methodology
- Teacher & Student Setup – The teacher model receives the full context (e.g., a prompt plus any external knowledge) and generates a probability distribution over the next token. The student only sees the prompt (no extra context).
- On‑Policy Trajectory Generation – The student samples its own output sequences (its “policy”) on the training data. These self‑generated trajectories become the training examples.
- Reverse KL Distillation – For each student‑generated token, the loss is the reverse KL divergence KL(student || teacher). Because the expectation is taken under the student's own samples, this mode‑seeking objective penalizes the student for placing probability on tokens the context‑conditioned teacher deems unlikely, effectively teaching it to mimic the teacher's reasoning despite the missing context.
- Iterative Refinement – The process repeats: the student improves, generates better trajectories, and the teacher (fixed or slowly updated) continues to provide contextual guidance.
- Applications –
- Experiential Knowledge: The teacher is a version of the model that has access to its own historical solution traces; the student learns to embed those traces into its parameters.
- System Prompt: The teacher is prompted with an engineered prompt that yields desirable behavior; the student learns to reproduce that behavior without the prompt.
The whole pipeline is lightweight: it needs only forward passes through the teacher plus standard gradient updates on the student, requires no external reward model, and runs on standard GPU clusters.
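The reverse‑KL objective admits a simple on‑policy Monte Carlo form: sample tokens from the student and score each as log s(y) − log t(y); the average converges to KL(student || teacher) in expectation. The sketch below illustrates this with toy next‑token distributions in pure Python — all names and probabilities are illustrative, not taken from the paper:

```python
import math
import random

def reverse_kl(student, teacher):
    """Exact reverse KL: KL(student || teacher)."""
    return sum(p * math.log(p / teacher[y]) for y, p in student.items() if p > 0)

def sample(dist, rng):
    """Draw one token from a {token: probability} distribution."""
    r, acc = rng.random(), 0.0
    for y, p in dist.items():
        acc += p
        if r < acc:
            return y
    return y  # guard against floating-point rounding

# Toy next-token distributions: the teacher conditions on extra context
# (it "knows" the answer); the student sees only the bare prompt.
teacher = {"paris": 0.90, "london": 0.05, "rome": 0.05}
student = {"paris": 0.40, "london": 0.30, "rome": 0.30}

rng = random.Random(0)

# On-policy Monte Carlo estimate: sample from the STUDENT, score each
# sampled token as log s(y) - log t(y); the average over many samples
# approaches KL(student || teacher) without enumerating the vocabulary.
n = 50_000
mc = sum(
    math.log(student[y]) - math.log(teacher[y])
    for y in (sample(student, rng) for _ in range(n))
) / n

exact = reverse_kl(student, teacher)
```

In an actual training run the per‑token scores would come from forward passes of the two models over student‑sampled trajectories, with the teacher's logits conditioned on the extra context.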
Results & Findings
| Task | Baseline (e.g., standard fine‑tuning) | OPCD | Accuracy Δ | OOD Retention |
|---|---|---|---|---|
| Math reasoning (MATH) | 71.2% | 78.5% | +7.3 pts | No drop (≈71% vs 71.2%) |
| Text‑based game (Jericho) | 62.4% | 68.9% | +6.5 pts | Slight improvement |
| Domain‑specific QA (Legal) | 68.0% | 74.3% | +6.3 pts | Maintained 66% vs 68% baseline |
- Cross‑size distillation: A 1.3B student distilled from a 13B teacher achieved 75% of the teacher’s performance on the math benchmark, while a vanilla 1.3B model lagged at 62%.
- Prompt‑free inference: After system‑prompt distillation, the student matched the teacher’s prompt‑augmented performance without needing the prompt at runtime, cutting inference latency by ~30%.
- OOD robustness: Unlike aggressive fine‑tuning, OPCD preserved the model’s ability to answer unrelated queries, confirming that the distilled knowledge integrates rather than overwrites existing capabilities.
Practical Implications
- Smaller Deployments: Companies can ship compact models that still carry the “experience” of larger, more expensive systems—useful for edge devices, mobile apps, or cost‑sensitive SaaS.
- Prompt‑Engineering Savings: Once a high‑performing prompt is discovered (often via costly RLHF or manual tuning), OPCD can bake that behavior into the model, eliminating runtime prompt handling and reducing latency.
- Continuous Learning Pipelines: Teams can let production models log their own solution traces (e.g., bug‑fix suggestions, code completions) and periodically run OPCD to internalize successful patterns, creating a self‑improving loop without external data curation.
- Domain Adaptation: For regulated industries (finance, healthcare, law), OPCD offers a way to embed proprietary knowledge bases into the model while keeping the base model’s general language abilities intact.
- Simplified Inference Stack: By removing the need for external context (prompts, retrieval modules), OPCD streamlines the inference architecture, easing scaling and monitoring.
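The prompt‑baking idea can be made concrete with a toy example: freeze the behavior induced by an engineered system prompt as a teacher distribution, then fit prompt‑free student "weights" (here, just raw logits) by gradient descent on the reverse KL. This is an illustrative sketch under simplified assumptions, not the paper's implementation:

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {y: math.exp(v - m) for y, v in logits.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def kl(s, t):
    """Reverse KL: KL(s || t)."""
    return sum(p * math.log(p / t[y]) for y, p in s.items() if p > 0)

# Target behavior induced by an engineered system prompt, frozen as a
# teacher distribution. The student never sees the prompt; its "weights"
# are raw logits here. All names and numbers are illustrative.
teacher = {"yes": 0.85, "no": 0.10, "maybe": 0.05}
logits = {y: 0.0 for y in teacher}  # student starts out uniform

lr = 0.5
for _ in range(500):
    s = softmax(logits)
    d = kl(s, teacher)
    for y in logits:
        # Exact gradient of KL(softmax(logits) || teacher) w.r.t. logit y:
        # s_y * (log(s_y / t_y) - KL).
        logits[y] -= lr * s[y] * (math.log(s[y] / teacher[y]) - d)

final = softmax(logits)  # now matches the prompt-conditioned teacher
```

After training, the student reproduces the prompt‑conditioned behavior with no prompt in its input, which is the source of the latency savings described above.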
Limitations & Future Work
- Teacher Dependence: The quality of distilled knowledge hinges on the teacher’s context handling; a poorly engineered prompt or noisy historical traces can propagate errors.
- Computational Overhead: Generating on‑policy trajectories for large datasets can be costly, though still cheaper than full RLHF pipelines.
- Scope of Knowledge Transfer: OPCD excels at procedural or prompt‑driven behaviors but may struggle with highly factual, encyclopedic knowledge that requires external grounding.
- Future Directions: The authors suggest exploring multi‑teacher ensembles, adaptive KL weighting to balance preservation vs. acquisition, and integrating retrieval‑augmented generation to broaden the range of distillable knowledge.
Bottom line: On‑Policy Context Distillation offers a pragmatic bridge between the flexibility of prompting and the efficiency of compact, self‑contained models—making it a compelling tool for developers looking to embed expertise directly into their language‑model services.
Authors
- Tianzhu Ye
- Li Dong
- Xun Wu
- Shaohan Huang
- Furu Wei
Paper Information
- arXiv ID: 2602.12275v1
- Categories: cs.CL
- Published: February 12, 2026