[Paper] On-Policy Context Distillation for Language Models
Source: arXiv - 2602.12275v1
Overview
The paper introduces On‑Policy Context Distillation (OPCD), a new way to teach language models to “remember” useful knowledge that they normally only see in the prompt. By letting a model learn from its own generated outputs while being guided by a teacher that has access to richer context, OPCD lets smaller or less‑trained models internalize both factual and procedural know‑how without sacrificing their ability to handle novel inputs.
Key Contributions
- On‑Policy Distillation for LMs – Merges the classic on‑policy RL distillation idea with context‑based teaching, training the student on its own trajectories rather than static datasets.
- Reverse KL Objective – Uses a reverse Kullback‑Leibler loss to align the student’s distribution with a context‑conditioned teacher, encouraging the student to adopt the teacher’s “thought process.”
- Experiential Knowledge Distillation – Shows how models can extract and consolidate reusable knowledge from their own past solution traces (e.g., previous math steps, game moves).
- System Prompt Distillation – Demonstrates that optimized prompts (often hand‑crafted or discovered via prompt‑engineering) can be baked into the model weights, removing the need for external prompting at inference time.
- Cross‑Size Distillation – Validates that a compact student can inherit experiential knowledge from a much larger teacher, enabling efficient model deployment.
- Broad Empirical Coverage – Benchmarks across mathematical reasoning, text‑based games, and domain‑specific tasks, consistently beating strong baselines while preserving out‑of‑distribution (OOD) performance.
Methodology
- Teacher & Student Setup – The teacher model receives the full context (e.g., a prompt plus any external knowledge) and generates a probability distribution over the next token. The student only sees the prompt (no extra context).
- On‑Policy Trajectory Generation – The student samples its own output sequences (its “policy”) on the training data. These self‑generated trajectories become the training examples.
- Reverse KL Distillation – For each student‑generated token, the loss is the reverse KL divergence KL(student || teacher). Because the expectation is taken under the student's own samples, this mode‑seeking objective penalizes the student for placing probability on tokens the context‑conditioned teacher deems unlikely, effectively teaching it to mimic the teacher's reasoning despite the missing context.
- Iterative Refinement – The process repeats: the student improves, generates better trajectories, and the teacher (fixed or slowly updated) continues to provide contextual guidance.
- Applications –
- Experiential Knowledge: The teacher is a version of the model that has access to its own historical solution traces; the student learns to embed those traces into its parameters.
- System Prompt: The teacher is prompted with an engineered prompt that yields desirable behavior; the student learns to reproduce that behavior without the prompt.
The whole pipeline is lightweight: it needs only forward passes through the teacher plus standard gradient updates on the student, requires no external reward model, and runs on standard GPU clusters.
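The reverse‑KL objective admits a simple on‑policy Monte Carlo form: sample tokens from the student and score each as log s(y) − log t(y); the average converges to KL(student || teacher) in expectation. The sketch below illustrates this with toy next‑token distributions in pure Python — all names and probabilities are illustrative, not taken from the paper:

```python
import math
import random

def reverse_kl(student, teacher):
    """Exact reverse KL: KL(student || teacher)."""
    return sum(p * math.log(p / teacher[y]) for y, p in student.items() if p > 0)

def sample(dist, rng):
    """Draw one token from a {token: probability} distribution."""
    r, acc = rng.random(), 0.0
    for y, p in dist.items():
        acc += p
        if r < acc:
            return y
    return y  # guard against floating-point rounding

# Toy next-token distributions: the teacher conditions on extra context
# (it "knows" the answer); the student sees only the bare prompt.
teacher = {"paris": 0.90, "london": 0.05, "rome": 0.05}
student = {"paris": 0.40, "london": 0.30, "rome": 0.30}

rng = random.Random(0)

# On-policy Monte Carlo estimate: sample from the STUDENT, score each
# sampled token as log s(y) - log t(y); the average over many samples
# approaches KL(student || teacher) without enumerating the vocabulary.
n = 50_000
mc = sum(
    math.log(student[y]) - math.log(teacher[y])
    for y in (sample(student, rng) for _ in range(n))
) / n

exact = reverse_kl(student, teacher)
```

In an actual training run the per‑token scores would come from forward passes of the two models over student‑sampled trajectories, with the teacher's logits conditioned on the extra context.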
Results & Findings
| Task | Baseline (e.g., standard fine‑tuning) | OPCD | Accuracy Δ | OOD Retention |
|---|---|---|---|---|
| Math reasoning (MATH) | 71.2% | 78.5% | +7.3 pts | No drop (≈71% vs 71.2%) |
| Text‑based game (Jericho) | 62.4% | 68.9% | +6.5 pts | Slight improvement |
| Domain‑specific QA (Legal) | 68.0% | 74.3% | +6.3 pts | Maintained 66% vs 68% baseline |
- Cross‑size distillation: A 1.3B student distilled from a 13B teacher achieved 75% of the teacher’s performance on the math benchmark, while a vanilla 1.3B model lagged at 62%.
- Prompt‑free inference: After system‑prompt distillation, the student matched the teacher’s prompt‑augmented performance without needing the prompt at runtime, cutting inference latency by ~30%.
- OOD robustness: Unlike aggressive fine‑tuning, OPCD preserved the model’s ability to answer unrelated queries, confirming that the distilled knowledge integrates rather than overwrites existing capabilities.
Practical Implications
- Smaller Deployments: Companies can ship compact models that still carry the “experience” of larger, more expensive systems—useful for edge devices, mobile apps, or cost‑sensitive SaaS.
- Prompt‑Engineering Savings: Once a high‑performing prompt is discovered (often via costly RLHF or manual tuning), OPCD can bake that behavior into the model, eliminating runtime prompt handling and reducing latency.
- Continuous Learning Pipelines: Teams can let production models log their own solution traces (e.g., bug‑fix suggestions, code completions) and periodically run OPCD to internalize successful patterns, creating a self‑improving loop without external data curation.
- Domain Adaptation: For regulated industries (finance, healthcare, law), OPCD offers a way to embed proprietary knowledge bases into the model while keeping the base model’s general language abilities intact.
- Simplified Inference Stack: By removing the need for external context (prompts, retrieval modules), OPCD streamlines the inference architecture, easing scaling and monitoring.
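The prompt‑baking idea can be made concrete with a toy example: freeze the behavior induced by an engineered system prompt as a teacher distribution, then fit prompt‑free student "weights" (here, just raw logits) by gradient descent on the reverse KL. This is an illustrative sketch under simplified assumptions, not the paper's implementation:

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {y: math.exp(v - m) for y, v in logits.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def kl(s, t):
    """Reverse KL: KL(s || t)."""
    return sum(p * math.log(p / t[y]) for y, p in s.items() if p > 0)

# Target behavior induced by an engineered system prompt, frozen as a
# teacher distribution. The student never sees the prompt; its "weights"
# are raw logits here. All names and numbers are illustrative.
teacher = {"yes": 0.85, "no": 0.10, "maybe": 0.05}
logits = {y: 0.0 for y in teacher}  # student starts out uniform

lr = 0.5
for _ in range(500):
    s = softmax(logits)
    d = kl(s, teacher)
    for y in logits:
        # Exact gradient of KL(softmax(logits) || teacher) w.r.t. logit y:
        # s_y * (log(s_y / t_y) - KL).
        logits[y] -= lr * s[y] * (math.log(s[y] / teacher[y]) - d)

final = softmax(logits)  # now matches the prompt-conditioned teacher
```

After training, the student reproduces the prompt‑conditioned behavior with no prompt in its input, which is the source of the latency savings described above.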
Limitations & Future Work
- Teacher Dependence: The quality of distilled knowledge hinges on the teacher’s context handling; a poorly engineered prompt or noisy historical traces can propagate errors.
- Computational Overhead: Generating on‑policy trajectories for large datasets can be costly, though still cheaper than full RLHF pipelines.
- Scope of Knowledge Transfer: OPCD excels at procedural or prompt‑driven behaviors but may struggle with highly factual, encyclopedic knowledge that requires external grounding.
- Future Directions: The authors suggest exploring multi‑teacher ensembles, adaptive KL weighting to balance preservation vs. acquisition, and integrating retrieval‑augmented generation to broaden the range of distillable knowledge.
Bottom line: On‑Policy Context Distillation offers a pragmatic bridge between the flexibility of prompting and the efficiency of compact, self‑contained models—making it a compelling tool for developers looking to embed expertise directly into their language‑model services.
Authors
- Tianzhu Ye
- Li Dong
- Xun Wu
- Shaohan Huang
- Furu Wei
Paper Information
- arXiv ID: 2602.12275v1
- Categories: cs.CL
- Published: February 12, 2026