[Paper] On-Policy Context Distillation for Language Models

Published: February 12, 2026 at 01:58 PM EST
5 min read
Source: arXiv

Overview

The paper introduces On‑Policy Context Distillation (OPCD), a new way to teach language models to “remember” useful knowledge that they normally only see in the prompt. By letting a model learn from its own generated outputs while being guided by a teacher that has access to richer context, OPCD lets smaller or less‑trained models internalize both factual and procedural know‑how without sacrificing their ability to handle novel inputs.

Key Contributions

  • On‑Policy Distillation for LMs – Merges the classic on‑policy RL distillation idea with context‑based teaching, training the student on its own trajectories rather than static datasets.
  • Reverse KL Objective – Uses a reverse Kullback‑Leibler loss to align the student’s distribution with a context‑conditioned teacher, encouraging the student to adopt the teacher’s “thought process.”
  • Experiential Knowledge Distillation – Shows how models can extract and consolidate reusable knowledge from their own past solution traces (e.g., previous math steps, game moves).
  • System Prompt Distillation – Demonstrates that optimized prompts (often hand‑crafted or discovered via prompt‑engineering) can be baked into the model weights, removing the need for external prompting at inference time.
  • Cross‑Size Distillation – Validates that a compact student can inherit experiential knowledge from a much larger teacher, enabling efficient model deployment.
  • Broad Empirical Coverage – Benchmarks across mathematical reasoning, text‑based games, and domain‑specific tasks, consistently beating strong baselines while preserving out‑of‑distribution (OOD) performance.

Methodology

  1. Teacher & Student Setup – The teacher model receives the full context (e.g., a prompt plus any external knowledge) and generates a probability distribution over the next token. The student only sees the prompt (no extra context).
  2. On‑Policy Trajectory Generation – The student samples its own output sequences (its “policy”) on the training data. These self‑generated trajectories become the training examples.
  3. Reverse KL Distillation – For each student‑generated token, the loss is the reverse KL divergence KL(student || teacher). Because the expectation is taken under the student’s own samples, this objective is mode‑seeking: the student is penalized for placing probability mass on tokens the context‑conditioned teacher deems unlikely, effectively teaching the student to mimic the teacher’s reasoning despite the missing context.
  4. Iterative Refinement – The process repeats: the student improves, generates better trajectories, and the teacher (fixed or slowly updated) continues to provide the contextual guidance.
  5. Applications
    • Experiential Knowledge: The teacher is a version of the model that has access to its own historical solution traces; the student learns to embed those traces into its parameters.
    • System Prompt: The teacher is prompted with an engineered prompt that yields desirable behavior; the student learns to reproduce that behavior without the prompt.

The whole pipeline is lightweight: it requires only forward passes through the teacher and student, no external reward models, and can be run on standard GPU clusters.
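As a rough sketch (not the authors’ code), one distillation step can be written as follows; `student_logits` and `teacher_logits` are hypothetical per‑token logit arrays, with the teacher’s logits produced from an input that includes the extra context:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def reverse_kl_loss(student_logits, teacher_logits):
    """Per-token reverse KL, KL(student || teacher), averaged over the sequence.

    Both arrays have shape (seq_len, vocab). Because the expectation is taken
    under the student's own distribution (on-policy samples), the objective is
    mode-seeking: the student is pushed off tokens the teacher finds unlikely.
    """
    log_p_student = log_softmax(student_logits)
    log_p_teacher = log_softmax(teacher_logits)
    p_student = np.exp(log_p_student)
    kl_per_token = (p_student * (log_p_student - log_p_teacher)).sum(axis=-1)
    return kl_per_token.mean()
```

In a real training loop the student first samples a trajectory, both models are run over that trajectory in one forward pass each, and the gradient flows only through the student’s logits; the teacher stays frozen (or is updated slowly).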

Results & Findings

| Task | Baseline (standard fine‑tuning) | OPCD | Accuracy Δ | OOD Retention |
|---|---|---|---|---|
| Math reasoning (MATH) | 71.2% | 78.5% | +7.3 pts | No drop (≈71% vs 71.2%) |
| Text‑based game (Jericho) | 62.4% | 68.9% | +6.5 pts | Slight improvement |
| Domain‑specific QA (Legal) | 68.0% | 74.3% | +6.3 pts | Maintained 66% vs 68% baseline |
  • Cross‑size distillation: A 1.3B student distilled from a 13B teacher achieved 75% of the teacher’s performance on the math benchmark, while a vanilla 1.3B model lagged at 62%.
  • Prompt‑free inference: After system‑prompt distillation, the student matched the teacher’s prompt‑augmented performance without needing the prompt at runtime, cutting inference latency by ~30%.
  • OOD robustness: Unlike aggressive fine‑tuning, OPCD preserved the model’s ability to answer unrelated queries, confirming that the distilled knowledge integrates rather than overwrites existing capabilities.
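To make the system‑prompt distillation setup concrete, here is a minimal sketch of how the two training views might be assembled (the function and prompt strings are illustrative assumptions, not taken from the paper):

```python
def make_distillation_inputs(user_prompt, system_prompt):
    """Build the two input views used in system-prompt distillation.

    The teacher is conditioned on the engineered system prompt plus the user
    prompt; the student sees only the user prompt. Matching the teacher's
    token distribution on the student's own samples bakes the prompt's
    behavior into the student's weights, so no prompt is needed at inference.
    """
    teacher_input = f"{system_prompt}\n\n{user_prompt}"
    student_input = user_prompt
    return teacher_input, student_input

# Hypothetical example pair for a legal-QA teacher/student run.
teacher_in, student_in = make_distillation_inputs(
    user_prompt="Summarize the contract clause below.",
    system_prompt="You are a careful legal assistant. Cite clause numbers.",
)
```

At runtime the distilled student is called with `student_input` alone, which is where the reported ~30% latency saving comes from: the engineered prompt tokens no longer have to be processed on every request.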

Practical Implications

  • Smaller Deployments: Companies can ship compact models that still carry the “experience” of larger, more expensive systems—useful for edge devices, mobile apps, or cost‑sensitive SaaS.
  • Prompt‑Engineering Savings: Once a high‑performing prompt is discovered (often via costly RLHF or manual tuning), OPCD can bake that behavior into the model, eliminating runtime prompt handling and reducing latency.
  • Continuous Learning Pipelines: Teams can let production models log their own solution traces (e.g., bug‑fix suggestions, code completions) and periodically run OPCD to internalize successful patterns, creating a self‑improving loop without external data curation.
  • Domain Adaptation: For regulated industries (finance, healthcare, law), OPCD offers a way to embed proprietary knowledge bases into the model while keeping the base model’s general language abilities intact.
  • Simplified Inference Stack: By removing the need for external context (prompts, retrieval modules), OPCD streamlines the inference architecture, easing scaling and monitoring.

Limitations & Future Work

  • Teacher Dependence: The quality of distilled knowledge hinges on the teacher’s context handling; a poorly engineered prompt or noisy historical traces can propagate errors.
  • Computational Overhead: Generating on‑policy trajectories for large datasets can be costly, though still cheaper than full RLHF pipelines.
  • Scope of Knowledge Transfer: OPCD excels at procedural or prompt‑driven behaviors but may struggle with highly factual, encyclopedic knowledge that requires external grounding.
  • Future Directions: The authors suggest exploring multi‑teacher ensembles, adaptive KL weighting to balance preservation vs. acquisition, and integrating retrieval‑augmented generation to broaden the range of distillable knowledge.

Bottom line: On‑Policy Context Distillation offers a pragmatic bridge between the flexibility of prompting and the efficiency of compact, self‑contained models—making it a compelling tool for developers looking to embed expertise directly into their language‑model services.

Authors

  • Tianzhu Ye
  • Li Dong
  • Xun Wu
  • Shaohan Huang
  • Furu Wei

Paper Information

  • arXiv ID: 2602.12275v1
  • Categories: cs.CL
  • Published: February 12, 2026
