[Paper] Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning

Published: February 24, 2026 at 12:03 PM EST

Source: arXiv - 2602.21103v1

Overview

The paper proposes Prompt‑Level Distillation (PLD), a non‑parametric technique that transfers reasoning capabilities from a large “teacher” LLM to a much smaller “student” model by encoding the teacher’s chain‑of‑thought logic into a set of expressive system‑prompt instructions. PLD delivers near‑state‑of‑the‑art accuracy on reasoning benchmarks while keeping inference latency and hardware requirements low enough for edge devices and high‑throughput services.

Key Contributions

  • Non‑parametric distillation: Instead of fine‑tuning model weights, PLD extracts reasoning patterns as natural‑language instructions, preserving the student model’s original parameters.
  • Compact reasoning prompt: The distilled instruction list replaces costly chain‑of‑thought prompting, yielding negligible extra latency.
  • Strong empirical gains: A 4 B‑parameter Gemma‑3 model improves from 57% to 90% macro‑F1 on StereoSet and from 67% to 83% on Contract‑NLI.
  • Interpretability by design: The instruction set is human‑readable, enabling full auditability of the model’s decision logic—crucial for regulated domains.
  • Zero‑training overhead: PLD requires only a single pass over teacher outputs, avoiding the compute‑intensive fine‑tuning pipeline.

Methodology

  1. Teacher reasoning extraction – A large, high‑performing LLM (the “teacher”) solves a set of labeled examples using chain‑of‑thought prompting. Its step‑by‑step rationales are collected.
  2. Pattern mining & abstraction – The rationales are parsed to identify recurring logical constructs (e.g., “if X contains Y, then …”, “compare numeric values”, “lookup definition”). These constructs are generalized into concise natural‑language instructions.
  3. System‑prompt assembly – The distilled instructions are concatenated into a single system prompt that is fed to the student model before any user query. The prompt acts as a static “reasoning engine” that the student follows when generating answers.
  4. Inference – At test time the student receives the user query plus the pre‑computed system prompt; no additional chain‑of‑thought steps are needed, so inference is a single forward pass.

The process is fully non‑parametric: the student’s weights stay unchanged, and the only “model‑specific” artifact is the prompt text.
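The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's actual implementation: the function names, the sentence‑frequency heuristic for pattern mining, and the stubbed student callable are all assumptions made for the example.

```python
import re
from collections import Counter

def mine_patterns(rationales, min_count=2):
    # Step 2 (toy heuristic): treat sentences that recur across several
    # teacher rationales as candidate reusable reasoning instructions.
    counts = Counter()
    for rationale in rationales:
        for sent in re.split(r"(?<=[.])\s+", rationale.strip()):
            counts[sent] += 1
    return [s for s, c in counts.most_common() if c >= min_count]

def assemble_system_prompt(instructions):
    # Step 3: concatenate the distilled instructions into one static
    # system prompt that precedes every user query.
    lines = ["Follow these reasoning rules:"]
    lines += [f"{i + 1}. {ins}" for i, ins in enumerate(instructions)]
    return "\n".join(lines)

def answer(student_llm, system_prompt, query):
    # Step 4: a single forward pass -- no chain-of-thought at test time.
    return student_llm(f"{system_prompt}\n\nUser: {query}\nAnswer:")

# Usage with a stand-in student model (any text -> text callable works):
rationales = [
    "If X contains Y, flag it. Compare numeric values.",
    "If X contains Y, flag it. Lookup the definition.",
]
prompt = assemble_system_prompt(mine_patterns(rationales))
print(answer(lambda p: "yes", prompt, "Does clause A contain term B?"))
```

Note that the student model appears only as an opaque callable: the distilled artifact is the prompt text itself, which is what makes the method non‑parametric.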

Results & Findings

| Dataset | Teacher (CoT) | Student (Gemma‑3 4B), baseline | Student + PLD | Macro‑F1 ↑ |
|---|---|---|---|---|
| StereoSet | 94% | 57% | 90% | +33 pp |
| Contract‑NLI | 88% | 67% | 83% | +16 pp |
  • Latency: Adding the PLD prompt contributes less than 5 ms of overhead to typical CPU inference, versus more than 200 ms for full chain‑of‑thought generation.
  • Parameter efficiency: The 4 B model with PLD matches or exceeds the performance of 13 B‑plus models that rely on CoT prompting.
  • Transparency: Human reviewers could read the distilled instruction list and verify that each decision aligns with the intended logical flow, something that is opaque in standard fine‑tuned models.
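Macro‑F1, the metric reported above, is the unweighted mean of per‑class F1 scores, so every class counts equally regardless of how often it occurs. A small self‑contained computation (the labels and predictions are made up for illustration):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Class "a": F1 = 2/3; class "b": F1 = 4/5; macro-F1 = 11/15 ≈ 0.733
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], ["a", "b"]))
```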

Practical Implications

  • Edge & low‑resource deployment: Developers can ship a 4 B model to mobile or IoT devices and still achieve high‑quality reasoning without the memory/compute budget of a giant LLM.
  • Regulated industries: The human‑readable prompt satisfies audit requirements for law, finance, and content moderation, enabling “explain‑by‑prompt” compliance checks.
  • High‑throughput services: SaaS platforms can serve millions of requests per second with a single forward pass per query, dramatically cutting cloud‑GPU costs.
  • Rapid domain adaptation: Updating the reasoning logic is as simple as editing the instruction list—no retraining, no hyper‑parameter tuning, and no risk of catastrophic forgetting.

Limitations & Future Work

  • Prompt length constraints: Very complex domains may require longer instruction sets that approach model context limits, potentially necessitating prompt‑compression techniques.
  • Teacher quality dependence: The distilled logic is only as good as the teacher’s chain‑of‑thought outputs; systematic teacher errors can propagate into the prompt.
  • Generalization to unseen tasks: PLD has been evaluated on two reasoning benchmarks; broader validation on diverse NLP tasks (e.g., multi‑hop QA, code generation) is needed.
  • Automation of pattern mining: Current extraction relies on heuristic parsing; future work could explore learned or LLM‑assisted pattern discovery to reduce manual effort.
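When the instruction set approaches the student's context limit, one simple fallback is to keep only the highest‑priority instructions that fit a token budget. The sketch below is an assumption of ours, not a technique from the paper: the word‑count token estimate and the premise that instructions arrive pre‑sorted by importance are both illustrative.

```python
def fit_to_budget(instructions, max_tokens=1024, tokens_per_word=1.3):
    """Keep a prefix of importance-sorted instructions within an
    estimated token budget (crude words-to-tokens heuristic)."""
    kept, used = [], 0
    for ins in instructions:
        cost = int(len(ins.split()) * tokens_per_word) + 1
        if used + cost > max_tokens:
            break
        kept.append(ins)
        used += cost
    return kept
```

Real deployments would likely use the model's own tokenizer for exact counts, or the learned prompt‑compression methods the authors allude to.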

Prompt‑Level Distillation offers a pragmatic middle ground between heavyweight fine‑tuning and costly chain‑of‑thought prompting, giving developers a tool to unlock strong reasoning in compact models while keeping the process transparent and operationally lightweight.

Authors

  • Sanket Badhe
  • Deep Shah

Paper Information

  • arXiv ID: 2602.21103v1
  • Categories: cs.CL, cs.IR
  • Published: February 24, 2026
  • PDF: Download PDF