[Paper] Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Source: arXiv - 2602.21103v1
Overview
The paper proposes Prompt‑Level Distillation (PLD), a non‑parametric technique that transfers reasoning capabilities from a large “teacher” LLM to a much smaller “student” model by encoding the teacher’s chain‑of‑thought logic into a set of expressive system‑prompt instructions. PLD delivers near‑state‑of‑the‑art accuracy on reasoning benchmarks while keeping inference latency and hardware requirements low enough for edge devices and high‑throughput services.
Key Contributions
- Non‑parametric distillation: Instead of fine‑tuning model weights, PLD extracts reasoning patterns as natural‑language instructions, preserving the student model’s original parameters.
- Compact reasoning prompt: The distilled instruction list replaces costly chain‑of‑thought prompting, yielding negligible extra latency.
- Strong empirical gains: On StereoSet and Contract‑NLI, a 4 B‑parameter Gemma‑3 model jumps from 57 % → 90 % and 67 % → 83 % macro‑F1, respectively.
- Interpretability by design: The instruction set is human‑readable, enabling full auditability of the model’s decision logic—crucial for regulated domains.
- Zero‑training overhead: PLD requires only a single pass over teacher outputs, avoiding the compute‑intensive fine‑tuning pipeline.
Methodology
- Teacher reasoning extraction – A large, high‑performing LLM (the “teacher”) solves a set of labeled examples using chain‑of‑thought prompting. Its step‑by‑step rationales are collected.
- Pattern mining & abstraction – The rationales are parsed to identify recurring logical constructs (e.g., “if X contains Y, then …”, “compare numeric values”, “lookup definition”). These constructs are generalized into concise natural‑language instructions.
- System‑prompt assembly – The distilled instructions are concatenated into a single system prompt that is fed to the student model before any user query. The prompt acts as a static “reasoning engine” that the student follows when generating answers.
- Inference – At test time the student receives the user query plus the pre‑computed system prompt; no additional chain‑of‑thought steps are needed, so inference is a single forward pass.
The process is fully non‑parametric: the student’s weights stay unchanged, and the only “model‑specific” artifact is the prompt text.
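The pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the rationales are hard-coded stand-ins for collected teacher outputs, and `mine_patterns` uses a simple frequency heuristic in place of the paper's parsing of logical constructs.

```python
import re
from collections import Counter

def mine_patterns(rationales, min_count=2):
    """Heuristic pattern mining (illustrative): keep reasoning steps
    that recur across multiple teacher rationales."""
    steps = []
    for r in rationales:
        # Split each chain-of-thought rationale into individual steps.
        steps.extend(s.strip() for s in re.split(r"[.\n]", r) if s.strip())
    counts = Counter(steps)
    return [step for step, c in counts.items() if c >= min_count]

def assemble_system_prompt(instructions):
    """Concatenate the distilled instructions into one static system
    prompt that is prepended to every student query."""
    header = "Follow these reasoning rules when answering:\n"
    return header + "\n".join(f"- {i}" for i in instructions)

# Mock teacher rationales; in the paper these come from a large LLM
# solving labeled examples with chain-of-thought prompting.
rationales = [
    "Check whether the hypothesis contradicts any clause. Compare numeric values",
    "Check whether the hypothesis contradicts any clause. Lookup the definition of key terms",
]

prompt = assemble_system_prompt(mine_patterns(rationales))
print(prompt)
```

At inference time the student simply receives `prompt` plus the user query; nothing in the student's weights changes, which is what makes the method non-parametric.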
Results & Findings
| Dataset | Teacher (CoT) | Student (Gemma‑3 4B) – Baseline | Student + PLD | Gain (pp) |
|---|---|---|---|---|
| StereoSet | 94 % | 57 % | 90 % | +33 pp |
| Contract‑NLI | 88 % | 67 % | 83 % | +16 pp |
- Latency: The PLD prompt adds < 5 ms of overhead for typical CPU inference, compared with > 200 ms of extra generation time for full chain‑of‑thought prompting.
- Parameter efficiency: The 4 B model with PLD matches or exceeds the performance of 13 B‑plus models that rely on CoT prompting.
- Transparency: Human reviewers could read the distilled instruction list and verify that each decision aligns with the intended logical flow, something that is opaque in standard fine‑tuned models.
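For readers unfamiliar with the reported metric: macro-F1 is the unweighted mean of per-class F1 scores, so minority classes count as much as majority ones. A small worked example (the counts below are illustrative, not from the paper):

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(per_class_counts):
    """Macro-F1: unweighted average of per-class F1 scores."""
    scores = [f1(*counts) for counts in per_class_counts]
    return sum(scores) / len(scores)

# Toy two-class example: (tp, fp, fn) per class.
counts = [(45, 5, 5), (40, 5, 10)]
print(round(macro_f1(counts), 3))  # → 0.871
```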
Practical Implications
- Edge & low‑resource deployment: Developers can ship a 4 B model to mobile or IoT devices and still achieve high‑quality reasoning without the memory/compute budget of a giant LLM.
- Regulated industries: The human‑readable prompt satisfies audit requirements for law, finance, and content moderation, enabling “explain‑by‑prompt” compliance checks.
- High‑throughput services: SaaS platforms can serve millions of requests per second with a single forward pass per query, dramatically cutting cloud‑GPU costs.
- Rapid domain adaptation: Updating the reasoning logic is as simple as editing the instruction list—no retraining, no hyper‑parameter tuning, and no risk of catastrophic forgetting.
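The domain-adaptation point can be made concrete with a sketch: because the reasoning logic lives in the prompt rather than the weights, switching domains is a dictionary edit. All names below (`DOMAIN_INSTRUCTIONS`, `build_input`) are hypothetical, not from the paper.

```python
# Hypothetical sketch: per-domain instruction lists, hot-swappable
# without retraining -- the student model's weights never change.
DOMAIN_INSTRUCTIONS = {
    "contracts": ["Identify the governing clause", "Check for contradictions"],
    "moderation": ["Flag stereotyping language", "Cite the violated rule"],
}

def build_input(domain, user_query):
    """Prepend the domain's distilled instructions to the user query."""
    rules = DOMAIN_INSTRUCTIONS[domain]  # edit this dict to update the logic
    system = "Reasoning rules:\n" + "\n".join(
        f"{i + 1}. {rule}" for i, rule in enumerate(rules)
    )
    return f"{system}\n\nUser: {user_query}"

print(build_input("contracts", "Does clause 4 permit early termination?"))
```

Updating the logic means editing the instruction list and redeploying the text, which also sidesteps catastrophic forgetting entirely, since no gradient updates ever touch the model.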
Limitations & Future Work
- Prompt length constraints: Very complex domains may require longer instruction sets that approach model context limits, potentially necessitating prompt‑compression techniques.
- Teacher quality dependence: The distilled logic is only as good as the teacher’s chain‑of‑thought outputs; systematic teacher errors can propagate into the prompt.
- Generalization to unseen tasks: PLD has been evaluated on two reasoning benchmarks; broader validation on diverse NLP tasks (e.g., multi‑hop QA, code generation) is needed.
- Automation of pattern mining: Current extraction relies on heuristic parsing; future work could explore learned or LLM‑assisted pattern discovery to reduce manual effort.
Prompt‑Level Distillation offers a pragmatic middle ground between heavyweight fine‑tuning and costly chain‑of‑thought prompting, giving developers a tool to unlock strong reasoning in compact models while keeping the process transparent and operationally lightweight.
Authors
- Sanket Badhe
- Deep Shah
Paper Information
- arXiv ID: 2602.21103v1
- Categories: cs.CL, cs.IR
- Published: February 24, 2026