[Paper] Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Source: arXiv - 2602.21103v1
Overview
The paper proposes Prompt‑Level Distillation (PLD), a non‑parametric technique that transfers reasoning capabilities from a large “teacher” LLM to a much smaller “student” model by encoding the teacher’s chain‑of‑thought logic into a set of expressive system‑prompt instructions. PLD delivers near‑state‑of‑the‑art accuracy on reasoning benchmarks while keeping inference latency and hardware requirements low enough for edge devices and high‑throughput services.
Key Contributions
- Non‑parametric distillation: Instead of fine‑tuning model weights, PLD extracts reasoning patterns as natural‑language instructions, preserving the student model’s original parameters.
- Compact reasoning prompt: The distilled instruction list replaces costly chain‑of‑thought prompting, yielding negligible extra latency.
- Strong empirical gains: On StereoSet and Contract‑NLI, a 4 B‑parameter Gemma‑3 model jumps from 57 % → 90 % and 67 % → 83 % macro‑F1, respectively.
- Interpretability by design: The instruction set is human‑readable, enabling full auditability of the model’s decision logic—crucial for regulated domains.
- Zero‑training overhead: PLD requires only a single pass over teacher outputs, avoiding the compute‑intensive fine‑tuning pipeline.
Methodology
- Teacher reasoning extraction – A large, high‑performing LLM (the “teacher”) solves a set of labeled examples using chain‑of‑thought prompting. Its step‑by‑step rationales are collected.
- Pattern mining & abstraction – The rationales are parsed to identify recurring logical constructs (e.g., “if X contains Y, then …”, “compare numeric values”, “lookup definition”). These constructs are generalized into concise natural‑language instructions.
- System‑prompt assembly – The distilled instructions are concatenated into a single system prompt that is fed to the student model before any user query. The prompt acts as a static “reasoning engine” that the student follows when generating answers.
- Inference – At test time the student receives the user query plus the pre‑computed system prompt; no additional chain‑of‑thought steps are needed, so inference is a single forward pass.
The process is fully non‑parametric: the student’s weights stay unchanged, and the only “model‑specific” artifact is the prompt text.
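The pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the rationales are hard-coded stand-ins for collected teacher outputs, and `mine_patterns` uses a simple frequency heuristic in place of the paper's parsing of logical constructs.

```python
import re
from collections import Counter

def mine_patterns(rationales, min_count=2):
    """Heuristic pattern mining (illustrative): keep reasoning steps
    that recur across multiple teacher rationales."""
    steps = []
    for r in rationales:
        # Split each chain-of-thought rationale into individual steps.
        steps.extend(s.strip() for s in re.split(r"[.\n]", r) if s.strip())
    counts = Counter(steps)
    return [step for step, c in counts.items() if c >= min_count]

def assemble_system_prompt(instructions):
    """Concatenate the distilled instructions into one static system
    prompt that is prepended to every student query."""
    header = "Follow these reasoning rules when answering:\n"
    return header + "\n".join(f"- {i}" for i in instructions)

# Mock teacher rationales; in the paper these come from a large LLM
# solving labeled examples with chain-of-thought prompting.
rationales = [
    "Check whether the hypothesis contradicts any clause. Compare numeric values",
    "Check whether the hypothesis contradicts any clause. Lookup the definition of key terms",
]

prompt = assemble_system_prompt(mine_patterns(rationales))
print(prompt)
```

At inference time the student simply receives `prompt` plus the user query; nothing in the student's weights changes, which is what makes the method non-parametric.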
Results & Findings
| Dataset | Teacher (CoT) | Student (Gemma‑3 4B) – Baseline | Student + PLD | Gain (pp) |
|---|---|---|---|---|
| StereoSet | 94 % | 57 % | 90 % | +33 pp |
| Contract‑NLI | 88 % | 67 % | 83 % | +16 pp |
- Latency: The PLD prompt adds < 5 ms of overhead for typical CPU inference, compared with > 200 ms of extra generation time for full chain‑of‑thought prompting.
- Parameter efficiency: The 4 B model with PLD matches or exceeds the performance of 13 B‑plus models that rely on CoT prompting.
- Transparency: Human reviewers could read the distilled instruction list and verify that each decision aligns with the intended logical flow, something that is opaque in standard fine‑tuned models.
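For readers unfamiliar with the reported metric: macro-F1 is the unweighted mean of per-class F1 scores, so minority classes count as much as majority ones. A small worked example (the counts below are illustrative, not from the paper):

```python
def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(per_class_counts):
    """Macro-F1: unweighted average of per-class F1 scores."""
    scores = [f1(*counts) for counts in per_class_counts]
    return sum(scores) / len(scores)

# Toy two-class example: (tp, fp, fn) per class.
counts = [(45, 5, 5), (40, 5, 10)]
print(round(macro_f1(counts), 3))  # → 0.871
```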
Practical Implications
- Edge & low‑resource deployment: Developers can ship a 4 B model to mobile or IoT devices and still achieve high‑quality reasoning without the memory/compute budget of a giant LLM.
- Regulated industries: The human‑readable prompt satisfies audit requirements for law, finance, and content moderation, enabling “explain‑by‑prompt” compliance checks.
- High‑throughput services: SaaS platforms can serve millions of requests per second with a single forward pass per query, dramatically cutting cloud‑GPU costs.
- Rapid domain adaptation: Updating the reasoning logic is as simple as editing the instruction list—no retraining, no hyper‑parameter tuning, and no risk of catastrophic forgetting.
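The domain-adaptation point can be made concrete with a sketch: because the reasoning logic lives in the prompt rather than the weights, switching domains is a dictionary edit. All names below (`DOMAIN_INSTRUCTIONS`, `build_input`) are hypothetical, not from the paper.

```python
# Hypothetical sketch: per-domain instruction lists, hot-swappable
# without retraining -- the student model's weights never change.
DOMAIN_INSTRUCTIONS = {
    "contracts": ["Identify the governing clause", "Check for contradictions"],
    "moderation": ["Flag stereotyping language", "Cite the violated rule"],
}

def build_input(domain, user_query):
    """Prepend the domain's distilled instructions to the user query."""
    rules = DOMAIN_INSTRUCTIONS[domain]  # edit this dict to update the logic
    system = "Reasoning rules:\n" + "\n".join(
        f"{i + 1}. {rule}" for i, rule in enumerate(rules)
    )
    return f"{system}\n\nUser: {user_query}"

print(build_input("contracts", "Does clause 4 permit early termination?"))
```

Updating the logic means editing the instruction list and redeploying the text, which also sidesteps catastrophic forgetting entirely, since no gradient updates ever touch the model.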
Limitations & Future Work
- Prompt length constraints: Very complex domains may require longer instruction sets that approach model context limits, potentially necessitating prompt‑compression techniques.
- Teacher quality dependence: The distilled logic is only as good as the teacher’s chain‑of‑thought outputs; systematic teacher errors can propagate into the prompt.
- Generalization to unseen tasks: PLD has been evaluated on two reasoning benchmarks; broader validation on diverse NLP tasks (e.g., multi‑hop QA, code generation) is needed.
- Automation of pattern mining: Current extraction relies on heuristic parsing; future work could explore learned or LLM‑assisted pattern discovery to reduce manual effort.
Prompt‑Level Distillation offers a pragmatic middle ground between heavyweight fine‑tuning and costly chain‑of‑thought prompting, giving developers a tool to unlock strong reasoning in compact models while keeping the process transparent and operationally lightweight.
Authors
- Sanket Badhe
- Deep Shah
Paper Information
- arXiv ID: 2602.21103v1
- Categories: cs.CL, cs.IR
- Published: February 24, 2026