[Paper] LLMs can Compress LLMs: Adaptive Pruning by Agents

Published: January 14, 2026 at 01:45 PM EST
4 min read

Source: arXiv - 2601.09694v1

Overview

The paper proposes a novel “agent‑guided” pruning technique that lets one large language model (LLM) act as a smart controller to compress another LLM. By using an LLM‑based agent to decide where and how much to prune, the authors achieve high sparsity (≈45 %) while preserving – and even improving – downstream performance on benchmarks such as MMLU and factual QA. The approach works without any retraining, making it attractive for developers who need lighter models for production.

Key Contributions

  • Agent‑guided pruning: Introduces an LLM‑based agent (a foundation model acting as the controller) that iteratively selects per‑layer sparsity ratios, replacing hand‑crafted heuristics.
  • Hybrid sensitivity metric: Combines Wanda‑style weight‑activation scores with gradient‑based importance, normalized as z‑scores for cross‑layer comparison (see the sketch after this list).
  • Self‑reflection & rollback: The pruning agent evaluates perplexity after each iteration, rolls back if degradation exceeds a threshold, and learns from past decisions.
  • Model‑agnostic, training‑free: Works on any decoder‑only LLM (demonstrated on Qwen‑3 4B/8B) without additional fine‑tuning.
  • Empirical gains: At ~45 % sparsity, achieves 56 % relative MMLU accuracy improvement, 19× better factual knowledge retention on FreebaseQA, and 69 % lower perplexity drop versus structured baselines.
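
A minimal sketch, assuming PyTorch and per‑layer calibration statistics, of how a hybrid sensitivity metric of this kind could be computed. The function names, the mixing weight `alpha`, and the exact way the two signals are blended are illustrative assumptions, not the authors' formulation.

```python
import torch

def wanda_score(weight: torch.Tensor, act_norm: torch.Tensor) -> torch.Tensor:
    """Wanda-style score: |W_ij| * ||X_j||, using per-input-channel activation norms."""
    return weight.abs() * act_norm.unsqueeze(0)

def gradient_score(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """First-order importance: |W_ij * dL/dW_ij| (approximate loss change if the weight is removed)."""
    return (weight * grad).abs()

def zscore(x: torch.Tensor) -> torch.Tensor:
    """Standardize a layer's scores so they can be compared across layers."""
    return (x - x.mean()) / (x.std() + 1e-8)

def hybrid_sensitivity(weight, act_norm, grad, alpha: float = 0.5) -> torch.Tensor:
    """Blend the two z-scored signals; alpha = 0.5 is an assumed mixing weight."""
    return alpha * zscore(wanda_score(weight, act_norm)) + (1 - alpha) * zscore(gradient_score(weight, grad))

# Toy usage for a single linear layer: weights, calibration activation norms, accumulated gradients.
W = torch.randn(64, 128)
X_norm = torch.rand(128)
G = torch.randn(64, 128)
scores = hybrid_sensitivity(W, X_norm, G)   # higher score = more sensitive, keep dense
```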

Methodology

  1. Sensitivity profiling – For every layer, the method computes two scores:
     • Wanda‑inspired weight‑activation magnitude (captures how much a weight contributes to activations).
     • Gradient importance (how much the loss would change if the weight were removed).
     These scores are converted to z‑scores so they can be compared across layers.

  2. LLM pruning agent – A separate LLM (the “agent”) receives the per‑layer z‑score table and a short prompt describing the current pruning state, then outputs a sparsity ratio for each layer. The agent is equipped with a self‑reflection loop: after pruning, the target model’s perplexity on a validation set is measured; if the degradation exceeds a preset threshold, the system rolls back to the previous checkpoint and the agent revises its recommendation (see the sketch after this list).

  3. Iterative pruning – The process repeats for 21–40 iterations. Each iteration prunes a small fraction of weights, evaluates, and possibly rolls back. Over time the agent “learns” which layers tolerate aggressive pruning and which need to stay dense.

  4. No retraining – The final sparse model is ready for inference directly after the pruning loop; no additional fine‑tuning or knowledge distillation is performed.
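
The loop in steps 2–4 can be pictured with the following hedged sketch, assuming a PyTorch model and the per‑layer sensitivity scores from step 1. The helper names (`query_pruning_agent`, `eval_perplexity`, `apply_sparsity`), the perplexity‑ratio threshold, and the history format are illustrative assumptions rather than the paper's exact procedure.

```python
import copy

def apply_sparsity(model, ratios, layer_scores):
    """Zero out the lowest-sensitivity weights in each named layer at the requested ratio."""
    for name, module in model.named_modules():
        if name in ratios and hasattr(module, "weight"):
            scores = layer_scores[name]
            k = int(ratios[name] * scores.numel())
            if k == 0:
                continue
            threshold = scores.flatten().kthvalue(k).values
            module.weight.data[scores <= threshold] = 0.0

def agent_guided_pruning(model, layer_scores, query_pruning_agent, eval_perplexity,
                         max_iters=40, ppl_ratio=1.05):
    """Iteratively ask the LLM agent for per-layer sparsity ratios, prune, evaluate,
    and roll back whenever validation perplexity degrades beyond the threshold."""
    history = []                                   # past decisions and outcomes, fed back to the agent
    best_ppl = eval_perplexity(model)

    for step in range(max_iters):
        checkpoint = copy.deepcopy(model.state_dict())

        # The agent sees the z-score table plus the outcomes of earlier iterations
        # and returns per-layer ratios, e.g. {"model.layers.0.mlp.down_proj": 0.2, ...}.
        ratios = query_pruning_agent(layer_scores, history)
        apply_sparsity(model, ratios, layer_scores)

        ppl = eval_perplexity(model)
        accepted = ppl <= best_ppl * ppl_ratio
        if not accepted:
            model.load_state_dict(checkpoint)      # self-reflection rollback
        else:
            best_ppl = ppl
        history.append({"ratios": ratios, "perplexity": ppl, "accepted": accepted})

    return model
```

Rejected iterations are still recorded in `history`, which is one simple way the agent could "learn" from past decisions as described in step 3.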

Results & Findings

| Metric | Structured baseline (e.g., Wanda) | Agent‑guided pruning |
|---|---|---|
| Sparsity | ~45 % | ~45 % (same) |
| MMLU accuracy | Baseline | +56 % relative improvement |
| FreebaseQA factual recall | Near‑total collapse | 19× better retention |
| Perplexity degradation | Large drop | 69 % lower degradation |
| Rollbacks needed | N/A (static) | 2–4 rollbacks across all iterations |

The agent consistently identifies “knowledge‑critical” layers (often early transformer blocks) and spares them, while aggressively pruning layers that contribute less to factual reasoning. The self‑reflection mechanism prevents catastrophic loss of language modeling ability, keeping perplexity within acceptable bounds.

Practical Implications

  • Deployable lightweight LLMs: Companies can shrink 4‑8 B‑parameter models to ~45 % sparsity without costly retraining pipelines, reducing GPU memory and latency for edge or low‑cost cloud inference.
  • Preserved factual competence: Unlike many structured pruning methods, this approach maintains the model’s ability to answer knowledge‑heavy queries, crucial for chatbots, retrieval‑augmented generation, and decision‑support tools.
  • Plug‑and‑play compression service: Because the method is model‑agnostic, a SaaS offering could accept any compatible decoder‑only LLM, run the agent‑guided pruning loop, and return a ready‑to‑serve sparse checkpoint.
  • Reduced engineering overhead: The rollback/self‑reflection loop automates hyper‑parameter tuning (how much to prune per layer), freeing developers from manual sparsity budgeting.
  • Foundation‑model‑as‑tool: Demonstrates a concrete use‑case where a powerful LLM can act as an optimizer for other models, opening doors to meta‑learning pipelines (e.g., agents that also suggest quantization or distillation strategies).

Limitations & Future Work

  • Scope limited to decoder‑only LLMs: The paper evaluates only Qwen‑3 4B/8B; applicability to encoder‑decoder or multimodal models remains untested.
  • Agent size not quantified: The pruning agent itself is an LLM; the overhead of running the agent during compression is not discussed in depth.
  • Heuristic thresholds: The perplexity rollback threshold is manually set; adaptive or learned thresholds could improve robustness.
  • Knowledge‑type bias: While factual QA improves, the impact on other tasks (e.g., reasoning, code generation) needs further study.
  • Future directions include extending the framework to multi‑objective pruning (e.g., balancing latency, memory, and accuracy), integrating quantization, and exploring self‑supervised agent training to reduce reliance on a separate foundation model.

Authors

  • Sai Varun Kodathala
  • Rakesh Vunnam

Paper Information

  • arXiv ID: 2601.09694v1
  • Categories: cs.CL, cs.AI, cs.CV
  • Published: January 14, 2026