[Paper] Making Large Language Models Efficient Dense Retrievers

Published: December 23, 2025 at 01:58 PM EST
4 min read

Source: arXiv - 2512.20612v1

Overview

Recent research shows that fine‑tuning large language models (LLMs) as dense retrievers can dramatically boost search quality, but the sheer size of these models makes them costly to run in production. This paper investigates whether the same "layer‑redundancy" tricks that work for generative LLMs also apply to retrieval‑oriented models, and proposes a practical compression pipeline, EffiR, that slashes model size and latency while keeping retrieval performance intact.

Key Contributions

  • Systematic redundancy analysis of LLM‑based dense retrievers, revealing that MLP (feed‑forward) layers are highly prune‑able while attention layers remain essential.
  • EffiR framework that combines a two‑stage compression strategy:
    1. Coarse‑grained depth reduction – dropping entire MLP layers.
    2. Fine‑grained width reduction – shrinking the hidden dimension of the remaining MLPs.
  • Retrieval‑specific fine‑tuning after compression to recover any lost accuracy.
  • Extensive evaluation on the BEIR benchmark across multiple LLM backbones (e.g., LLaMA‑2, Mistral), demonstrating up to 70 % reduction in FLOPs and ≈2× faster inference with ≤1 % drop in nDCG@10.
  • Open‑source implementation and reproducibility scripts, facilitating immediate adoption by the community.

Methodology

  1. Baseline Setup – The authors start from publicly available dense retrievers built by fine‑tuning an LLM encoder (e.g., LLaMA‑2‑7B) on contrastive retrieval objectives.
  2. Layer‑wise Importance Study – Using ablation (removing one layer at a time) and sensitivity analysis (measuring gradient‑based importance), they quantify how each transformer block contributes to retrieval quality.
  3. Coarse‑to‑Fine Compression
    • Depth reduction: Entire MLP sub‑layers are pruned based on the importance scores, yielding a shallower network.
    • Width reduction: For the remaining MLPs, singular‑value decomposition (SVD) and low‑rank factorization shrink the hidden dimension, preserving most of the learned representation power (a minimal sketch of steps 2 and 3 follows this list).
  4. Retrieval‑Specific Fine‑Tuning – After compression, the model is re‑trained with the same contrastive loss, using a slightly higher learning rate for the compressed layers so they can adapt (a fine‑tuning sketch also follows the list).
  5. Evaluation – The compressed models are benchmarked on BEIR’s 18 heterogeneous retrieval tasks, measuring both effectiveness (nDCG, MAP) and efficiency (parameters, FLOPs, latency on a single GPU).
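
The summary above does not include the authors' code, but the core of steps 2 and 3 can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under assumed names (`model.layers`, a per‑block `.mlp` attribute, and an `eval_fn` that returns a dev‑set retrieval metric), not the EffiR implementation itself: ablation‑based importance scoring for the MLP sub‑layers, coarse pruning of the least important ones, and SVD‑based low‑rank factorization of the remaining MLP projections.

```python
import torch
import torch.nn as nn


class ZeroMLP(nn.Module):
    """Stand-in for a pruned MLP sub-layer: contributes nothing, so the
    block's residual connection simply carries the hidden states through."""
    def forward(self, x):
        return torch.zeros_like(x)


@torch.no_grad()
def mlp_importance(model, eval_fn):
    """Ablation-style importance: score each block's MLP by how much the
    dev-set retrieval metric (e.g., nDCG@10) drops when that MLP is skipped.
    `eval_fn(model) -> float` and `model.layers[i].mlp` are assumed names."""
    base = eval_fn(model)
    scores = []
    for block in model.layers:
        original = block.mlp
        block.mlp = ZeroMLP()
        scores.append(base - eval_fn(model))  # larger drop => more important
        block.mlp = original
    return scores


def drop_least_important_mlps(model, scores, keep_ratio=0.5):
    """Coarse-grained depth reduction: permanently remove the least
    important MLP sub-layers (attention sub-layers stay untouched)."""
    n_drop = int(len(scores) * (1 - keep_ratio))
    for idx in sorted(range(len(scores)), key=scores.__getitem__)[:n_drop]:
        model.layers[idx].mlp = ZeroMLP()


def factorize_linear(linear, rank):
    """Fine-grained width reduction: replace one nn.Linear (e.g., an MLP
    projection in a surviving block) with a low-rank SVD factorization
    such that B(A(x)) approximates linear(x)."""
    W = linear.weight.data  # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = nn.Linear(linear.in_features, rank, bias=False)
    B = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    A.weight.data = (torch.diag(S[:rank]) @ Vh[:rank, :]).contiguous()
    B.weight.data = U[:, :rank].contiguous()
    if linear.bias is not None:
        B.bias.data = linear.bias.data.clone()
    return nn.Sequential(A, B)
```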
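
Step 4 (retrieval‑specific fine‑tuning) can be sketched in the same hedged spirit: an optimizer that assigns a higher learning rate to the compressed MLP parameters, plus a standard in‑batch‑negative contrastive (InfoNCE) loss. The name‑matching rule, learning rates, and temperature below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F


def build_optimizer(model, base_lr=1e-5, compressed_lr=5e-5):
    """Two parameter groups: compressed (factorized) MLP weights get a higher
    learning rate so they can re-adapt after pruning/factorization. The
    name-based split is an assumption about how the rebuilt MLPs are named."""
    compressed, other = [], []
    for name, param in model.named_parameters():
        (compressed if ".mlp." in name else other).append(param)
    return torch.optim.AdamW([
        {"params": other, "lr": base_lr},
        {"params": compressed, "lr": compressed_lr},
    ])


def contrastive_loss(q_emb, d_emb, temperature=0.02):
    """In-batch-negative InfoNCE: query i should score its own (positive)
    document highest among all documents in the batch."""
    q_emb = F.normalize(q_emb, dim=-1)
    d_emb = F.normalize(d_emb, dim=-1)
    logits = q_emb @ d_emb.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, targets)
```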

Results & Findings

Model (backbone) | Params ↓ | FLOPs ↓ | nDCG@10 (full) | nDCG@10 (EffiR) | Speed‑up
LLaMA‑2‑7B | 7B → 2.1B (‑70 %) | 2.5× lower | 0.527 | 0.521 | ≈2.1×
Mistral‑7B | 7B → 2.3B (‑67 %) | 2.3× lower | 0.543 | 0.538 | ≈2.0×
LLaMA‑2‑13B | 13B → 4.0B (‑69 %) | 2.6× lower | 0.562 | 0.557 | ≈2.2×
  • MLP layers can be removed or heavily compressed with minimal impact on retrieval scores.
  • Attention layers are not pruned; removing them leads to >5 % nDCG loss, confirming their critical role in aggregating semantic cues across the query/document.
  • The coarse‑to‑fine approach consistently outperforms a single‑step width reduction, achieving better trade‑offs between size and accuracy.
  • Across all BEIR tasks, the average performance drop stays under 1 %, while inference latency halves on a single RTX 4090 GPU.
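
For reference, the nDCG@10 numbers above can be computed from graded relevance judgments with the textbook linear‑gain formula; BEIR's own evaluation relies on pytrec_eval, so the snippet below is only a minimal sketch of what the metric measures.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked documents (0-based ranks)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k = DCG@k of the system ranking divided by DCG@k of the ideal ranking.
    For simplicity the ideal ranking here reorders the same list; real evaluators
    build it from all judged documents for the query."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Graded relevance of the top-10 documents retrieved for one query (illustrative values):
print(ndcg_at_k([3, 2, 0, 1, 0, 0, 2, 0, 0, 0]))
```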

Practical Implications

  • Production‑ready dense retrieval: Companies can now deploy LLM‑powered retrievers on commodity hardware (single GPU or even CPU‑optimized inference) without sacrificing search quality.
  • Cost savings: A 2× speed‑up translates directly into lower cloud compute bills, making LLM‑based semantic search viable for startups and mid‑size enterprises.
  • Edge & mobile scenarios: The compressed models fit within the memory limits of high‑end mobile devices, opening doors for on‑device privacy‑preserving search (e.g., personal knowledge bases).
  • Rapid prototyping: The open‑source EffiR pipeline can be plugged into existing retrieval frameworks (e.g., Pyserini, Haystack), allowing developers to experiment with different LLM backbones and compression levels in minutes (see the usage sketch after this list).
  • Future‑proofing: As newer, larger LLMs appear, the same redundancy patterns are expected to hold, meaning the same compression recipe can keep scaling costs in check.
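
To make the plug‑and‑play point concrete, here is a minimal usage sketch with the Hugging Face transformers API. The checkpoint name is hypothetical, and mean pooling is used purely for illustration; the pooling strategy of an actual EffiR checkpoint would depend on how it was trained.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "your-org/effir-compressed-retriever"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:  # decoder-style tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def embed(texts):
    """Encode a list of strings into L2-normalized embeddings (mean pooling for illustration)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state    # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)  # ignore padding positions
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(emb, dim=-1)


queries = embed(["how to cut llm inference cost"])
docs = embed(["Pruning MLP layers reduces FLOPs with little quality loss.",
              "Attention layers aggregate semantic cues across tokens."])
print(queries @ docs.T)  # cosine similarities (embeddings are normalized)
```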

Limitations & Future Work

  • Attention‑layer rigidity: The study confirms that attention blocks are indispensable for retrieval, but it does not explore more aggressive sparsity or low‑rank approximations within attention itself.
  • Domain‑specific fine‑tuning: Experiments focus on general‑purpose BEIR datasets; performance on highly specialized corpora (e.g., legal or biomedical) may require additional domain adaptation.
  • Hardware diversity: Benchmarks are run on high‑end GPUs; further evaluation on CPUs, TPUs, or inference accelerators would solidify real‑world applicability.
  • Dynamic inference: Future work could investigate conditional execution (e.g., early‑exit strategies) to further cut latency for easy queries.

Overall, the paper delivers a clear, actionable roadmap for turning heavyweight LLM retrievers into lean, production‑grade components—an advance that should resonate strongly with developers building next‑generation search and recommendation systems.

Authors

  • Yibin Lei
  • Shwai He
  • Ang Li
  • Andrew Yates

Paper Information

  • arXiv ID: 2512.20612v1
  • Categories: cs.IR, cs.CL
  • Published: December 23, 2025