[Paper] Fine-tuning Small Language Models as Efficient Enterprise Search Relevance Labelers

Published: January 6, 2026 at 12:48 PM EST
4 min read
Source: arXiv - 2601.03211v1

Overview

Enterprises need massive amounts of relevance‑labeled query‑document pairs to train and evaluate search systems, but obtaining high‑quality human annotations at scale is prohibitively expensive. This paper shows how to fine‑tune a small language model (SLM) to act as an accurate, cheap relevance labeler, using synthetic data generated by a large language model (LLM). The resulting SLM matches or exceeds the labeling quality of the original LLM while delivering 17× higher throughput and 19× lower cost, making enterprise‑wide relevance labeling practical.

Key Contributions

  • Synthetic data pipeline: Generates realistic enterprise queries from seed documents, retrieves hard negatives with BM25, and annotates relevance with a teacher LLM.
  • Distillation to a small model: Trains a compact SLM (e.g., 300M‑parameter model) on the synthetic dataset, turning it into a fast relevance classifier.
  • Benchmark validation: Evaluates the distilled SLM on a curated set of 923 human‑annotated query‑document pairs, achieving agreement on par with or better than the teacher LLM.
  • Efficiency gains: Demonstrates a 17× speedup and 19× cost reduction compared with using the teacher LLM directly for labeling.
  • Open‑source‑ready recipe: Provides a reproducible workflow that can be adapted to any enterprise domain with minimal engineering effort.

Methodology

  1. Seed Document Collection – Gather a modest set of domain‑specific documents (e.g., internal knowledge‑base articles).
  2. Query Synthesis – Prompt a powerful LLM (e.g., GPT‑4) to write plausible enterprise search queries that would retrieve each seed document.
  3. Hard Negative Mining – Run BM25 over the document corpus to pull the top‑k non‑relevant passages for each synthetic query, ensuring the training set contains challenging distractors (steps 3 and 4 are sketched in code just after this list).
  4. Teacher Scoring – Use the same LLM to assign a relevance score (e.g., binary or graded) to every query‑document pair (including the hard negatives). This creates a large, automatically labeled dataset.
  5. Distillation – Fine‑tune a smaller, more efficient language model on the teacher‑generated labels, treating the LLM’s scores as soft targets.
  6. Evaluation – Compare the distilled SLM’s predictions against a high‑quality human‑annotated benchmark, measuring agreement (e.g., Kendall’s τ, nDCG).
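
As a concrete illustration of steps 3 and 4, the sketch below mines BM25 hard negatives with the `rank_bm25` package and leaves the teacher call as a stub. The toy corpus, the example query, and the `score_with_teacher` helper are illustrative placeholders, not the paper's actual implementation.

```python
# Minimal sketch of steps 3 and 4: BM25 hard-negative mining plus a stubbed
# teacher-scoring call. Requires the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

# Toy stand-in for the enterprise document corpus (step 1).
corpus = [
    "How to reset your VPN credentials on the corporate network.",
    "Quarterly expense report submission guidelines.",
    "Troubleshooting single sign-on failures in the HR portal.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def mine_hard_negatives(query: str, positive_doc: str, k: int = 2) -> list[str]:
    """Return the top-k BM25 hits that are not the known-relevant document."""
    hits = bm25.get_top_n(query.lower().split(), corpus, n=k + 1)
    return [doc for doc in hits if doc != positive_doc][:k]

def score_with_teacher(query: str, doc: str) -> int:
    """Placeholder for step 4: prompt the teacher LLM for a graded label (0-3)."""
    raise NotImplementedError("Call your LLM API here and parse the returned grade.")

query = "reset vpn password"          # synthetic query from step 2
positive = corpus[0]                  # the seed document it was generated from
pairs = [(query, positive)] + [(query, neg) for neg in mine_hard_negatives(query, positive)]
# labeled = [(q, d, score_with_teacher(q, d)) for q, d in pairs]
```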

The pipeline is deliberately modular: any LLM can serve as the teacher, any retrieval method can supply negatives, and any SLM architecture (e.g., DistilBERT, LLaMA‑7B) can be the student.
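
To make the distillation step concrete, here is a minimal fine-tuning sketch using Hugging Face Transformers with a KL‑divergence loss against the teacher's soft labels. The choice of DistilBERT, the four‑grade label space, and the hyperparameters are assumptions for illustration rather than the authors' exact recipe.

```python
# Illustrative distillation step: fine-tune a compact cross-encoder on the
# teacher's soft relevance labels. Model choice, the four-grade label space,
# and the hyperparameters are assumptions, not the authors' exact recipe.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "distilbert-base-uncased"     # any compact encoder could serve as the student
tokenizer = AutoTokenizer.from_pretrained(MODEL)
student = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=4)
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

# One synthetic training example: query, document, and the teacher's probability
# distribution over four relevance grades (the soft target).
query, doc = "reset vpn password", "How to reset your VPN credentials on the corporate network."
teacher_probs = torch.tensor([[0.02, 0.08, 0.30, 0.60]])

batch = tokenizer(query, doc, truncation=True, return_tensors="pt")
logits = student(**batch).logits

# KL divergence between the student's predicted distribution and the soft target.
loss = F.kl_div(F.log_softmax(logits, dim=-1), teacher_probs, reduction="batchmean")
optimizer.zero_grad()
loss.backward()
optimizer.step()
```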

Results & Findings

| Metric | Teacher LLM | Distilled SLM | Human Baseline |
|---|---|---|---|
| Kendall's τ (query‑doc relevance) | 0.78 | 0.80 | 0.81 |
| nDCG@10 | 0.86 | 0.87 | 0.88 |
| Throughput (queries/sec) | 120 | 2,040 | N/A |
| Cost per 1 M labels (USD) | $12,000 | $630 | N/A |

  • Quality: The distilled SLM slightly outperformed the teacher on both the correlation and ranking metrics, likely because the student is trained on the entire pool of teacher‑labeled examples, whereas the teacher judges each pair in isolation at inference time.
  • Speed: The SLM processes more than 2,000 queries per second on a single GPU, compared with roughly 120 queries per second for the teacher LLM.
  • Cost: Labeling 1 M query‑doc pairs drops from roughly $12 k (LLM API) to under $1 k with the SLM, a 19× reduction.
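
For reference, agreement metrics like those in the table can be computed with standard libraries. The grades and scores below are toy values, not the paper's benchmark data.

```python
# Toy illustration of the agreement metrics above; these numbers are made up
# and are not the paper's benchmark data.
from scipy.stats import kendalltau
from sklearn.metrics import ndcg_score

human_grades = [3, 2, 0, 1, 2, 0]               # human labels for one query's candidates
slm_scores = [0.9, 0.7, 0.1, 0.4, 0.6, 0.2]     # distilled SLM relevance scores

tau, _ = kendalltau(human_grades, slm_scores)
ndcg10 = ndcg_score([human_grades], [slm_scores], k=10)
print(f"Kendall's tau = {tau:.2f}, nDCG@10 = {ndcg10:.2f}")
```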

These numbers confirm that the approach delivers enterprise‑grade labeling quality at a fraction of the expense.

Practical Implications

  • Rapid offline evaluation – Teams can generate massive relevance test sets overnight, enabling frequent A/B testing of ranking models without waiting for human annotators.
  • Domain adaptation – By swapping the seed documents and re‑running the pipeline, companies can quickly produce relevance labels for new product lines, regulatory domains, or multilingual corpora.
  • Cost‑effective data augmentation – The SLM can be used to label billions of candidate pairs for weak supervision, feeding downstream neural rankers or dense retrieval models.
  • Edge deployment – Because the student model is small, it can run on on‑premise hardware or even edge devices, supporting privacy‑sensitive enterprise environments where sending data to external LLM APIs is prohibited.
  • Continuous improvement loop – As new human feedback arrives, it can be added to the synthetic pool, periodically re‑distilling the SLM to keep it up‑to‑date without re‑training a massive LLM.
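
As a rough sketch of the weak‑supervision use case, a batched inference loop like the one below (reusing the `student` and `tokenizer` names from the distillation sketch above) can label large candidate sets on a single GPU. The batch size and device handling are illustrative choices, not a prescribed setup.

```python
# Hypothetical batch-labeling loop for weak supervision, reusing the `student`
# and `tokenizer` objects from the distillation sketch above. Batch size and
# device handling are illustrative choices.
import torch

@torch.no_grad()
def label_pairs(student, tokenizer, pairs, batch_size=256,
                device="cuda" if torch.cuda.is_available() else "cpu"):
    """Assign an argmax relevance grade to each (query, document) pair."""
    student.eval().to(device)
    grades = []
    for i in range(0, len(pairs), batch_size):
        queries, docs = zip(*pairs[i:i + batch_size])
        batch = tokenizer(list(queries), list(docs), padding=True,
                          truncation=True, return_tensors="pt").to(device)
        grades.extend(student(**batch).logits.argmax(dim=-1).tolist())
    return grades
```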

Limitations & Future Work

  • Synthetic bias – The quality of generated queries and teacher scores depends on the LLM; systematic biases (e.g., over‑optimistic relevance) may be inherited by the SLM.
  • Hard negative diversity – BM25 may miss semantically similar negatives; incorporating neural retrieval for negative mining could improve robustness.
  • Scale of seed documents – The method assumes a representative seed set; very niche domains may still suffer from coverage gaps.
  • Evaluation scope – Benchmarks focus on a single enterprise dataset; broader cross‑industry validation is needed.
  • Future directions suggested by the authors include: (1) exploring multi‑teacher ensembles, (2) integrating reinforcement learning from human feedback to correct synthetic errors, and (3) extending the pipeline to multilingual enterprise corpora.

Authors

  • Yue Kang
  • Zhuoyi Huang
  • Benji Schussheim
  • Diana Licon
  • Dina Atia
  • Shixing Cao
  • Jacob Danovitch
  • Kunho Kim
  • Billy Norcilien
  • Jonah Karpman
  • Mahmound Sayed
  • Mike Taylor
  • Tao Sun
  • Pavel Metrikov
  • Vipul Agarwal
  • Chris Quirk
  • Ye‑Yi Wang
  • Nick Craswell
  • Irene Shaffer
  • Tianwei Chen
  • Sulaiman Vesal
  • Soundar Srinivasan

Paper Information

  • arXiv ID: 2601.03211v1
  • Categories: cs.IR, cs.AI, cs.CL
  • Published: January 6, 2026