[Paper] Fine-tuning Small Language Models as Efficient Enterprise Search Relevance Labelers

Published: January 6, 2026 at 12:48 PM EST
4 min read
Source: arXiv - 2601.03211v1

Overview

Enterprises need massive amounts of relevance‑labeled query‑document pairs to train and evaluate search systems, but obtaining high‑quality human annotations at scale is prohibitively expensive. This paper shows how to fine‑tune a small language model (SLM) to act as an accurate, cheap relevance labeler, using synthetic data generated by a large language model (LLM). The resulting SLM matches or exceeds the labeling quality of the original LLM while delivering 17× higher throughput and 19× lower cost, making enterprise‑wide relevance labeling practical.

Key Contributions

  • Synthetic data pipeline: Generates realistic enterprise queries from seed documents, retrieves hard negatives with BM25, and annotates relevance with a teacher LLM.
  • Distillation to a small model: Trains a compact SLM (e.g., 300M‑parameter model) on the synthetic dataset, turning it into a fast relevance classifier.
  • Benchmark validation: Evaluates the distilled SLM on a curated set of 923 human‑annotated query‑document pairs, achieving agreement on par with or better than the teacher LLM.
  • Efficiency gains: Demonstrates a 17× speedup and 19× cost reduction compared with using the teacher LLM directly for labeling.
  • Open‑source‑ready recipe: Provides a reproducible workflow that can be adapted to any enterprise domain with minimal engineering effort.

Methodology

  1. Seed Document Collection – Gather a modest set of domain‑specific documents (e.g., internal knowledge‑base articles).
  2. Query Synthesis – Prompt a powerful LLM (e.g., GPT‑4) to write plausible enterprise search queries that would retrieve each seed document.
  3. Hard Negative Mining – Run BM25 over the document corpus to pull the top‑k non‑relevant passages for each synthetic query, ensuring the training set contains challenging distractors (steps 3 and 4 are sketched in code just after this list).
  4. Teacher Scoring – Use the same LLM to assign a relevance score (e.g., binary or graded) to every query‑document pair (including the hard negatives). This creates a large, automatically labeled dataset.
  5. Distillation – Fine‑tune a smaller, more efficient language model on the teacher‑generated labels, treating the LLM’s scores as soft targets.
  6. Evaluation – Compare the distilled SLM’s predictions against a high‑quality human‑annotated benchmark, measuring agreement (e.g., Kendall’s τ, nDCG).
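
As a concrete illustration of steps 3 and 4, the sketch below mines BM25 hard negatives with the `rank_bm25` package and leaves the teacher call as a stub. The toy corpus, the example query, and the `score_with_teacher` helper are illustrative placeholders, not the paper's actual implementation.

```python
# Minimal sketch of steps 3 and 4: BM25 hard-negative mining plus a stubbed
# teacher-scoring call. Requires the rank_bm25 package (pip install rank-bm25).
from rank_bm25 import BM25Okapi

# Toy stand-in for the enterprise document corpus (step 1).
corpus = [
    "How to reset your VPN credentials on the corporate network.",
    "Quarterly expense report submission guidelines.",
    "Troubleshooting single sign-on failures in the HR portal.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def mine_hard_negatives(query: str, positive_doc: str, k: int = 2) -> list[str]:
    """Return the top-k BM25 hits that are not the known-relevant document."""
    hits = bm25.get_top_n(query.lower().split(), corpus, n=k + 1)
    return [doc for doc in hits if doc != positive_doc][:k]

def score_with_teacher(query: str, doc: str) -> int:
    """Placeholder for step 4: prompt the teacher LLM for a graded label (0-3)."""
    raise NotImplementedError("Call your LLM API here and parse the returned grade.")

query = "reset vpn password"          # synthetic query from step 2
positive = corpus[0]                  # the seed document it was generated from
pairs = [(query, positive)] + [(query, neg) for neg in mine_hard_negatives(query, positive)]
# labeled = [(q, d, score_with_teacher(q, d)) for q, d in pairs]
```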

The pipeline is deliberately modular: any LLM can serve as the teacher, any retrieval method can supply negatives, and any SLM architecture (e.g., DistilBERT, LLaMA‑7B) can be the student.
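
To make the distillation step concrete, here is a minimal fine-tuning sketch using Hugging Face Transformers with a KL‑divergence loss against the teacher's soft labels. The choice of DistilBERT, the four‑grade label space, and the hyperparameters are assumptions for illustration rather than the authors' exact recipe.

```python
# Illustrative distillation step: fine-tune a compact cross-encoder on the
# teacher's soft relevance labels. Model choice, the four-grade label space,
# and the hyperparameters are assumptions, not the authors' exact recipe.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "distilbert-base-uncased"     # any compact encoder could serve as the student
tokenizer = AutoTokenizer.from_pretrained(MODEL)
student = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=4)
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

# One synthetic training example: query, document, and the teacher's probability
# distribution over four relevance grades (the soft target).
query, doc = "reset vpn password", "How to reset your VPN credentials on the corporate network."
teacher_probs = torch.tensor([[0.02, 0.08, 0.30, 0.60]])

batch = tokenizer(query, doc, truncation=True, return_tensors="pt")
logits = student(**batch).logits

# KL divergence between the student's predicted distribution and the soft target.
loss = F.kl_div(F.log_softmax(logits, dim=-1), teacher_probs, reduction="batchmean")
optimizer.zero_grad()
loss.backward()
optimizer.step()
```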

Results & Findings

| Metric | Teacher LLM | Distilled SLM | Human Baseline |
|---|---|---|---|
| Kendall's τ (query‑doc relevance) | 0.78 | 0.80 | 0.81 |
| nDCG@10 | 0.86 | 0.87 | 0.88 |
| Throughput (queries/sec) | 120 | 2,040 | N/A |
| Cost per 1 M labels (USD) | $12,000 | $630 | N/A |

  • Quality: The distilled SLM slightly outperformed the teacher on both the correlation and ranking metrics, likely because the student is trained on the entire pool of teacher‑labeled examples, whereas the teacher judges each pair in isolation at inference time.
  • Speed: The SLM processes more than 2,000 queries per second on a single GPU, compared with roughly 120 queries per second for the teacher LLM.
  • Cost: Labeling 1 M query‑doc pairs drops from roughly $12 k (LLM API) to under $1 k with the SLM, a 19× reduction.
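
For reference, agreement metrics like those in the table can be computed with standard libraries. The grades and scores below are toy values, not the paper's benchmark data.

```python
# Toy illustration of the agreement metrics above; these numbers are made up
# and are not the paper's benchmark data.
from scipy.stats import kendalltau
from sklearn.metrics import ndcg_score

human_grades = [3, 2, 0, 1, 2, 0]               # human labels for one query's candidates
slm_scores = [0.9, 0.7, 0.1, 0.4, 0.6, 0.2]     # distilled SLM relevance scores

tau, _ = kendalltau(human_grades, slm_scores)
ndcg10 = ndcg_score([human_grades], [slm_scores], k=10)
print(f"Kendall's tau = {tau:.2f}, nDCG@10 = {ndcg10:.2f}")
```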

These numbers confirm that the approach delivers enterprise‑grade labeling quality at a fraction of the expense.

Practical Implications

  • Rapid offline evaluation – Teams can generate massive relevance test sets overnight, enabling frequent A/B testing of ranking models without waiting for human annotators.
  • Domain adaptation – By swapping the seed documents and re‑running the pipeline, companies can quickly produce relevance labels for new product lines, regulatory domains, or multilingual corpora.
  • Cost‑effective data augmentation – The SLM can be used to label billions of candidate pairs for weak supervision, feeding downstream neural rankers or dense retrieval models.
  • Edge deployment – Because the student model is small, it can run on on‑premise hardware or even edge devices, supporting privacy‑sensitive enterprise environments where sending data to external LLM APIs is prohibited.
  • Continuous improvement loop – As new human feedback arrives, it can be added to the synthetic pool, periodically re‑distilling the SLM to keep it up‑to‑date without re‑training a massive LLM.
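
As a rough sketch of the weak‑supervision use case, a batched inference loop like the one below (reusing the `student` and `tokenizer` names from the distillation sketch above) can label large candidate sets on a single GPU. The batch size and device handling are illustrative choices, not a prescribed setup.

```python
# Hypothetical batch-labeling loop for weak supervision, reusing the `student`
# and `tokenizer` objects from the distillation sketch above. Batch size and
# device handling are illustrative choices.
import torch

@torch.no_grad()
def label_pairs(student, tokenizer, pairs, batch_size=256,
                device="cuda" if torch.cuda.is_available() else "cpu"):
    """Assign an argmax relevance grade to each (query, document) pair."""
    student.eval().to(device)
    grades = []
    for i in range(0, len(pairs), batch_size):
        queries, docs = zip(*pairs[i:i + batch_size])
        batch = tokenizer(list(queries), list(docs), padding=True,
                          truncation=True, return_tensors="pt").to(device)
        grades.extend(student(**batch).logits.argmax(dim=-1).tolist())
    return grades
```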

Limitations & Future Work

  • Synthetic bias – The quality of generated queries and teacher scores depends on the LLM; systematic biases (e.g., over‑optimistic relevance) may be inherited by the SLM.
  • Hard negative diversity – BM25 may miss semantically similar negatives; incorporating neural retrieval for negative mining could improve robustness.
  • Scale of seed documents – The method assumes a representative seed set; very niche domains may still suffer from coverage gaps.
  • Evaluation scope – Benchmarks focus on a single enterprise dataset; broader cross‑industry validation is needed.
  • Future directions suggested by the authors include: (1) exploring multi‑teacher ensembles, (2) integrating reinforcement learning from human feedback to correct synthetic errors, and (3) extending the pipeline to multilingual enterprise corpora.

Authors

  • Yue Kang
  • Zhuoyi Huang
  • Benji Schussheim
  • Diana Licon
  • Dina Atia
  • Shixing Cao
  • Jacob Danovitch
  • Kunho Kim
  • Billy Norcilien
  • Jonah Karpman
  • Mahmound Sayed
  • Mike Taylor
  • Tao Sun
  • Pavel Metrikov
  • Vipul Agarwal
  • Chris Quirk
  • Ye‑Yi Wang
  • Nick Craswell
  • Irene Shaffer
  • Tianwei Chen
  • Sulaiman Vesal
  • Soundar Srinivasan

Paper Information

  • arXiv ID: 2601.03211v1
  • Categories: cs.IR, cs.AI, cs.CL
  • Published: January 6, 2026