[Paper] LORE: A Large Generative Model for Search Relevance
Source: arXiv - 2512.03025v1
Overview
The paper presents LORE, a production‑grade framework that uses large language models (LLMs) to improve relevance ranking in e‑commerce search. After three years of real‑world iteration, LORE delivers a 27% lift in the “GoodRate” metric, demonstrating that a carefully engineered LLM pipeline can outperform traditional relevance models at production scale.
Key Contributions
- Decomposition‑first design – Relevance is split into three orthogonal capabilities (knowledge + reasoning, multi‑modal matching, rule compliance) rather than treated as a single monolithic task.
- Two‑stage training pipeline –
  - Supervised Fine‑Tuning (SFT) with progressive Chain‑of‑Thought (CoT) synthesis to teach the model to reason step by step.
  - Reinforcement Learning from Human Feedback (RLHF) to align the model’s outputs with business‑critical relevance signals.
- RAIR benchmark – A curated evaluation suite that isolates each capability, enabling systematic diagnostics and continuous improvement.
- Query‑frequency‑aware deployment – A stratified serving architecture that routes high‑frequency queries to a lightweight inference path while still benefiting from the full LLM for long‑tail searches.
- Production impact report – Detailed lessons learned from data collection, feature engineering, offline‑online evaluation loops, and operational monitoring.
Methodology
- Data & Feature Prep – The team aggregates click‑through logs, product catalogs, and user‑generated content (images, titles, reviews). They enrich this with external knowledge (e.g., brand hierarchies) and encode rule‑based constraints (e.g., prohibited terms).
- Progressive CoT SFT – Instead of feeding the model raw query‑product pairs, they generate intermediate reasoning steps (e.g., “Identify the product category → Match visual attributes → Apply promotional rules”) and fine‑tune the LLM to produce these steps before the final relevance score.
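A minimal sketch of what one progressive CoT SFT example could look like; the step templates and JSON prompt/completion layout are assumptions for illustration, as the paper’s exact format is not reproduced here:

```python
import json

# Hypothetical layout for one SFT example with synthesized reasoning steps;
# the paper's actual prompt format is not specified here.

def make_cot_example(query: str, product_title: str, relevance: str) -> str:
    """Serialize a query-product pair with intermediate reasoning steps."""
    reasoning = [
        f"Step 1: Identify the product category implied by '{query}'.",
        f"Step 2: Match the query's visual/textual attributes against '{product_title}'.",
        "Step 3: Apply business rules (promotions, prohibited terms).",
    ]
    return json.dumps({
        "prompt": f"Query: {query}\nProduct: {product_title}\nThink step by step.",
        "completion": "\n".join(reasoning) + f"\nFinal relevance: {relevance}",
    })

print(make_cot_example("red floral dress", "Floral Print Midi Dress, Red", "Relevant"))
```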
- Human Preference Alignment (RLHF) – Annotators rank multiple model outputs for the same query. The ranking data trains a reward model, which then guides policy optimization via Proximal Policy Optimization (PPO).
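Before PPO, the ranked annotations train a reward model. Below is a minimal PyTorch sketch of the standard pairwise (Bradley–Terry) reward-model objective commonly used in RLHF pipelines; the `RewardHead` architecture and hidden size are assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

# Sketch of a pairwise (Bradley-Terry) reward-model loss, the standard
# objective preceding PPO in RLHF; architecture details are assumptions.

class RewardHead(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_dim) encoder output for a (query, output) pair
        return self.score(pooled).squeeze(-1)

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected): push preferred outputs above rejected ones."""
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

head = RewardHead()
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
loss = pairwise_loss(head(chosen), head(rejected))
loss.backward()
```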
- Capability‑Specific Benchmarks (RAIR) – Test sets are divided into three buckets (an evaluation sketch follows this list):
  - Knowledge/Reasoning: queries requiring factual inference (e.g., “water‑proof hiking boots”).
  - Multi‑modal Matching: queries that need visual‑textual alignment (e.g., “red floral dress”).
  - Rule Adherence: queries where business policies dominate (e.g., “discounted electronics”).
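A sketch of a per-capability evaluation loop in the spirit of RAIR, reporting NDCG@10 per bucket. The record format and bucket names are assumptions for illustration; only the NDCG formula itself is standard:

```python
import math
from collections import defaultdict

# Per-capability evaluation sketch; record format and bucket names are assumed.

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Standard NDCG@k over graded relevance labels in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate_by_capability(results: list[dict]) -> dict[str, float]:
    """results: [{'capability': 'knowledge', 'ranked_labels': [3, 2, 0, ...]}, ...]"""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r["capability"]].append(ndcg_at_k(r["ranked_labels"]))
    return {cap: sum(scores) / len(scores) for cap, scores in buckets.items()}

print(evaluate_by_capability([
    {"capability": "knowledge", "ranked_labels": [3, 0, 2, 1]},
    {"capability": "multimodal", "ranked_labels": [2, 3, 1, 0]},
]))
```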
- Stratified Serving – Queries are bucketed by historical frequency. The high‑frequency buckets are served by a distilled, latency‑optimized model; the remaining buckets invoke the full LORE model, preserving quality for the long tail without hurting latency.
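A minimal routing sketch under these assumptions; the frequency threshold, model interfaces, and frequency store are hypothetical, not the paper’s production design:

```python
# Hypothetical frequency-aware router; threshold and interfaces are invented.

FREQ_THRESHOLD = 10_000  # assumed cutoff between head and long-tail traffic

class StratifiedRouter:
    def __init__(self, distilled_model, full_model, query_freq: dict[str, int]):
        self.distilled = distilled_model
        self.full = full_model
        self.freq = query_freq  # historical query counts

    def score(self, query: str, product: str) -> float:
        # High-frequency head traffic: latency-optimized distilled model.
        if self.freq.get(query, 0) >= FREQ_THRESHOLD:
            return self.distilled(query, product)
        # Long tail: the full LORE-style model preserves quality.
        return self.full(query, product)

router = StratifiedRouter(
    distilled_model=lambda q, p: 0.5,   # stand-in scorers
    full_model=lambda q, p: 0.7,
    query_freq={"iphone case": 250_000},
)
print(router.score("iphone case", "Clear iPhone 15 Case"))          # distilled path
print(router.score("left-handed ukulele strap", "Ukulele Strap"))   # full-model path
```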
Results & Findings
| Metric | Baseline (traditional ranker) | LORE (full pipeline) | Δ |
|---|---|---|---|
| GoodRate (online) | 1.00× | 1.27× | +27% |
| NDCG@10 (RAIR) – Knowledge | 0.71 | 0.84 | +0.13 |
| NDCG@10 – Multi‑modal | 0.68 | 0.80 | +0.12 |
| NDCG@10 – Rule adherence | 0.75 | 0.88 | +0.13 |
| Latency (p99) – high‑freq bucket | 45 ms | 48 ms | +3 ms (acceptable) |
Interpretation: Decomposing relevance lets the model specialize, yielding consistent gains across all capability dimensions. The two‑stage training (SFT → RLHF) is crucial: SFT gives the model a solid “thinking” foundation, while RLHF aligns it with the business’s notion of “good” results. The stratified serving strategy keeps latency within production tolerances.
Practical Implications
- For Search Engineers: LORE demonstrates that you can retrofit an LLM into an existing ranking stack without sacrificing latency, provided you adopt a frequency‑aware serving layer.
- For Product Teams: The modular capability view makes it easier to prioritize engineering effort (e.g., focus on visual matching when launching a new apparel line).
- For ML Ops: The paper’s lifecycle documentation—data pipelines, progressive CoT generation, RLHF loops, and continuous A/B testing—offers a reproducible template for other verticals such as travel, real estate, or job search.
- Business Impact: A 27 % lift in GoodRate translates directly into higher conversion, lower bounce, and better user satisfaction, justifying the compute cost of LLM inference on the long tail.
- Open‑source Potential: The RAIR benchmark can be adopted as a community standard for relevance evaluation, encouraging research that targets real‑world search constraints rather than generic language tasks.
Limitations & Future Work
- Compute Overhead – Even with stratified serving, the full LLM remains expensive for massive traffic spikes; further model distillation or sparsity techniques could reduce cost.
- Domain Transfer – LORE is tuned on a specific e‑commerce catalog; applying the same pipeline to a drastically different domain (e.g., medical literature) may require substantial re‑engineering of the capability decomposition.
- Rule Evolution – Business policies change rapidly; the current pipeline relies on periodic re‑training rather than real‑time rule injection. Future work could explore dynamic rule adapters that modify LLM outputs on the fly.
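One possible shape for such an adapter, sketched as a hot-reloadable post-processing hook on model scores; the `Rule` format and clamping logic are invented for illustration, not proposed by the paper:

```python
import re
from dataclasses import dataclass

# Hypothetical dynamic rule adapter: rules are reloaded at runtime and applied
# as a post-filter on model outputs, avoiding a retrain when policies change.

@dataclass
class Rule:
    pattern: str        # regex matched against the product text
    max_score: float    # cap the relevance score when the rule fires

def apply_rules(score: float, product_text: str, rules: list[Rule]) -> float:
    """Clamp the model's relevance score according to the active rule set."""
    for rule in rules:
        if re.search(rule.pattern, product_text, flags=re.IGNORECASE):
            score = min(score, rule.max_score)
    return score

active_rules = [Rule(pattern=r"\breplica\b", max_score=0.0)]  # loaded at runtime
print(apply_rules(0.92, "Replica designer handbag", active_rules))  # -> 0.0
```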
- Explainability – While CoT provides intermediate reasoning, the final relevance score is still a black‑box output; integrating more transparent scoring mechanisms would aid auditability.
Bottom line: LORE is a compelling case study that bridges cutting‑edge LLM research with the gritty realities of production e‑commerce search, offering a roadmap for teams eager to harness generative AI for relevance optimization.
Authors
- Chenji Lu
- Zhuo Chen
- Hui Zhao
- Zhiyuan Zeng
- Gang Zhao
- Junjie Ren
- Ruicong Xu
- Haoran Li
- Songyan Liu
- Pengjie Wang
- Jian Xu
- Bo Zheng
Paper Information
- arXiv ID: 2512.03025v1
- Categories: cs.IR, cs.AI, cs.CL, cs.LG
- Published: December 2, 2025
- PDF: https://arxiv.org/pdf/2512.03025v1