[Paper] LORE: A Large Generative Model for Search Relevance

Published: December 2, 2025 at 01:50 PM EST
4 min read
Source: arXiv - 2512.03025v1

Overview

The paper presents LORE, a production‑grade framework that leverages large language models (LLMs) to improve relevance ranking in e‑commerce search. After three years of real‑world iteration, LORE delivers a 27% lift in the "GoodRate" metric, demonstrating that a carefully engineered LLM pipeline can outperform traditional relevance models at scale.

Key Contributions

  • Decomposition‑first design – Relevance is split into three orthogonal capabilities (knowledge + reasoning, multi‑modal matching, rule compliance) rather than treated as a single monolithic task (see the sketch after this list).
  • Two‑stage training pipeline
    1. Supervised Fine‑Tuning (SFT) with progressive Chain‑of‑Thought (CoT) synthesis to teach the model how to think step‑by‑step.
    2. Reinforcement Learning from Human Feedback (RLHF) to align the model’s outputs with business‑critical relevance signals.
  • RAIR benchmark – A curated evaluation suite that isolates each capability, enabling systematic diagnostics and continuous improvement.
  • Query‑frequency‑aware deployment – A stratified serving architecture that routes high‑frequency queries to a lightweight inference path while still benefiting from the full LLM for long‑tail searches.
  • Production impact report – Detailed lessons learned from data collection, feature engineering, offline‑online evaluation loops, and operational monitoring.
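
To make the decomposition concrete, here is a minimal sketch of how a relevance judgment might be composed from three capability scores. The structure, names, and weights are illustrative assumptions, not the paper's actual code:

```python
from dataclasses import dataclass

@dataclass
class CapabilityScores:
    """Per-capability relevance signals, each in [0, 1] (hypothetical names)."""
    knowledge_reasoning: float  # factual / inferential match
    multimodal_matching: float  # visual-textual alignment
    rule_compliance: float      # business-policy adherence

def combine(scores: CapabilityScores) -> float:
    """Compose a single relevance score: a hard policy gate plus a soft
    blend of the other two capabilities. The gate and the 50/50 weighting
    are assumptions for illustration, not the paper's formula."""
    if scores.rule_compliance < 0.5:  # policy violations dominate
        return 0.0
    return 0.5 * scores.knowledge_reasoning + 0.5 * scores.multimodal_matching
```

Treating each capability as a separately measurable signal is what makes the per‑capability diagnostics of the RAIR benchmark possible.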

Methodology

  1. Data & Feature Prep – The team aggregates click‑through logs, product catalogs, and user‑generated content (images, titles, reviews). They enrich this with external knowledge (e.g., brand hierarchies) and encode rule‑based constraints (e.g., prohibited terms).
  2. Progressive CoT SFT – Instead of feeding the model raw query‑product pairs, they generate intermediate reasoning steps (e.g., “Identify the product category → Match visual attributes → Apply promotional rules”) and fine‑tune the LLM to produce these steps before the final relevance score (first sketch after this list).
  3. Human Preference Alignment (RLHF) – Annotators rank multiple model outputs for the same query. The ranking data trains a reward model, which then guides policy optimization via Proximal Policy Optimization (PPO); see the second sketch below.
  4. Capability‑Specific Benchmarks (RAIR) – Test sets are divided into:
    • Knowledge/Reasoning: queries requiring factual inference (e.g., “water‑proof hiking boots”).
    • Multi‑modal Matching: queries that need visual‑textual alignment (e.g., “red floral dress”).
    • Rule Adherence: queries where business policies dominate (e.g., “discounted electronics”).
  5. Stratified Serving – Queries are bucketed by historical frequency. The top‑k frequent bucket uses a distilled, latency‑optimized model; the remaining bucket invokes the full LORE model, preserving quality for the long tail without hurting latency (third sketch below).
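
A minimal sketch of step 2's progressive CoT synthesis, assuming a simple prompt template. The field names and reasoning wording are a hypothetical reconstruction, not the paper's actual format:

```python
def build_cot_example(query: str, product: dict, label: str) -> dict:
    """Turn a (query, product, label) triple into an SFT example whose
    target spells out intermediate reasoning steps before the verdict."""
    prompt = (
        f"Query: {query}\n"
        f"Title: {product['title']}\n"
        f"Category: {product['category']}\n"
        "Reason step by step, then give a relevance verdict."
    )
    # Progressive CoT: the target mirrors the capability decomposition,
    # one explicit step per capability, ending in the final label.
    target = (
        f"Step 1 (category): the query implies '{product['category']}'.\n"
        "Step 2 (attributes): the title's attributes match the query.\n"
        "Step 3 (rules): no business policy is violated.\n"
        f"Verdict: {label}"
    )
    return {"prompt": prompt, "completion": target}

example = build_cot_example(
    "water-proof hiking boots",
    {"title": "Men's Waterproof Trail Boot", "category": "hiking boots"},
    "Relevant",
)
```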
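For step 3, the reward model behind PPO is typically fit on pairwise preferences with a Bradley–Terry style loss. A minimal PyTorch sketch of that standard formulation (not the paper's exact setup):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the annotator-preferred
    output above the rejected one. Inputs are scalar reward-model outputs,
    one per preference pair, shape (batch,)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: rewards for preferred vs. rejected outputs on the same queries.
chosen = torch.tensor([1.2, 0.4, 0.9])
rejected = torch.tensor([0.3, 0.5, -0.1])
loss = preference_loss(chosen, rejected)  # smaller when chosen > rejected
```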
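And a sketch of step 5's frequency‑stratified routing. The threshold, counter source, and model names are assumptions:

```python
def route(query: str, freq_counts: dict[str, int],
          head_threshold: int = 10_000) -> str:
    """Send head queries to the distilled low-latency model and everything
    else to the full LORE model. In production the threshold would be tuned
    against latency and quality budgets; 10,000 is illustrative."""
    if freq_counts.get(query, 0) >= head_threshold:
        return "distilled-ranker"  # latency-optimized path for head traffic
    return "lore-full"             # full LLM path for the long tail
```

Because head queries dominate traffic volume, even a small distilled model on that path keeps the tail‑latency increase to a few milliseconds, as the results table below shows.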

Results & Findings

| Metric | Baseline (traditional ranker) | LORE (full pipeline) | Δ |
| --- | --- | --- | --- |
| GoodRate (online) | 1.00× | 1.27× | +27% |
| NDCG@10 (RAIR) – Knowledge | 0.71 | 0.84 | +13% |
| NDCG@10 (RAIR) – Multi‑modal | 0.68 | 0.80 | +18% |
| NDCG@10 (RAIR) – Rule adherence | 0.75 | 0.88 | +17% |
| Latency (p99) – high‑frequency bucket | 45 ms | 48 ms | +3 ms (acceptable) |
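
For reference, NDCG@10 (normalized discounted cumulative gain over the top 10 results) is the standard ranking metric reported above. Its textbook definition, not anything paper‑specific, is:

```latex
\mathrm{DCG@10} = \sum_{i=1}^{10} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
```

where rel_i is the graded relevance of the result at rank i and IDCG@10 is the DCG of the ideal ordering, so scores lie in [0, 1].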

Interpretation: Decomposing relevance lets the model specialize, yielding consistent gains across all capability dimensions. The two‑stage training (SFT → RLHF) is crucial: SFT gives the model a solid “thinking” foundation, while RLHF aligns it with the business’s notion of “good” results. The stratified serving strategy keeps latency within production tolerances.

Practical Implications

  • For Search Engineers: LORE demonstrates that you can retrofit an LLM into an existing ranking stack without sacrificing latency, provided you adopt a frequency‑aware serving layer.
  • For Product Teams: The modular capability view makes it easier to prioritize engineering effort (e.g., focus on visual matching when launching a new apparel line).
  • For ML Ops: The paper’s lifecycle documentation—data pipelines, progressive CoT generation, RLHF loops, and continuous A/B testing—offers a reproducible template for other verticals such as travel, real‑estate, or job search.
  • Business Impact: A 27 % lift in GoodRate translates directly into higher conversion, lower bounce, and better user satisfaction, justifying the compute cost of LLM inference on the long tail.
  • Open‑source Potential: The RAIR benchmark can be adopted as a community standard for relevance evaluation, encouraging research that targets real‑world search constraints rather than generic language tasks.

Limitations & Future Work

  • Compute Overhead – Even with stratified serving, the full LLM remains expensive for massive traffic spikes; further model distillation or sparsity techniques could reduce cost.
  • Domain Transfer – LORE is tuned on a specific e‑commerce catalog; applying the same pipeline to a drastically different domain (e.g., medical literature) may require substantial re‑engineering of the capability decomposition.
  • Rule Evolution – Business policies change rapidly; the current pipeline relies on periodic re‑training rather than real‑time rule injection. Future work could explore dynamic rule adapters that modify LLM outputs on the fly.
  • Explainability – While CoT provides intermediate reasoning, the final relevance score is still a black‑box output; integrating more transparent scoring mechanisms would aid auditability.

Bottom line: LORE is a compelling case study that bridges cutting‑edge LLM research with the gritty realities of production e‑commerce search, offering a roadmap for teams eager to harness generative AI for relevance optimization.

Authors

  • Chenji Lu
  • Zhuo Chen
  • Hui Zhao
  • Zhiyuan Zeng
  • Gang Zhao
  • Junjie Ren
  • Ruicong Xu
  • Haoran Li
  • Songyan Liu
  • Pengjie Wang
  • Jian Xu
  • Bo Zheng

Paper Information

  • arXiv ID: 2512.03025v1
  • Categories: cs.IR, cs.AI, cs.CL, cs.LG
  • Published: December 2, 2025