[Paper] Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval
Source: arXiv - 2602.17654v1
Overview
The paper introduces “Mine and Refine,” a two‑stage contrastive training pipeline that builds better semantic embeddings for e‑commerce search. By explicitly modeling graded relevance (exact match, substitute, complement) the authors achieve more reliable ranking scores that translate into higher click‑through and conversion rates in a live marketplace.
Key Contributions
- Two‑stage “Mine & Refine” framework – first learns a global multilingual embedding space, then sharpens it with hard‑sample mining and relevance‑aware loss.
- Label‑aware supervised contrastive loss that respects three relevance levels, producing embeddings that naturally separate these strata.
- Policy‑consistent supervision via a lightweight LLM fine‑tuned on human annotations, ensuring the model respects product‑listing rules and safety constraints.
- Multi‑class circle loss (an extension of the classic circle loss) that explicitly pushes apart embeddings belonging to different relevance grades.
- Robustness tricks – spelling augmentation, synthetic query generation, and engagement‑driven auditing to clean noisy labels.
- Extensive validation – offline metrics, large‑scale A/B tests, and measurable business impact (higher engagement, revenue uplift).
Methodology
Stage 1 – Global Retrieval Backbone
- A multilingual Siamese two‑tower architecture (query tower ↔ product tower) is trained on millions of query‑product pairs.
- The label‑aware supervised contrastive objective treats each relevance grade as a separate “label.” Positive pairs share the same grade, while negatives are drawn from other grades, encouraging the model to carve out distinct regions for exact matches, substitutes, and complements.
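The label‑aware objective can be sketched in plain NumPy. This is an illustrative implementation of supervised contrastive loss with relevance grades as labels, not the paper's actual code; the temperature value and the per‑anchor averaging scheme are assumptions.

```python
import numpy as np

def label_aware_supcon_loss(embeddings, grades, temperature=0.1):
    """Supervised contrastive loss where pairs sharing a relevance grade
    are positives and items from other grades serve as negatives.
    embeddings: (N, D) array, assumed L2-normalized; grades: (N,) ints."""
    grades = np.asarray(grades)
    n = len(grades)
    sims = embeddings @ embeddings.T / temperature   # scaled pairwise similarities
    np.fill_diagonal(sims, -np.inf)                  # exclude self-pairs
    log_denom = np.log(np.exp(sims).sum(axis=1))     # log-sum over all other items
    same = grades[:, None] == grades[None, :]        # positives share a grade
    np.fill_diagonal(same, False)
    losses = []
    for i in range(n):
        pos = np.where(same[i])[0]
        if len(pos) == 0:
            continue                                 # anchor with no positive
        losses.append(np.mean(log_denom[i] - sims[i, pos]))  # -log p(positive)
    return float(np.mean(losses))
```

Embeddings that cluster by grade drive this loss toward zero, while orthogonal (uninformative) embeddings leave it near `log(N - 1)`.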
Stage 2 – Hard‑Sample Mining & Refinement
- Using Approximate Nearest Neighbor (ANN) search on the Stage 1 embeddings, the system mines hard pairs that lie near decision boundaries.
- These hard pairs are re‑annotated by a policy‑aligned LLM (a small language model fine‑tuned on a curated set of human relevance judgments). This step injects consistent, rule‑aware labels while filtering out noisy crowd‑sourced signals.
- The refined embeddings are trained with a multi‑class circle loss, which directly maximizes angular margins between the three relevance clusters, making the similarity scores more separable for downstream ranking/blending.
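The circle loss underlying the refinement stage can be sketched per anchor as follows. This follows the standard circle‑loss formulation (adaptive margins around positive and negative similarities); the margin and scale values are illustrative defaults, not taken from the paper, and the multi‑class extension simply draws positives and negatives according to relevance grade.

```python
import numpy as np

def circle_loss(sp, sn, m=0.25, gamma=32.0):
    """Circle loss for a single anchor query.
    sp: similarities to same-grade (positive) items,
    sn: similarities to other-grade (negative) items,
    m: relaxation margin, gamma: scale factor."""
    sp, sn = np.asarray(sp, float), np.asarray(sn, float)
    ap = np.clip(1 + m - sp, 0, None)         # adaptive positive weights
    an = np.clip(sn + m, 0, None)             # adaptive negative weights
    delta_p, delta_n = 1 - m, m               # decision margins
    logit_p = -gamma * ap * (sp - delta_p)
    logit_n = gamma * an * (sn - delta_n)
    return float(np.log1p(np.exp(logit_p).sum() * np.exp(logit_n).sum()))
```

Well‑separated grades (positives near 1, negatives near 0) yield a near‑zero loss, while inverted similarities are penalized heavily, which is what drives the angular margins between relevance clusters apart.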
Robustness Enhancements
- Spelling augmentation (random character swaps, deletions) expands the query distribution to cover typo‑heavy user input.
- Synthetic query generation creates paraphrases and domain‑specific variations, further diversifying training data.
- Engagement‑driven auditing monitors live click‑through and conversion signals to spot systematic labeling errors and trigger additional LLM re‑annotation cycles.
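The spelling augmentation above can be sketched as a small typo generator. The edit operations (adjacent swap, deletion) come from the summary; the edit count, mix ratio, and seeding are assumptions for illustration.

```python
import random

def typo_augment(query, n_edits=1, seed=None):
    """Apply random character swaps and deletions to simulate
    typo-heavy user queries. A minimal sketch; the paper's exact
    augmentation recipe may differ."""
    rng = random.Random(seed)
    chars = list(query)
    for _ in range(n_edits):
        if len(chars) < 2:
            break
        if rng.random() < 0.5:                 # swap adjacent characters
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        else:                                  # delete a character
            del chars[rng.randrange(len(chars))]
    return "".join(chars)
```

Feeding augmented copies of each training query alongside the original teaches the query tower to map misspelled inputs near their clean counterparts.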
Results & Findings
| Metric | Baseline (single‑stage) | Mine & Refine |
|---|---|---|
| Offline Recall@100 (multilingual) | 0.71 | 0.78 (+9.9%) |
| NDCG@10 (graded relevance) | 0.62 | 0.70 (+12.9%) |
| Live Click‑Through Rate (CTR) uplift | – | +4.3% |
| Conversion Rate uplift | – | +3.1% |
| Revenue per search session | – | +2.6% |
- The refined embeddings produce clearer score gaps between relevance levels, simplifying threshold tuning for hybrid (BM25 + neural) systems.
- A/B tests showed statistically significant improvements across core engagement KPIs, confirming that the offline gains translate to real‑world user behavior.
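The score‑gap property can be sketched as a simple blending rule. The weight and the grade thresholds below are hypothetical placeholders (the paper does not publish its operating points); the point is that well‑separated similarity bands make such thresholds easy to pick and stable over time.

```python
def blend_scores(bm25, cosine, w=0.4, exact_thr=0.8, subst_thr=0.6):
    """Blend a lexical BM25 score (assumed min-max normalized to [0, 1])
    with a neural cosine similarity, and bucket the neural score into a
    relevance grade. Threshold values are illustrative, not the paper's."""
    score = w * bm25 + (1 - w) * cosine
    if cosine >= exact_thr:
        grade = "exact"
    elif cosine >= subst_thr:
        grade = "substitute"
    else:
        grade = "complement"
    return score, grade
```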
Practical Implications
- More stable hybrid ranking – With distinct similarity bands for exact, substitute, and complementary matches, engineers can blend lexical and neural scores without frequent re‑calibration.
- Policy compliance baked in – Using an LLM that respects product‑listing rules reduces the risk of surfacing prohibited or unsafe items, a common concern in regulated marketplaces.
- Scalable to long‑tail queries – The multilingual backbone and typo‑robust augmentations mean the system works well for rare or misspelled queries without needing per‑query hand‑crafting.
- Reduced engineering overhead – Hard‑sample mining automatically surfaces the most informative training pairs, cutting down on manual data labeling cycles.
- Plug‑and‑play component – The two‑tower architecture and loss functions can be dropped into existing retrieval pipelines (e.g., Faiss, Milvus) with minimal infrastructure changes.
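As a minimal stand‑in for a production vector index such as Faiss or Milvus, retrieval over the product tower's embeddings reduces to normalized inner‑product search. This brute‑force NumPy sketch is for illustration only; at catalog scale an approximate index replaces the exhaustive scan.

```python
import numpy as np

def topk_cosine(query_vec, product_matrix, k=5):
    """Brute-force cosine top-k retrieval over product embeddings.
    query_vec: (D,) query-tower output; product_matrix: (N, D) rows of
    product-tower outputs. Returns (indices, similarities)."""
    q = query_vec / np.linalg.norm(query_vec)
    p = product_matrix / np.linalg.norm(product_matrix, axis=1, keepdims=True)
    sims = p @ q                       # cosine similarity via normalized dot
    idx = np.argsort(-sims)[:k]        # highest-similarity products first
    return idx, sims[idx]
```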
Limitations & Future Work
- Dependence on LLM quality – The refinement stage hinges on the policy‑aligned LLM’s ability to mimic human relevance judgments; any drift in the LLM could propagate errors.
- Three‑level relevance granularity – While sufficient for many marketplaces, some domains may need finer granularity (e.g., “highly relevant” vs. “moderately relevant”). Extending the loss to more classes is an open avenue.
- Computational cost of hard‑sample mining – ANN search over billions of vectors is non‑trivial; future work could explore more efficient on‑the‑fly mining or curriculum‑based sampling.
- Cross‑modal extensions – The current work focuses on text‑only embeddings; integrating product images or video could further boost relevance for visual‑heavy catalogs.
Bottom line: “Mine and Refine” offers a pragmatic, production‑ready recipe for building e‑commerce search embeddings that respect graded relevance, stay policy‑compliant, and deliver measurable business lift—making it a compelling addition to any modern search stack.
Authors
- Jiaqi Xi
- Raghav Saboo
- Luming Chen
- Martin Wang
- Sudeep Das
Paper Information
- arXiv ID: 2602.17654v1
- Categories: cs.IR, cs.LG
- Published: February 19, 2026