[Paper] RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment
Source: arXiv - 2512.24943v1
Overview
The paper introduces RAIR, a rule‑aware benchmark that combines text and images to evaluate e‑commerce search relevance in Chinese. By mirroring real‑world shopping scenarios and enforcing a set of universal relevance rules, RAIR fills a gap in existing test suites, offering a tougher, more diagnostic yardstick for both large language models (LLMs) and vision‑language models (VLMs).
Key Contributions
- Standardized relevance framework – defines a clear, rule‑based evaluation protocol that can be adopted across the industry.
- Three‑tiered dataset:
- General subset – industry‑balanced sampling for baseline competency checks.
- Long‑tail hard subset – curated difficult queries (rare products, ambiguous intent) to stress‑test model limits.
- Visual salience subset – pairs queries with product images, probing multimodal understanding.
- Comprehensive empirical study – 14 open‑source and proprietary models (including GPT‑5) are benchmarked, revealing performance gaps even for state‑of‑the‑art systems.
- Open release – the dataset and evaluation scripts are publicly available, encouraging reproducibility and community‑wide adoption.
Methodology
- Data collection – Real e‑commerce search logs from a major Chinese platform were filtered and anonymized. Human annotators then labeled each query‑product pair with a relevance score according to a rulebook (e.g., “product must match the attribute explicitly mentioned in the query”).
- Rule‑aware design – The rulebook is encoded as a set of logical constraints that every model’s prediction must be judged against, ensuring consistency across evaluators.
- Subset construction:
- General: stratified sampling across product categories to reflect typical traffic.
- Long‑tail: mining low‑frequency queries and edge‑case products (e.g., niche accessories, misspelled terms).
- Visual salience: attaching high‑resolution product images and requiring models to fuse visual cues with textual intent.
- Evaluation pipeline – Models generate a relevance label (relevant / partially relevant / irrelevant). The pipeline automatically checks compliance with the rulebook and computes standard metrics (accuracy, F1) plus a Rule Violation Score that penalizes systematic rule breaches.
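The authors' released scripts are not reproduced here, but a minimal sketch of a rule‑aware evaluation loop of the kind described above might look as follows. The three‑way label set and the reported metrics come from the description; the `violates_attribute_rule` check, the dataset field names, and the `model.predict` interface are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a rule-aware evaluation loop (illustrative only;
# the rule check and field names are hypothetical, not the paper's code).
from sklearn.metrics import accuracy_score, f1_score

LABELS = ["relevant", "partially relevant", "irrelevant"]

def violates_attribute_rule(example, prediction):
    """Hypothetical rule: a pair predicted 'relevant' must have the product
    title contain every attribute term explicitly mentioned in the query."""
    if prediction != "relevant":
        return False
    return not all(attr in example["product_title"]
                   for attr in example["query_attributes"])

def evaluate(model, dataset):
    gold, pred, violations = [], [], 0
    for example in dataset:
        # The model returns one of LABELS for a (query, product, image) triple.
        label = model.predict(example["query"], example["product_title"],
                              example.get("image"))
        gold.append(example["gold_label"])
        pred.append(label)
        violations += violates_attribute_rule(example, label)
    return {
        "accuracy": accuracy_score(gold, pred),
        "macro_f1": f1_score(gold, pred, average="macro", labels=LABELS),
        # Fraction of predictions that break the rulebook (lower is better).
        "rule_violation": violations / len(dataset),
    }
```

The sketch mirrors the reported metric split (accuracy, F1, Rule Violation Score) but collapses the rulebook into a single attribute constraint for brevity; the actual benchmark encodes a richer set of constraints.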
Results & Findings
| Model | General Acc. | Long‑Tail Acc. | Visual Salience Acc. | Rule Violation ↓ |
|---|---|---|---|---|
| GPT‑5 (closed) | 84.2% | 68.5% | 71.3% | 3.1% |
| Claude‑2 | 78.9% | 61.2% | 64.0% | 4.5% |
| LLaMA‑2‑13B | 71.4% | 49.8% | 52.7% | 9.8% |
| Open‑source VLM (e.g., BLIP‑2) | 69.0% | 45.3% | 78.1% | 7.2% |
| Baseline BM25 | 62.5% | 38.0% | 40.2% | 12.4% |
- Even GPT‑5 struggles on the long‑tail subset, dropping nearly 16 points (84.2% → 68.5%) relative to the general subset, indicating that rare or ambiguous queries remain a blind spot.
- Visual salience helps VLMs: pure language models lag behind dedicated multimodal models on image‑grounded queries, but the gap narrows when language models are prompted with image captions (a caption‑augmented prompt sketch follows this list).
- Rule Violation Score surfaces systematic failures (e.g., ignoring attribute constraints) that raw accuracy masks.
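One concrete way to realize the caption‑prompting observation is to describe the product image in text and fold that description into the prompt for a text‑only LLM. The template below is a hypothetical illustration; the wording, field names, and caption source are assumptions, not the prompts used in the paper.

```python
# Hypothetical caption-augmented prompt for a text-only LLM; the template and
# caption source are assumptions, not the prompt format used in the paper.
def build_prompt(query: str, product_title: str, image_caption: str | None) -> str:
    parts = [
        "You are judging e-commerce search relevance.",
        f"Query: {query}",
        f"Product title: {product_title}",
    ]
    if image_caption:  # e.g., produced by an off-the-shelf captioning model
        parts.append(f"Product image (described): {image_caption}")
    parts.append("Answer with one label: relevant, partially relevant, or irrelevant.")
    return "\n".join(parts)

print(build_prompt("red leather wallet", "Genuine leather bifold wallet",
                   "A red bifold wallet photographed on a white background"))
```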
Practical Implications
- Benchmark‑driven product development – E‑commerce platforms can adopt RAIR to continuously monitor their search relevance pipelines, catching regressions before they affect shoppers (a minimal regression‑gate sketch follows this list).
- Model selection & fine‑tuning – The three subsets let engineers pinpoint whether a model needs better handling of rare queries, multimodal fusion, or rule compliance, guiding targeted fine‑tuning or prompt engineering.
- Standardized KPI – The rule‑aware metric offers a reproducible KPI that can be reported across vendors, facilitating fair comparisons and SLA definitions with AI service providers.
- Improved user experience – By surfacing weaknesses in handling niche products or visual cues, developers can prioritize data augmentation (e.g., synthetic product images) or rule‑based post‑processing to boost click‑through and conversion rates.
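For the monitoring scenario above, a minimal regression gate could compare a candidate model's RAIR‑style metrics against the currently deployed baseline and block deployment when any metric degrades beyond a tolerance. The baseline values, tolerances, and metric names below are placeholder assumptions, not figures prescribed by the benchmark.

```python
# Illustrative regression gate over RAIR-style metrics; the baseline scores,
# tolerances, and metric names are assumptions, not benchmark-mandated values.
BASELINE = {"accuracy": 0.84, "macro_f1": 0.80, "rule_violation": 0.03}
TOLERANCE = {"accuracy": -0.01, "macro_f1": -0.01, "rule_violation": +0.01}

def check_regression(candidate: dict) -> list[str]:
    """Return human-readable regressions (an empty list means safe to ship)."""
    problems = []
    for metric, allowed_delta in TOLERANCE.items():
        delta = candidate[metric] - BASELINE[metric]
        # Higher is better for accuracy/F1; lower is better for rule violations.
        worse = delta > allowed_delta if metric == "rule_violation" else delta < allowed_delta
        if worse:
            problems.append(f"{metric}: {BASELINE[metric]:.3f} -> {candidate[metric]:.3f}")
    return problems

if __name__ == "__main__":
    print(check_regression({"accuracy": 0.82, "macro_f1": 0.79, "rule_violation": 0.05}))
```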
Limitations & Future Work
- Language scope – RAIR is currently Chinese‑only; extending to multilingual e‑commerce contexts will be necessary for global platforms.
- Static rulebook – The rule set reflects the authors’ domain expertise; future work could explore dynamic rule generation from business policies or user feedback.
- Model coverage – While 14 models were evaluated, the rapidly evolving LLM landscape means new architectures (e.g., instruction‑tuned multimodal models) will need fresh benchmarking.
- Real‑time latency – The benchmark focuses on relevance accuracy, not inference speed; integrating latency constraints would make it more production‑ready.
RAIR offers a concrete, industry‑aligned yardstick for measuring search relevance in e‑commerce, pushing both researchers and practitioners to build models that not only score high on average but also obey the business rules that matter to real shoppers.
Authors
- Chenji Lu
- Zhuo Chen
- Hui Zhao
- Zhenyi Wang
- Pengjie Wang
- Jian Xu
- Bo Zheng
Paper Information
- arXiv ID: 2512.24943v1
- Categories: cs.IR, cs.AI, cs.CL, cs.LG
- Published: December 31, 2025