[Paper] Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment

Published: March 2, 2026 at 01:15 PM EST
4 min read
Source: arXiv

Overview

The paper investigates whether classic retrieval‑fusion tricks—like issuing multiple queries or applying Reciprocal Rank Fusion (RRF)—actually make a difference in a real‑world Retrieval‑Augmented Generation (RAG) system. By wiring these techniques into an enterprise‑scale pipeline (fixed retrieval depth, re‑ranking budget, and strict latency limits), the authors show that the expected boost in answer quality often evaporates once the system’s downstream constraints are taken into account.

Key Contributions

  • Empirical evaluation of retrieval fusion in a production‑style RAG stack (multi‑query, RRF, and hybrid variants).
  • Demonstration that raw recall improvements do not reliably translate into higher end‑to‑end accuracy (e.g., Hit@10 drops from 0.51 to 0.48 for several fusion configs).
  • Quantitative analysis of latency overhead introduced by query rewriting and larger candidate pools.
  • A framework for joint evaluation of retrieval quality, system efficiency, and downstream generation impact.
  • Practical recommendations for engineers: prioritize budget‑aware re‑ranking over aggressive fusion when operating under latency constraints.

Methodology

  1. Dataset & Knowledge Base – An internal enterprise knowledge base (≈ millions of documents) with a set of user‑query test cases.
  2. Baseline Pipeline – Single‑query retrieval (BM25 + dense encoder) → top‑k candidates → lightweight cross‑encoder re‑ranking → truncated context fed to an LLM generator.
  3. Fusion Variants
    • Multi‑query: generate several paraphrases of the original query and pool results.
    • Reciprocal Rank Fusion (RRF): merge ranked lists from different retrievers using the classic RRF formula.
    • Hybrid: combine multi‑query with RRF.
  4. Constraints – Fixed retrieval depth (e.g., 100 docs), a hard re‑ranking budget (max 20 cross‑encoder calls), and a latency ceiling (~300 ms per request).
  5. Metrics
    • Recall@k at the retrieval stage.
    • KB‑level Top‑k accuracy (Hit@10) after re‑ranking and generation.
    • Latency (query rewrite + retrieval + re‑ranking).
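The two retrieval‑quality metrics above have standard definitions, which can be sketched in a few lines (function names here are illustrative, not from the paper):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def hit_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0

print(recall_at_k(["a", "b", "c"], ["b", "d"], k=2))  # 0.5 — one of two relevant docs found
print(hit_at_k(["a", "b", "c"], ["b", "d"], k=2))     # 1.0 — at least one relevant doc in top 2
```

Note that Hit@k saturates at 1.0 per query, which is why large recall gains can coexist with flat or falling Hit@10.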

All experiments were run on the same hardware to isolate the effect of the fusion logic.
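The RRF merge in step 3 uses the classic formula from Cormack et al. (2009): each document's fused score is the sum of 1/(k + rank) over every ranked list in which it appears, with k = 60 as the conventional constant. A minimal sketch (the paper does not publish its implementation, so this is the textbook version):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs with the classic RRF formula.

    Each document scores sum(1 / (k + rank)) over every list it appears in,
    where rank is 1-based. k=60 is the constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: a BM25 ranking and a dense-retriever ranking over the same corpus
bm25 = ["d1", "d2", "d3", "d4"]
dense = ["d3", "d1", "d5"]
print(reciprocal_rank_fusion([bm25, dense]))  # ['d1', 'd3', 'd2', 'd5', 'd4']
```

Documents ranked highly by both retrievers (d1, d3) dominate the fused list, which is exactly what enlarges the candidate pool downstream.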

Results & Findings

| Configuration | Retrieval Recall@100 | Hit@10 (end‑to‑end) | Avg. Latency |
| --- | --- | --- | --- |
| Single‑query (baseline) | 0.62 | 0.51 | 280 ms |
| Multi‑query (3 paraphrases) | 0.71 (+9 pts) | 0.48 | 340 ms |
| RRF (2 retrievers) | 0.68 (+6 pts) | 0.49 | 325 ms |
| Hybrid (multi‑query + RRF) | 0.73 (+11 pts) | 0.48 | 360 ms |

Key takeaways

  • Recall gains are real (up to +11 pts) but vanish after re‑ranking because the re‑ranker can only inspect a limited slice of the enlarged candidate set.
  • Hit@10 never surpasses the baseline; in fact, it drops modestly for most fusion setups.
  • Latency increases by 15‑30 %, primarily due to extra query generation and larger pools fed to the re‑ranker.
  • The re‑ranking budget is the bottleneck: once you hit the limit, adding more candidates does not help and may even hurt because the best documents get pushed out of the top‑k that the re‑ranker sees.
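The budget bottleneck can be made concrete with a small sketch (all names and scores are hypothetical, not from the paper): with a hard budget of 20 cross‑encoder calls, the re‑ranker only ever scores the first 20 fused candidates, so a relevant document that fusion pushes past that cutoff is never rescored at all.

```python
def rerank_with_budget(candidates, score_fn, budget=20):
    """Cross-encoder re-ranking under a hard call budget.

    Only the first `budget` candidates are ever scored; anything that
    fusion appends beyond that point is silently dropped, which is why
    a larger fused pool need not improve (and can hurt) end-to-end Hit@k.
    """
    visible = candidates[:budget]  # the slice the re-ranker actually sees
    return sorted(visible, key=score_fn, reverse=True)

# Hypothetical relevance scores standing in for cross-encoder outputs
scores = {f"d{i}": 1.0 / i for i in range(1, 101)}
scores["gold"] = 10.0  # the truly relevant document

baseline_pool = [f"d{i}" for i in range(1, 20)] + ["gold"]  # gold at rank 20
fused_pool = [f"d{i}" for i in range(1, 25)] + ["gold"]     # fusion pushes gold to rank 25

print(rerank_with_budget(baseline_pool, scores.get)[0])      # gold — inside the budget, so it wins
print("gold" in rerank_with_budget(fused_pool, scores.get))  # False — pushed past the budget
```

The fused pool has strictly higher recall (the gold document is still in it), yet the budget‑limited re‑ranker never sees it: recall up, Hit@k down.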

Practical Implications

  • Engineers should treat retrieval fusion as a “budget‑aware” optimization. If your pipeline already hits a tight latency or re‑ranking quota, throwing more queries at the retriever is unlikely to improve user‑facing answers.
  • Focus on smarter re‑ranking (e.g., early‑exit models, hierarchical rerankers) rather than expanding the raw candidate pool.
  • Monitoring pipelines: Include both recall‑level metrics and downstream accuracy/latency in dashboards; a rise in recall alone can be a red flag if end‑to‑end quality stagnates.
  • Cost‑sensitive deployments (cloud‑based RAG services) can save compute dollars by disabling multi‑query or RRF when operating under strict SLAs.
  • For enterprise search products, the paper suggests that a well‑tuned single‑query retriever + efficient re‑ranker often beats more complex fusion pipelines.
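One way to act on the SLA advice above is to gate fusion on the remaining latency budget per request. The sketch below is illustrative only — `retriever` and `rewriter` are placeholders for whatever retrieval and query‑rewriting components a deployment actually uses, and the half‑budget threshold is an arbitrary assumption:

```python
import time

def retrieve_with_sla(query, retriever, rewriter=None, latency_budget_ms=300):
    """Skip multi-query fusion when the latency budget is nearly spent.

    `retriever(q)` returns a ranked list of results; `rewriter(q)` returns
    paraphrased queries. Both are stand-ins for real pipeline components.
    """
    start = time.monotonic()
    pooled = list(retriever(query))
    if rewriter is not None:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        # Fire the extra paraphrased queries only if at least half the budget remains
        if elapsed_ms < 0.5 * latency_budget_ms:
            for paraphrase in rewriter(query):
                pooled.extend(retriever(paraphrase))
    return pooled

# Usage with stub components: fusion fires only when there is budget to spare
results = retrieve_with_sla("reset password", lambda q: [q + "-doc"],
                            rewriter=lambda q: [q + " (alt)"])
```

Under a strict SLA this degrades gracefully to the single‑query baseline instead of blowing the latency ceiling.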

Limitations & Future Work

  • The study is confined to one proprietary knowledge base; results may differ on open‑domain corpora or multilingual data.
  • Only one type of re‑ranker (cross‑encoder) and a single LLM generator were examined; alternative architectures could change the trade‑off.
  • Latency measurements were taken on fixed hardware; scaling to distributed or GPU‑accelerated setups might mitigate some overhead.
  • Future research directions include: adaptive fusion that dynamically adjusts the number of queries based on latency budget, and joint training of retriever‑fusion and re‑ranker components to better align recall with downstream effectiveness.

Authors

  • Luigi Medrano
  • Arush Verma
  • Mukul Chhabra

Paper Information

  • arXiv ID: 2603.02153v1
  • Categories: cs.IR, cs.AI, cs.CL
  • Published: March 2, 2026