[Paper] Scaling Retrieval Augmented Generation with RAG Fusion: Lessons from an Industry Deployment

Published: March 2, 2026 at 01:15 PM EST
4 min read
Source: arXiv

Overview

The paper investigates whether classic retrieval‑fusion tricks—like issuing multiple queries or applying Reciprocal Rank Fusion (RRF)—actually make a difference in a real‑world Retrieval‑Augmented Generation (RAG) system. By wiring these techniques into an enterprise‑scale pipeline (fixed retrieval depth, re‑ranking budget, and strict latency limits), the authors show that the expected boost in answer quality often evaporates once the system’s downstream constraints are taken into account.

Key Contributions

  • Empirical evaluation of retrieval fusion in a production‑style RAG stack (multi‑query, RRF, and hybrid variants).
  • Demonstration that raw recall improvements do not reliably translate into higher end‑to‑end accuracy (e.g., Hit@10 drops from 0.51 to 0.48 for several fusion configs).
  • Quantitative analysis of latency overhead introduced by query rewriting and larger candidate pools.
  • A framework for joint evaluation of retrieval quality, system efficiency, and downstream generation impact.
  • Practical recommendations for engineers: prioritize budget‑aware re‑ranking over aggressive fusion when operating under latency constraints.

Methodology

  1. Dataset & Knowledge Base – An internal enterprise knowledge base (≈ millions of documents) with a set of user‑query test cases.
  2. Baseline Pipeline – Single‑query retrieval (BM25 + dense encoder) → top‑k candidates → lightweight cross‑encoder re‑ranking → truncated context fed to an LLM generator.
  3. Fusion Variants
    • Multi‑query: generate several paraphrases of the original query and pool results.
    • Reciprocal Rank Fusion (RRF): merge ranked lists from different retrievers using the classic RRF formula.
    • Hybrid: combine multi‑query with RRF.
  4. Constraints – Fixed retrieval depth (e.g., 100 docs), a hard re‑ranking budget (max 20 cross‑encoder calls), and a latency ceiling (~300 ms per request).
  5. Metrics
    • Recall@k at the retrieval stage.
    • KB‑level Top‑k accuracy (Hit@10) after re‑ranking and generation.
    • Latency (query rewrite + retrieval + re‑ranking).
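The two retrieval‑quality metrics above have standard definitions, which can be sketched in a few lines (function names here are illustrative, not from the paper):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def hit_at_k(retrieved, relevant, k):
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0

print(recall_at_k(["a", "b", "c"], ["b", "d"], k=2))  # 0.5 — one of two relevant docs found
print(hit_at_k(["a", "b", "c"], ["b", "d"], k=2))     # 1.0 — at least one relevant doc in top 2
```

Note that Hit@k saturates at 1.0 per query, which is why large recall gains can coexist with flat or falling Hit@10.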

All experiments were run on the same hardware to isolate the effect of the fusion logic.
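The RRF merge in step 3 uses the classic formula from Cormack et al. (2009): each document's fused score is the sum of 1/(k + rank) over every ranked list in which it appears, with k = 60 as the conventional constant. A minimal sketch (the paper does not publish its implementation, so this is the textbook version):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs with the classic RRF formula.

    Each document scores sum(1 / (k + rank)) over every list it appears in,
    where rank is 1-based. k=60 is the constant from the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: a BM25 ranking and a dense-retriever ranking over the same corpus
bm25 = ["d1", "d2", "d3", "d4"]
dense = ["d3", "d1", "d5"]
print(reciprocal_rank_fusion([bm25, dense]))  # ['d1', 'd3', 'd2', 'd5', 'd4']
```

Documents ranked highly by both retrievers (d1, d3) dominate the fused list, which is exactly what enlarges the candidate pool downstream.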

Results & Findings

| Configuration | Retrieval Recall@100 | Hit@10 (end‑to‑end) | Avg. Latency |
| --- | --- | --- | --- |
| Single‑query (baseline) | 0.62 | 0.51 | 280 ms |
| Multi‑query (3 paraphrases) | 0.71 (+9 pts) | 0.48 | 340 ms |
| RRF (2 retrievers) | 0.68 (+6 pts) | 0.49 | 325 ms |
| Hybrid (multi‑query + RRF) | 0.73 (+11 pts) | 0.48 | 360 ms |

Key takeaways

  • Recall gains are real (up to +11 pts) but vanish after re‑ranking because the re‑ranker can only inspect a limited slice of the enlarged candidate set.
  • Hit@10 never surpasses the baseline; in fact, it drops modestly for most fusion setups.
  • Latency increases by 15‑30 %, primarily due to extra query generation and larger pools fed to the re‑ranker.
  • The re‑ranking budget is the bottleneck: once you hit the limit, adding more candidates does not help and may even hurt because the best documents get pushed out of the top‑k that the re‑ranker sees.
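The budget bottleneck can be made concrete with a small sketch (all names and scores are hypothetical, not from the paper): with a hard budget of 20 cross‑encoder calls, the re‑ranker only ever scores the first 20 fused candidates, so a relevant document that fusion pushes past that cutoff is never rescored at all.

```python
def rerank_with_budget(candidates, score_fn, budget=20):
    """Cross-encoder re-ranking under a hard call budget.

    Only the first `budget` candidates are ever scored; anything that
    fusion appends beyond that point is silently dropped, which is why
    a larger fused pool need not improve (and can hurt) end-to-end Hit@k.
    """
    visible = candidates[:budget]  # the slice the re-ranker actually sees
    return sorted(visible, key=score_fn, reverse=True)

# Hypothetical relevance scores standing in for cross-encoder outputs
scores = {f"d{i}": 1.0 / i for i in range(1, 101)}
scores["gold"] = 10.0  # the truly relevant document

baseline_pool = [f"d{i}" for i in range(1, 20)] + ["gold"]  # gold at rank 20
fused_pool = [f"d{i}" for i in range(1, 25)] + ["gold"]     # fusion pushes gold to rank 25

print(rerank_with_budget(baseline_pool, scores.get)[0])      # gold — inside the budget, so it wins
print("gold" in rerank_with_budget(fused_pool, scores.get))  # False — pushed past the budget
```

The fused pool has strictly higher recall (the gold document is still in it), yet the budget‑limited re‑ranker never sees it: recall up, Hit@k down.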

Practical Implications

  • Engineers should treat retrieval fusion as a “budget‑aware” optimization. If your pipeline already hits a tight latency or re‑ranking quota, throwing more queries at the retriever is unlikely to improve user‑facing answers.
  • Focus on smarter re‑ranking (e.g., early‑exit models, hierarchical rerankers) rather than expanding the raw candidate pool.
  • Monitoring pipelines: Include both recall‑level metrics and downstream accuracy/latency in dashboards; a rise in recall alone can be a red flag if end‑to‑end quality stagnates.
  • Cost‑sensitive deployments (cloud‑based RAG services) can save compute dollars by disabling multi‑query or RRF when operating under strict SLAs.
  • For enterprise search products, the paper suggests that a well‑tuned single‑query retriever + efficient re‑ranker often beats more complex fusion pipelines.
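One way to act on the SLA advice above is to gate fusion on the remaining latency budget per request. The sketch below is illustrative only — `retriever` and `rewriter` are placeholders for whatever retrieval and query‑rewriting components a deployment actually uses, and the half‑budget threshold is an arbitrary assumption:

```python
import time

def retrieve_with_sla(query, retriever, rewriter=None, latency_budget_ms=300):
    """Skip multi-query fusion when the latency budget is nearly spent.

    `retriever(q)` returns a ranked list of results; `rewriter(q)` returns
    paraphrased queries. Both are stand-ins for real pipeline components.
    """
    start = time.monotonic()
    pooled = list(retriever(query))
    if rewriter is not None:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        # Fire the extra paraphrased queries only if at least half the budget remains
        if elapsed_ms < 0.5 * latency_budget_ms:
            for paraphrase in rewriter(query):
                pooled.extend(retriever(paraphrase))
    return pooled

# Usage with stub components: fusion fires only when there is budget to spare
results = retrieve_with_sla("reset password", lambda q: [q + "-doc"],
                            rewriter=lambda q: [q + " (alt)"])
```

Under a strict SLA this degrades gracefully to the single‑query baseline instead of blowing the latency ceiling.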

Limitations & Future Work

  • The study is confined to one proprietary knowledge base; results may differ on open‑domain corpora or multilingual data.
  • Only one type of re‑ranker (cross‑encoder) and a single LLM generator were examined; alternative architectures could change the trade‑off.
  • Latency measurements were taken on fixed hardware; scaling to distributed or GPU‑accelerated setups might mitigate some overhead.
  • Future research directions include: adaptive fusion that dynamically adjusts the number of queries based on latency budget, and joint training of retriever‑fusion and re‑ranker components to better align recall with downstream effectiveness.

Authors

  • Luigi Medrano
  • Arush Verma
  • Mukul Chhabra

Paper Information

  • arXiv ID: 2603.02153v1
  • Categories: cs.IR, cs.AI, cs.CL
  • Published: March 2, 2026