[Paper] How Retrieved Context Shapes Internal Representations in RAG
Source: arXiv - 2602.20091v1
Overview
Retrieval‑augmented generation (RAG) pairs a large language model (LLM) with an external document retriever so the model can “look up” facts on the fly. While it is clear that the retrieved texts affect the final answer, far less is known about how they reshape the model’s internal hidden states. This paper probes those hidden representations, showing that the relevance and placement of retrieved documents systematically steer the LLM’s internal processing, and that those shifts predict the quality of the generated answer.
Key Contributions
- Representation‑centric analysis of RAG pipelines across four QA benchmarks and three popular LLMs (e.g., Llama‑2, Mistral, GPT‑3.5).
- Controlled experiments that isolate the effect of a single relevant document vs. a mix of relevant/irrelevant documents, and of multi‑document sets with varying relevance ratios.
- Layer‑wise diagnostics revealing which transformer layers are most sensitive to retrieved context and how relevance propagates through the network.
- Correlation study linking representation drift (measured by cosine distance, SVCCA, etc.) to downstream generation metrics (accuracy, factuality, hallucination rate).
- Design guidelines for building more robust RAG systems, such as relevance‑aware weighting and layer‑targeted integration strategies.
Methodology
1. RAG Setup – The authors plug a dense retriever (e.g., DPR) into three off‑the‑shelf LLMs. For each query, the retriever returns either:
   - a single document (relevant or deliberately irrelevant), or
   - a set of k documents with controlled relevance ratios (e.g., 100 % relevant, or 70 % relevant + 30 % noise).
2. Representation Extraction – Hidden states are captured at every transformer layer before the final language‑model head. Two main probes are used:
   - cosine similarity to the query‑only baseline, quantifying “drift”;
   - SVCCA / CCA, comparing subspace alignment across conditions.
3. Behavioral Metrics – The same inputs are fed to the full RAG pipeline, and the generated answers are evaluated with standard QA metrics (Exact Match, F1) plus a hallucination detector.
4. Analysis Pipeline –
   - Relevance Impact: compare representation drift when the retrieved document is truly relevant vs. irrelevant.
   - Layer Sensitivity: plot drift per layer to see where the model “absorbs” external knowledge.
   - Multi‑doc Interaction: measure how mixing in irrelevant documents dilutes or amplifies the signal.
All experiments are reproducible; code and checkpoints are released under an MIT license.
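The paper’s own probing code is not reproduced here, but the layer‑wise cosine‑drift diagnostic described above can be sketched in plain NumPy. This is a minimal illustration, assuming hidden states have already been exported as arrays of shape `(num_layers, hidden_dim)` (e.g., mean‑pooled over tokens); the function name and the toy data are this summary’s, not the authors’:

```python
import numpy as np

def layerwise_drift(h_query_only, h_with_context):
    """Cosine drift per layer between two runs of the same model.

    Both inputs: arrays of shape (num_layers, hidden_dim), e.g. the
    mean-pooled hidden state of each transformer layer. Drift is
    1 - cosine similarity, so 0 means identical representations.
    """
    a = np.asarray(h_query_only, dtype=float)
    b = np.asarray(h_with_context, dtype=float)
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return 1.0 - cos  # shape (num_layers,)

# Toy illustration: perturbing only the mid layers produces the kind of
# mid-layer drift peak the paper reports for relevant documents.
rng = np.random.default_rng(0)
base = rng.normal(size=(12, 64))                  # "query-only" states, 12 layers
shifted = base.copy()
shifted[6:10] += 0.8 * rng.normal(size=(4, 64))   # perturb layers 6-9
drift = layerwise_drift(base, shifted)
```

With real models, the same computation would be fed per‑layer hidden states from a query‑only forward pass and a query‑plus‑context forward pass.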
Results & Findings
| Condition | Avg. Representation Drift (Δ) | QA Accuracy ↑ | Hallucination ↓ |
|---|---|---|---|
| Relevant single doc | 0.42 | +12 % (vs. no‑retrieval) | –8 % |
| Irrelevant single doc | 0.15 | –3 % | +5 % |
| 70 % relevant / 30 % noise (k=5) | 0.31 | +6 % | –3 % |
| 30 % relevant / 70 % noise (k=5) | 0.18 | –1 % | +4 % |
- Early layers (1‑4) are relatively stable; they mainly encode the query.
- Mid‑to‑high layers (6‑12) show the biggest drift, especially when the retrieved doc is relevant. This is where the model fuses external facts with its internal knowledge.
- Irrelevant docs cause “noise drift” that peaks in the middle layers but quickly dissipates, leading to higher hallucination rates.
- Multi‑document sets behave roughly additively: each relevant doc contributes a proportional shift, but the benefit saturates once the share of noise documents passes a threshold.
A simple linear probe on the mid‑layer representations can predict the final answer’s correctness with >80 % AUC, confirming that internal state changes are a strong early indicator of downstream performance.
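The correctness‑probing idea can be reproduced in miniature with a logistic‑regression probe. The sketch below is a stand‑in, not the authors’ code: it uses synthetic “mid‑layer” features in which correct answers drift along a relevance direction, and a hand‑rolled rank‑based AUC so only NumPy is needed:

```python
import numpy as np

def auc_score(y_true, scores):
    """Rank-based AUC: probability a positive example outscores a negative one."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    return float(np.mean(pos[:, None] > neg[None, :]))

def train_logistic_probe(X, y, lr=0.1, steps=500):
    """Plain gradient-descent logistic regression (the 'linear probe')."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

# Synthetic stand-in for mid-layer representations: correctly answered
# queries (y=1) drift along a shared "relevance direction", others do not.
rng = np.random.default_rng(1)
n, d = 400, 32
direction = rng.normal(size=d)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 0.7 * y[:, None] * direction

w, b = train_logistic_probe(X, y)
auc = auc_score(y, X @ w + b)   # well above the 0.5 chance level here
```

On real data, `X` would be the mid‑layer hidden states and `y` the per‑query correctness labels; the paper reports such a probe exceeding 80 % AUC.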
Practical Implications
- Dynamic Retriever Scoring: Weight retrieved docs by relevance before they hit the LLM, or drop low‑relevance docs early to avoid contaminating mid‑layer representations.
- Layer‑Targeted Fusion: Plug the retrieved context into the model at the layers that are most receptive (e.g., layer 8 for Llama‑2‑7B) rather than only at the input embedding level. This can improve factual grounding without extra compute.
- Debugging RAG Pipelines: Monitoring representation drift in real time offers a lightweight sanity check—if drift stays low, the retriever likely returned irrelevant material, prompting a fallback or re‑query.
- Fine‑tuning Strategies: Train a small adapter that aligns the model’s mid‑layer subspace to “relevant‑doc” patterns to make the system more robust to noisy retrievals, reducing hallucinations in production chatbots.
- Evaluation Tooling: The released analysis scripts can be integrated into CI pipelines for QA bots, automatically flagging retrieval‑induced representation anomalies before deployment.
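The drift‑monitoring fallback above can be made concrete in a few lines. This is a hypothetical sketch: the `0.25` threshold is an illustrative operating point chosen between the drift values the paper reports for irrelevant (~0.15) and relevant (~0.42) documents, and the generation callables are stubs, not a real API:

```python
# Illustrative threshold between the paper's reported mid-layer drift for
# irrelevant documents (~0.15) and relevant ones (~0.42). Hypothetical value.
DRIFT_THRESHOLD = 0.25

def route_query(mid_layer_drift, answer_with_context, answer_without_context):
    """Fall back to retrieval-free generation when low drift suggests
    the retrieved documents were irrelevant.

    mid_layer_drift: mean cosine drift over the sensitive mid layers.
    The two answer_* arguments are callables so only one generation
    pass is actually executed.
    """
    if mid_layer_drift < DRIFT_THRESHOLD:
        # Low drift: the retrieved context barely moved the hidden states,
        # which the paper links to irrelevant retrievals. Drop the context
        # (or re-query) instead of risking a contaminated answer.
        return answer_without_context(), "fallback"
    return answer_with_context(), "rag"

# Usage with stub generators standing in for real LLM calls:
out, mode = route_query(0.42, lambda: "grounded answer", lambda: "plain answer")
```

In production, `mid_layer_drift` would come from the same layer‑wise diagnostic used in the paper’s analysis, averaged over the most context‑sensitive layers.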
Limitations & Future Work
- Retriever Quality Dependency: Experiments use a strong dense retriever; results may differ with sparse or hybrid retrieval methods.
- Scale Gap: Only models up to ~13 B parameters were examined; it remains unclear how the findings extrapolate to multi‑hundred‑billion LLMs.
- Task Scope: The study focuses on extractive QA; generative tasks like open‑ended summarization could exhibit different layer dynamics.
- Real‑World Noise: Synthetic “irrelevant” docs may not capture the full spectrum of noisy web data (e.g., contradictory facts, adversarial content).
Future directions include extending the analysis to multimodal RAG (e.g., image‑text retrieval), exploring reinforcement‑learning‑based retriever‑LLM co‑training, and building automated drift‑based routing mechanisms that dynamically select the optimal integration layer per query.
Authors
- Samuel Yeh
- Sharon Li
Paper Information
- arXiv ID: 2602.20091v1
- Categories: cs.CL
- Published: February 23, 2026