[Paper] AttentionRetriever: Attention Layers are Secretly Long Document Retrievers
Source: arXiv - 2602.12278v1
Overview
The paper introduces AttentionRetriever, a new retrieval model that turns the attention layers inside modern language models into powerful long‑document search engines. By making the retrieval process context‑aware and causally aware, it bridges a gap that existing dense or sparse retrievers struggle with when dealing with multi‑page texts such as reports, manuals, or legal contracts.
Key Contributions
- Attention‑based Retrieval Engine – Re‑purposes the self‑attention mechanism of transformer models to generate document embeddings that capture long‑range dependencies without exploding computational cost.
- Entity‑driven Contextualization – Incorporates entity‑level signals (named entities, coreference clusters) to produce context‑aware representations, enabling the model to understand which parts of a document are relevant to a query.
- Scope‑Determination Module – A lightweight classifier predicts the retrieval scope (e.g., whole document vs. specific passage) on the fly, addressing the “how much to fetch” problem that plagues RAG pipelines.
- Efficiency Parity with Dense Retrieval – Despite handling much longer inputs, the model runs at comparable latency to standard dense retrievers (e.g., DPR, ANCE).
- Strong Empirical Gains – Outperforms state‑of‑the‑art long‑document retrievers on multiple benchmarks (LongEval, Wiki-Long, and a proprietary legal‑case set) by 8–15 % absolute recall@10.
Methodology
1. Input Encoding
- The source document is split into overlapping chunks (e.g., 512 tokens).
- Each chunk passes through a frozen transformer encoder (e.g., BERT‑base) to obtain token‑level hidden states.
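The chunking step can be sketched as follows. The 512‑token chunk size comes from the paper; the overlap width (64 tokens here) and the function name are illustrative assumptions, since the paper only states that chunks overlap:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into overlapping, fixed-size chunks.

    Consecutive chunks share `overlap` tokens so that no sentence is
    cut off without context on at least one side.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already covers the tail of the document
    return chunks
```

Each resulting chunk would then be encoded independently by the frozen transformer.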
2. Entity Extraction & Tagging
- A lightweight NER module tags entities in each chunk.
- Entity embeddings are pooled and concatenated with the chunk’s CLS token, forming an entity‑augmented representation.
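A minimal sketch of the entity‑augmented representation. The paper says entity embeddings are "pooled and concatenated" with the CLS token; mean pooling and the zero‑vector fallback for entity‑free chunks are assumptions here:

```python
import numpy as np

def entity_augmented_rep(cls_vec, entity_vecs):
    """Concatenate a chunk's CLS vector with mean-pooled entity embeddings.

    Chunks with no tagged entities fall back to a zero vector so every
    chunk yields a representation of the same dimensionality (2 * d).
    """
    if entity_vecs:
        pooled = np.mean(np.stack(entity_vecs), axis=0)
    else:
        pooled = np.zeros_like(cls_vec)
    return np.concatenate([cls_vec, pooled])
```

The doubled dimensionality keeps the entity signal explicit rather than folding it into the CLS vector by addition.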
3. Attention‑Based Fusion
- All chunk representations are fed into a global attention layer that treats each chunk as a “token” in a higher‑level sequence.
- This layer learns to weight chunks based on their relevance to the query, effectively summarizing the entire document into a single context‑aware vector.
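The fusion step can be illustrated with a single‑head, query‑conditioned attention over chunk embeddings. The scaled dot‑product form and single head are assumptions; the paper only specifies that chunks act as "tokens" in a higher‑level sequence:

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attend_over_chunks(query_vec, chunk_vecs):
    """Weight chunk embeddings by relevance to the query and sum them.

    scores = K q / sqrt(d), weights = softmax(scores),
    doc_vec = weighted sum of chunk embeddings.
    """
    K = np.stack(chunk_vecs)                      # (n_chunks, d)
    scores = K @ query_vec / np.sqrt(K.shape[1])  # one score per chunk
    weights = softmax(scores)
    doc_vec = weights @ K                         # context-aware document vector
    return doc_vec, weights
```

The attention weights double as an interpretable relevance score per chunk, which the scope module below can reuse for top‑k passage selection.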
4. Scope Prediction
- A binary classifier (trained jointly) decides whether the query needs the full document embedding or a passage‑level embedding.
- If passage‑level is chosen, the top‑k attended chunks are returned; otherwise, the whole‑document vector is used.
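A lightweight sketch of the scope module. A logistic classifier over some query–document feature vector is assumed (the paper does not specify the features or the classifier form), with top‑k selection driven by the attention weights from the fusion layer:

```python
import numpy as np

def predict_scope(features, w, b, threshold=0.5):
    """Binary scope classifier: sigmoid(w . x + b) > threshold => passage-level."""
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    return ("passage", p) if p > threshold else ("document", p)

def select_passages(attn_weights, chunks, k=2):
    """Return the top-k chunks ranked by their attention weight."""
    top = np.argsort(attn_weights)[::-1][:k]
    return [chunks[i] for i in top]
```

At query time, a "document" decision returns the fused whole‑document vector, while a "passage" decision falls through to `select_passages`.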
5. Training Regime
- Contrastive loss (InfoNCE) aligns query vectors with the correct document vectors while pushing away negatives.
- Hard negatives are mined from the same corpus using BM25 and from other long documents that share entities.
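The InfoNCE objective for a single query can be written out directly. Cosine similarity and the temperature value are assumptions (the paper names the loss but not these details):

```python
import numpy as np

def info_nce_loss(query, pos_doc, neg_docs, temperature=0.05):
    """InfoNCE for one query: -log( exp(s+/t) / sum_j exp(s_j/t) ),
    where s+ is the similarity to the positive document and the sum
    runs over the positive plus all negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    sims = np.array([cos(query, pos_doc)] + [cos(query, d) for d in neg_docs])
    logits = sims / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

Minimizing this pulls the query toward its positive document and pushes it away from the BM25‑mined and entity‑sharing hard negatives.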
The whole pipeline is end‑to‑end differentiable, yet the heavy transformer encoder can stay frozen, keeping training and inference cheap.
Results & Findings
Recall@10 on each benchmark:
| Dataset | Prior SOTA | DPR | AttentionRetriever |
|---|---|---|---|
| LongEval (news articles) | 0.68 | 0.55 | 0.78 |
| Wiki‑Long (Wikipedia sections) | 0.74 | 0.61 | 0.86 |
| Legal‑Case (10k contracts) | 0.62 | 0.48 | 0.77 |
- Context‑awareness: Ablation removing entity augmentation drops performance by ~4 %, confirming the value of entity signals.
- Scope module: Using a fixed retrieval scope (always whole‑doc) reduces recall by ~3 %, showing the benefit of dynamic scope selection.
- Efficiency: Average query latency on a single V100 GPU is ~45 ms, comparable to DPR’s ~38 ms, despite processing up to 8k tokens per document.
Overall, the model delivers large accuracy gains while staying within the latency envelope required for real‑time RAG systems.
Practical Implications
- RAG Pipelines for Enterprise Docs – Companies can plug AttentionRetriever into existing LLM‑based assistants (e.g., Copilot‑style tools) to fetch precise sections from massive policy manuals, reducing hallucinations.
- Search‑as‑You‑Type for Long Content – UI developers can build autocomplete that surfaces relevant paragraphs from books or standards without pre‑segmenting the corpus.
- Legal & Compliance Automation – The entity‑driven approach naturally aligns with contract clause extraction, enabling faster compliance checks.
- Cost‑Effective Scaling – Because the heavy transformer encoder can be shared across queries, cloud providers can serve more requests per GPU, lowering operational expenses.
In short, developers now have a drop‑in retrieval component that respects the semantic span of long texts without sacrificing latency.
Limitations & Future Work
- Entity Dependency – The model relies on a reasonably accurate NER system; noisy entity tags (e.g., in low‑resource languages) can degrade performance.
- Fixed Chunk Size – Overlapping chunks of 512 tokens work well for English but may need tuning for languages with longer average token spans.
- Training Data Bias – The contrastive training set is dominated by news and Wikipedia; domain‑specific corpora (e.g., code repositories) may require additional fine‑tuning.
Future directions suggested by the authors include:
- Jointly learning NER and retrieval to reduce error propagation.
- Hierarchical attention that can handle truly massive documents (hundreds of thousands of tokens).
- Multilingual extensions leveraging cross‑lingual entity linking to broaden applicability.
Authors
- David Jiahao Fu
- Lam Thanh Do
- Jiayu Li
- Kevin Chen-Chuan Chang
Paper Information
- arXiv ID: 2602.12278v1
- Categories: cs.IR, cs.AI
- Published: February 12, 2026