[Paper] AttentionRetriever: Attention Layers are Secretly Long Document Retrievers
Source: arXiv - 2602.12278v1
Overview
The paper introduces AttentionRetriever, a new retrieval model that turns the attention layers inside modern language models into powerful long‑document search engines. By making the retrieval process context‑aware and causally aware, it bridges a gap that existing dense or sparse retrievers struggle with when dealing with multi‑page texts such as reports, manuals, or legal contracts.
Key Contributions
- Attention‑based Retrieval Engine – Re‑purposes the self‑attention mechanism of transformer models to generate document embeddings that capture long‑range dependencies without exploding computational cost.
- Entity‑driven Contextualization – Incorporates entity‑level signals (named entities, coreference clusters) to produce context‑aware representations, enabling the model to understand which parts of a document are relevant to a query.
- Scope‑Determination Module – A lightweight classifier predicts the retrieval scope (e.g., whole document vs. specific passage) on the fly, addressing the “how much to fetch” problem that plagues RAG pipelines.
- Efficiency Parity with Dense Retrieval – Despite handling much longer inputs, the model runs at comparable latency to standard dense retrievers (e.g., DPR, ANCE).
- Strong Empirical Gains – Outperforms state‑of‑the‑art long‑document retrievers on multiple benchmarks (LongEval, Wiki-Long, and a proprietary legal‑case set) by 8–15 % absolute recall@10.
Methodology
1. Input Encoding
- The source document is split into overlapping chunks (e.g., 512 tokens).
- Each chunk passes through a frozen transformer encoder (e.g., BERT‑base) to obtain token‑level hidden states.
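The chunking step can be sketched as follows. The 512‑token chunk size comes from the paper; the overlap width (64 tokens here) and the function name are illustrative assumptions, since the paper only states that chunks overlap:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into overlapping, fixed-size chunks.

    Consecutive chunks share `overlap` tokens so that no sentence is
    cut off without context on at least one side.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already covers the tail of the document
    return chunks
```

Each resulting chunk would then be encoded independently by the frozen transformer.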
2. Entity Extraction & Tagging
- A lightweight NER module tags entities in each chunk.
- Entity embeddings are pooled and concatenated with the chunk’s CLS token, forming an entity‑augmented representation.
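A minimal sketch of the entity‑augmented representation. The paper says entity embeddings are "pooled and concatenated" with the CLS token; mean pooling and the zero‑vector fallback for entity‑free chunks are assumptions here:

```python
import numpy as np

def entity_augmented_rep(cls_vec, entity_vecs):
    """Concatenate a chunk's CLS vector with mean-pooled entity embeddings.

    Chunks with no tagged entities fall back to a zero vector so every
    chunk yields a representation of the same dimensionality (2 * d).
    """
    if entity_vecs:
        pooled = np.mean(np.stack(entity_vecs), axis=0)
    else:
        pooled = np.zeros_like(cls_vec)
    return np.concatenate([cls_vec, pooled])
```

The doubled dimensionality keeps the entity signal explicit rather than folding it into the CLS vector by addition.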
3. Attention‑Based Fusion
- All chunk representations are fed into a global attention layer that treats each chunk as a “token” in a higher‑level sequence.
- This layer learns to weight chunks based on their relevance to the query, effectively summarizing the entire document into a single context‑aware vector.
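The fusion step can be illustrated with a single‑head, query‑conditioned attention over chunk embeddings. The scaled dot‑product form and single head are assumptions; the paper only specifies that chunks act as "tokens" in a higher‑level sequence:

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attend_over_chunks(query_vec, chunk_vecs):
    """Weight chunk embeddings by relevance to the query and sum them.

    scores = K q / sqrt(d), weights = softmax(scores),
    doc_vec = weighted sum of chunk embeddings.
    """
    K = np.stack(chunk_vecs)                      # (n_chunks, d)
    scores = K @ query_vec / np.sqrt(K.shape[1])  # one score per chunk
    weights = softmax(scores)
    doc_vec = weights @ K                         # context-aware document vector
    return doc_vec, weights
```

The attention weights double as an interpretable relevance score per chunk, which the scope module below can reuse for top‑k passage selection.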
4. Scope Prediction
- A binary classifier (trained jointly) decides whether the query needs the full document embedding or a passage‑level embedding.
- If passage‑level is chosen, the top‑k attended chunks are returned; otherwise, the whole‑document vector is used.
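A lightweight sketch of the scope module. A logistic classifier over some query–document feature vector is assumed (the paper does not specify the features or the classifier form), with top‑k selection driven by the attention weights from the fusion layer:

```python
import numpy as np

def predict_scope(features, w, b, threshold=0.5):
    """Binary scope classifier: sigmoid(w . x + b) > threshold => passage-level."""
    p = 1.0 / (1.0 + np.exp(-(features @ w + b)))
    return ("passage", p) if p > threshold else ("document", p)

def select_passages(attn_weights, chunks, k=2):
    """Return the top-k chunks ranked by their attention weight."""
    top = np.argsort(attn_weights)[::-1][:k]
    return [chunks[i] for i in top]
```

At query time, a "document" decision returns the fused whole‑document vector, while a "passage" decision falls through to `select_passages`.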
5. Training Regime
- Contrastive loss (InfoNCE) aligns query vectors with the correct document vectors while pushing away negatives.
- Hard negatives are mined from the same corpus using BM25 and from other long documents that share entities.
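The InfoNCE objective for a single query can be written out directly. Cosine similarity and the temperature value are assumptions (the paper names the loss but not these details):

```python
import numpy as np

def info_nce_loss(query, pos_doc, neg_docs, temperature=0.05):
    """InfoNCE for one query: -log( exp(s+/t) / sum_j exp(s_j/t) ),
    where s+ is the similarity to the positive document and the sum
    runs over the positive plus all negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    sims = np.array([cos(query, pos_doc)] + [cos(query, d) for d in neg_docs])
    logits = sims / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

Minimizing this pulls the query toward its positive document and pushes it away from the BM25‑mined and entity‑sharing hard negatives.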
The whole pipeline is end‑to‑end differentiable, yet the heavy transformer encoder can stay frozen, keeping training and inference cheap.
Results & Findings
Recall@10 on each benchmark:
| Dataset | Prior SOTA | DPR | AttentionRetriever |
|---|---|---|---|
| LongEval (news articles) | 0.68 | 0.55 | 0.78 |
| Wiki‑Long (Wikipedia sections) | 0.74 | 0.61 | 0.86 |
| Legal‑Case (10k contracts) | 0.62 | 0.48 | 0.77 |
- Context‑awareness: Ablation removing entity augmentation drops performance by ~4 %, confirming the value of entity signals.
- Scope module: Using a fixed retrieval scope (always whole‑doc) reduces recall by ~3 %, showing the benefit of dynamic scope selection.
- Efficiency: Average query latency on a single V100 GPU is ~45 ms, comparable to DPR’s ~38 ms, despite processing up to 8k tokens per document.
Overall, the model delivers large accuracy gains while staying within the latency envelope required for real‑time RAG systems.
Practical Implications
- RAG Pipelines for Enterprise Docs – Companies can plug AttentionRetriever into existing LLM‑based assistants (e.g., Copilot‑style tools) to fetch precise sections from massive policy manuals, reducing hallucinations.
- Search‑as‑You‑Type for Long Content – UI developers can build autocomplete that surfaces relevant paragraphs from books or standards without pre‑segmenting the corpus.
- Legal & Compliance Automation – The entity‑driven approach naturally aligns with contract clause extraction, enabling faster compliance checks.
- Cost‑Effective Scaling – Because the heavy transformer encoder can be shared across queries, cloud providers can serve more requests per GPU, lowering operational expenses.
In short, developers now have a drop‑in retrieval component that respects the semantic span of long texts without sacrificing latency.
Limitations & Future Work
- Entity Dependency – The model relies on a reasonably accurate NER system; noisy entity tags (e.g., in low‑resource languages) can degrade performance.
- Fixed Chunk Size – Overlapping chunks of 512 tokens work well for English but may need tuning for languages with longer average token spans.
- Training Data Bias – The contrastive training set is dominated by news and Wikipedia; domain‑specific corpora (e.g., code repositories) may require additional fine‑tuning.
Future directions suggested by the authors include:
- Jointly learning NER and retrieval to reduce error propagation.
- Hierarchical attention that can handle truly massive documents (hundreds of thousands of tokens).
- Multilingual extensions leveraging cross‑lingual entity linking to broaden applicability.
Authors
- David Jiahao Fu
- Lam Thanh Do
- Jiayu Li
- Kevin Chen-Chuan Chang
Paper Information
- arXiv ID: 2602.12278v1
- Categories: cs.IR, cs.AI
- Published: February 12, 2026