[Paper] Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference
Source: arXiv - 2602.10021v1
Overview
The paper introduces DRIFT (Decoupled Reasoning with Implicit Fact Tokens), a dual‑model framework that separates knowledge retrieval from reasoning in large language models (LLMs). By compressing retrieved documents into dense “fact tokens” that the reasoning model can consume directly, DRIFT sidesteps the classic context‑window bottleneck and delivers stronger performance on long‑context inference tasks.
Key Contributions
- Dual‑model architecture: a lightweight knowledge model that extracts and encodes relevant facts, and a reasoning model that performs inference using those encoded facts.
- Implicit fact tokens: query‑conditioned dense representations that replace raw text chunks, dramatically reducing token count while preserving semantic content.
- Dynamic, on‑the‑fly compression: unlike static prompt‑compression methods, DRIFT generates fact tokens per query, allowing flexible handling of ever‑changing knowledge sources.
- Empirical gains: consistent improvements over strong baselines (e.g., Retrieval‑Augmented Generation, prompt‑compression) across several long‑context benchmarks, with comparable model sizes.
- Open‑source implementation: code released on GitHub, facilitating replication and integration into existing pipelines.
Methodology
Knowledge Model (Retriever‑Encoder)
- Takes a user query and a set of retrieved document chunks (e.g., from BM25 or a neural retriever).
- Encodes each chunk into a fixed‑size vector conditioned on the query (so the same chunk can yield different fact tokens for different questions).
- These vectors are called implicit fact tokens.
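The query-conditioned encoding step can be sketched as follows. Everything here is a toy stand-in (hash-seeded word embeddings, a sigmoid gate); the paper's knowledge model is a learned neural encoder, but the sketch shows the key property that the same chunk yields different fact tokens for different queries:

```python
import numpy as np

DIM = 64  # toy fact-token size; in DRIFT it would match the projection input


def embed(text: str, dim: int = DIM) -> np.ndarray:
    """Toy deterministic embedding: mean of per-word hash-seeded vectors."""
    vecs = [np.random.default_rng(abs(hash(w)) % (2**32)).standard_normal(dim)
            for w in text.lower().split()]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)


def fact_token(chunk: str, query: str) -> np.ndarray:
    """Query-conditioned fact token: gate the chunk's features by how they
    agree with the query (illustrative stand-in for the learned encoder)."""
    c, q = embed(chunk), embed(query)
    gate = 1.0 / (1.0 + np.exp(-(c * q)))  # sigmoid gate on feature agreement
    return gate * c
```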
Projection Layer
- The fact tokens are linearly projected into the embedding space of the reasoning model, ensuring compatibility without fine‑tuning the reasoning model’s tokenizer.
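A minimal sketch of the projection step, assuming a 64-dimensional encoder and a 128-dimensional reasoning model (both sizes are illustrative, and the weights are randomly initialized here rather than learned):

```python
import numpy as np

ENC_DIM, LLM_DIM = 64, 128  # assumed sizes; real values depend on both models
rng = np.random.default_rng(0)

# In the paper this affine map is learned; it is random here for illustration.
W = rng.standard_normal((LLM_DIM, ENC_DIM)) / np.sqrt(ENC_DIM)
b = np.zeros(LLM_DIM)


def project(fact_tokens: np.ndarray) -> np.ndarray:
    """Map an (n, ENC_DIM) batch of fact tokens into the reasoning model's
    (n, LLM_DIM) embedding space, so they can be consumed like ordinary
    token embeddings without touching the tokenizer."""
    return fact_tokens @ W.T + b
```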
Reasoning Model (Generator)
- Receives the query plus the sequence of fact tokens (instead of the raw text).
- Performs standard autoregressive generation, treating fact tokens as ordinary tokens in its context window.
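Assembling the reasoning model's input can be sketched as a simple concatenation of embedding matrices; `build_context` and the dimensions are illustrative, not the paper's API. In libraries such as Hugging Face Transformers, the resulting matrix would be passed to the model via `inputs_embeds` instead of `input_ids`:

```python
import numpy as np


def build_context(fact_embeds: np.ndarray, query_embeds: np.ndarray) -> np.ndarray:
    """Prepend projected fact tokens to the query's token embeddings,
    producing one sequence the generator treats as ordinary context."""
    if fact_embeds.shape[1] != query_embeds.shape[1]:
        raise ValueError("fact and query embeddings must share the model dim")
    return np.concatenate([fact_embeds, query_embeds], axis=0)
```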
Training Loop
- The knowledge model is trained jointly with the reasoning model on a mixture of synthetic and real long‑context datasets.
- A contrastive loss encourages fact tokens to be informative for the downstream answer while staying compact.
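A generic InfoNCE-style loss illustrates the contrastive objective: pull each fact token toward the representation that supports the gold answer and push it away from distractors. This is a common stand-in; the paper's exact formulation may differ:

```python
import numpy as np


def info_nce(fact: np.ndarray, positive: np.ndarray,
             negatives: list, tau: float = 0.1) -> float:
    """InfoNCE-style loss over cosine similarities: low when `fact` is
    closer to `positive` than to any of the `negatives`."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    logits = np.array([cos(fact, positive)] +
                      [cos(fact, n) for n in negatives]) / tau
    logits -= logits.max()  # numerical stability before the softmax
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```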
Inference Pipeline
- Retrieve top‑k documents → encode each into fact tokens → feed tokens + query to the reasoning model → generate answer.
- Because each document chunk collapses to a single fact token, a 32 KB context can be represented with fewer than 200 tokens, well within typical LLM context windows.
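The pipeline above can be sketched end to end with stub components; the retriever, encoder, and projection here are all illustrative stand-ins (word-overlap ranking for BM25, hash-seeded vectors for the knowledge model), chosen only to show the shape of the data flow and the k-chunks-to-k-tokens compression:

```python
import numpy as np

ENC_DIM, LLM_DIM = 64, 128
rng = np.random.default_rng(0)
W = rng.standard_normal((LLM_DIM, ENC_DIM)) / np.sqrt(ENC_DIM)  # projection


def retrieve(query: str, corpus: list, k: int = 3) -> list:
    """Stub retriever: rank chunks by word overlap with the query
    (standing in for BM25 or a neural retriever)."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda c: -len(q & set(c.lower().split())))[:k]


def encode(chunk: str, query: str) -> np.ndarray:
    """Stub knowledge model: one ENC_DIM fact token per chunk, seeded by
    the (chunk, query) pair to mimic query conditioning."""
    r = np.random.default_rng(abs(hash((chunk, query))) % (2**32))
    return r.standard_normal(ENC_DIM)


def prepare_context(query: str, corpus: list, k: int = 3) -> np.ndarray:
    """Retrieve top-k chunks and return k projected fact-token embeddings,
    ready to prepend to the query for the reasoning model."""
    chunks = retrieve(query, corpus, k)
    facts = np.stack([encode(c, query) for c in chunks])  # (k, ENC_DIM)
    return facts @ W.T  # (k, LLM_DIM): k chunks -> k fact-token embeddings
```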
Results & Findings
| Benchmark | Baseline (RAG) | Prompt‑Compression | DRIFT |
|---|---|---|---|
| Long‑Form QA (30 k tokens) | 58.2 % EM | 61.5 % EM | 68.9 % EM |
| Multi‑Document Summarization | 44.7 % ROUGE‑L | 46.3 % ROUGE‑L | 53.1 % ROUGE‑L |
| Code‑Generation with Docs | 31.4 % Pass@1 | 33.0 % Pass@1 | 38.7 % Pass@1 |
- Token efficiency: DRIFT reduces the effective context length by roughly 85 % while preserving answer quality.
- Scalability: Gains hold across model sizes (7B‑13B) and different retriever back‑ends, indicating the approach is model‑agnostic.
- Robustness to noise: Because fact tokens are query‑aware, irrelevant retrieved passages have minimal impact on the final answer.
Practical Implications
- Extended context without larger models – Developers can feed massive knowledge bases (e.g., product manuals, legal corpora) to a modest‑size LLM and still get coherent answers.
- Cost‑effective inference – Fewer tokens mean lower API usage fees and faster latency, which is crucial for real‑time assistants or edge deployments.
- Modular pipelines – The knowledge model can be swapped out or fine‑tuned independently, allowing teams to specialize retrieval for their domain (e.g., medical literature) without retraining the main LLM.
- Dynamic knowledge updates – Since fact tokens are generated on‑the‑fly, adding or editing documents does not require re‑training the reasoning model, mitigating catastrophic forgetting.
- Potential for plug‑and‑play SDKs – The open‑source code can be wrapped into a microservice that sits between any LLM API and a document store, offering a drop‑in “long‑context enhancer”.
Limitations & Future Work
- Retriever dependence – DRIFT’s performance still hinges on the quality of the initial document retrieval; noisy retrievers can propagate errors into fact tokens.
- Training overhead – Jointly training the knowledge encoder adds a non‑trivial pre‑training cost, especially for very large corpora.
- Interpretability – Implicit fact tokens are dense vectors; debugging why a particular answer was generated is harder than with raw text snippets.
- Future directions suggested by the authors include: (1) exploring hierarchical fact‑token generation for ultra‑long documents, (2) integrating retrieval‑aware fine‑tuning of the reasoning model, and (3) adding lightweight token‑level attribution mechanisms to improve explainability.
Authors
- Wenxuan Xie
- Yujia Wang
- Xin Tan
- Chaochao Lu
- Xia Hu
- Xuhong Wang
Paper Information
- arXiv ID: 2602.10021v1
- Categories: cs.CL, cs.AI
- Published: February 10, 2026