[Paper] AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation
Source: arXiv - 2602.24134v1
Overview
The paper introduces AgenticOCR, a new way of handling optical character recognition (OCR) for multimodal Retrieval‑Augmented Generation (RAG).
Instead of always running OCR on an entire page, AgenticOCR parses documents on‑demand, extracting only the text regions that are actually needed to answer a query. This reduces the amount of irrelevant context fed to large language models and cuts down on compute while preserving accuracy.
Key Contributions
- Query‑driven OCR: Transforms OCR from a static, full‑document pass into a dynamic, request‑based operation that extracts just the relevant text snippets.
- Layout‑aware “thinking with images”: Uses a lightweight visual reasoning module to locate regions of interest (tables, figures, paragraphs) before running OCR.
- Decoupled retrieval granularity: Allows retrieval at a finer granularity than whole pages, eliminating the bottleneck of page‑level chunking.
- Integration as a “third building block”: Positions AgenticOCR alongside embedding and reranking modules in the visual document RAG pipeline.
- Empirical gains: Demonstrates both higher retrieval efficiency (fewer visual tokens) and better end‑task accuracy on long‑document benchmarks, matching expert‑level performance.
Methodology
- Initial Visual Retrieval – Given a user query, a standard embedding model retrieves a set of candidate pages from a document collection.
- Layout Analyzer – A lightweight vision transformer scans each retrieved page to produce a structural map (e.g., bounding boxes for headings, tables, paragraphs).
- Agentic Decision Module – A small language model “thinks” about the query together with the layout map, decides which regions are likely to contain the answer, and issues OCR commands only for those boxes.
- On‑Demand OCR – The selected regions are passed to an OCR engine (e.g., Tesseract or a deep OCR model) to obtain text.
- Reranking & Generation – The extracted snippets are re‑embedded, reranked, and fed to the generator (e.g., GPT‑4) to produce the final answer.
The whole pipeline runs iteratively: if the generator signals missing evidence, the Agentic module can request additional regions, mimicking an “agent” that refines its knowledge step‑by‑step.
Results & Findings
| Metric | Baseline (full‑page OCR) | AgenticOCR |
|---|---|---|
| Avg. visual tokens per query | 1,200 | 420 (≈65% reduction) |
| Retrieval‑augmented generation latency | 2.8 s | 1.6 s |
| Exact‑match accuracy on financial‑report QA | 71.3 % | 78.9 % |
| Hallucination rate (generated facts not in source) | 12 % | 5 % |
Key takeaways:
- Efficiency – By cutting the token budget, the model runs faster and consumes less GPU memory.
- Accuracy – Focusing on relevant snippets improves answer correctness and dramatically lowers hallucinations.
- Scalability – The approach works on documents with hundreds of pages, where full‑page OCR would be prohibitive.
Practical Implications
- Enterprise Search & Knowledge Bases – Companies can index massive PDF archives (e.g., contracts, annual reports) without exploding storage or inference costs.
- Developer Tooling – SDKs can expose a simple `extract(query, doc)` API that internally runs AgenticOCR, letting developers build chat‑style assistants over PDFs with minimal latency.
- Cost‑Effective RAG Services – Cloud providers can offer cheaper multimodal RAG endpoints by charging per‑region OCR rather than per‑page processing.
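Such an SDK entry point might look like the sketch below. Everything here is assumed rather than taken from the paper: `retrieve_pages` and `parse_regions` are hypothetical stubs standing in for embedding-based retrieval and the layout-aware region OCR, with relevance reduced to a word-overlap check.

```python
def retrieve_pages(query, doc):
    # Stub for embedding-based page retrieval:
    # return every page of the document.
    return doc["pages"]

def parse_regions(query, page):
    # Stub for layout analysis + agentic selection + OCR:
    # keep snippets sharing at least one word with the query.
    words = set(query.lower().split())
    return [s for s in page["snippets"]
            if words & set(s.lower().split())]

def extract(query, doc):
    """Hypothetical extract(query, doc) entry point: retrieve
    candidate pages, then parse only query-relevant regions."""
    snippets = []
    for page in retrieve_pages(query, doc):
        snippets.extend(parse_regions(query, page))
    return snippets
```

A chat-style assistant would then pass the returned snippets, rather than full-page OCR output, to its generator.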
- Compliance & Auditing – Reduced hallucinations mean higher trust when the system is used for regulatory reporting or financial analysis.
Limitations & Future Work
- Layout Analyzer Dependency – The accuracy of region selection hinges on the quality of the visual layout model; poorly scanned documents may lead to missed regions.
- Iterative Overhead – In worst‑case scenarios where many regions are needed, the on‑demand loop can add latency compared to a single full‑page OCR pass.
- Domain Generalization – Experiments focus on financial reports; extending to highly graphical domains (e.g., engineering drawings) may require richer visual reasoning.
- Future Directions – The authors suggest integrating stronger multimodal agents that can plan multi‑step extraction strategies, and exploring end‑to‑end training where the OCR decision module is jointly optimized with the generator.
Authors
- Zhengren Wang
- Dongsheng Ma
- Huaping Zhong
- Jiayu Li
- Wentao Zhang
- Bin Wang
- Conghui He
Paper Information
- arXiv ID: 2602.24134v1
- Categories: cs.CV, cs.CL
- Published: February 27, 2026