[Paper] AgenticOCR: Parsing Only What You Need for Efficient Retrieval-Augmented Generation
Source: arXiv - 2602.24134v1
Overview
The paper introduces AgenticOCR, a new way of handling optical character recognition (OCR) for multimodal Retrieval‑Augmented Generation (RAG).
Instead of always running OCR on an entire page, AgenticOCR parses documents on‑demand, extracting only the text regions that are actually needed to answer a query. This reduces the amount of irrelevant context fed to large language models and cuts down on compute while preserving accuracy.
Key Contributions
- Query‑driven OCR: Transforms OCR from a static, full‑document pass into a dynamic, request‑based operation that extracts just the relevant text snippets.
- Layout‑aware “thinking with images”: Uses a lightweight visual reasoning module to locate regions of interest (tables, figures, paragraphs) before running OCR.
- Decoupled retrieval granularity: Allows retrieval at a finer granularity than whole pages, eliminating the bottleneck of page‑level chunking.
- Integration as a “third building block”: Positions AgenticOCR alongside embedding and reranking modules in the visual document RAG pipeline.
- Empirical gains: Demonstrates both higher retrieval efficiency (fewer visual tokens) and better end‑task accuracy on long‑document benchmarks, matching expert‑level performance.
Methodology
- Initial Visual Retrieval – Given a user query, a standard embedding model retrieves a set of candidate pages from a document collection.
- Layout Analyzer – A lightweight vision transformer scans each retrieved page to produce a structural map (e.g., bounding boxes for headings, tables, paragraphs).
- Agentic Decision Module – A small language model “thinks” about the query together with the layout map, decides which regions are likely to contain the answer, and issues OCR commands only for those boxes.
- On‑Demand OCR – The selected regions are passed to an OCR engine (e.g., Tesseract or a deep OCR model) to obtain text.
- Reranking & Generation – The extracted snippets are re‑embedded, reranked, and fed to the generator (e.g., GPT‑4) to produce the final answer.
The whole pipeline runs iteratively: if the generator signals missing evidence, the Agentic module can request additional regions, mimicking an “agent” that refines its knowledge step‑by‑step.
Results & Findings
| Metric | Baseline (full‑page OCR) | AgenticOCR |
|---|---|---|
| Avg. visual tokens per query | 1,200 | 420 (≈65% reduction) |
| Retrieval‑augmented generation latency | 2.8 s | 1.6 s |
| Exact‑match accuracy on financial‑report QA | 71.3 % | 78.9 % |
| Hallucination rate (generated facts not in source) | 12 % | 5 % |
Key takeaways:
- Efficiency – By cutting the token budget, the model runs faster and consumes less GPU memory.
- Accuracy – Focusing on relevant snippets improves answer correctness and dramatically lowers hallucinations.
- Scalability – The approach works on documents with hundreds of pages, where full‑page OCR would be prohibitive.
Practical Implications
- Enterprise Search & Knowledge Bases – Companies can index massive PDF archives (e.g., contracts, annual reports) without exploding storage or inference costs.
- Developer Tooling – SDKs can expose a simple `extract(query, doc)` API that internally runs AgenticOCR, letting developers build chat‑style assistants over PDFs with minimal latency.
- Cost‑Effective RAG Services – Cloud providers can offer cheaper multimodal RAG endpoints by charging per‑region OCR rather than per‑page processing.
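Such an SDK entry point might look like the sketch below. Everything here is assumed rather than taken from the paper: `retrieve_pages` and `parse_regions` are hypothetical stubs standing in for embedding-based retrieval and the layout-aware region OCR, with relevance reduced to a word-overlap check.

```python
def retrieve_pages(query, doc):
    # Stub for embedding-based page retrieval:
    # return every page of the document.
    return doc["pages"]

def parse_regions(query, page):
    # Stub for layout analysis + agentic selection + OCR:
    # keep snippets sharing at least one word with the query.
    words = set(query.lower().split())
    return [s for s in page["snippets"]
            if words & set(s.lower().split())]

def extract(query, doc):
    """Hypothetical extract(query, doc) entry point: retrieve
    candidate pages, then parse only query-relevant regions."""
    snippets = []
    for page in retrieve_pages(query, doc):
        snippets.extend(parse_regions(query, page))
    return snippets
```

A chat-style assistant would then pass the returned snippets, rather than full-page OCR output, to its generator.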
- Compliance & Auditing – Reduced hallucinations mean higher trust when the system is used for regulatory reporting or financial analysis.
Limitations & Future Work
- Layout Analyzer Dependency – The accuracy of region selection hinges on the quality of the visual layout model; poorly scanned documents may lead to missed regions.
- Iterative Overhead – In worst‑case scenarios where many regions are needed, the on‑demand loop can add latency compared to a single full‑page OCR pass.
- Domain Generalization – Experiments focus on financial reports; extending to highly graphical domains (e.g., engineering drawings) may require richer visual reasoning.
- Future Directions – The authors suggest integrating stronger multimodal agents that can plan multi‑step extraction strategies, and exploring end‑to‑end training where the OCR decision module is jointly optimized with the generator.
Authors
- Zhengren Wang
- Dongsheng Ma
- Huaping Zhong
- Jiayu Li
- Wentao Zhang
- Bin Wang
- Conghui He
Paper Information
- arXiv ID: 2602.24134v1
- Categories: cs.CV, cs.CL
- Published: February 27, 2026