[Paper] Agentar-Fin-OCR
Source: arXiv - 2603.11044v1
Overview
The paper introduces Agentar‑Fin‑OCR, a purpose‑built OCR and document‑parsing pipeline for ultra‑long, complex financial PDFs (annual reports, prospectuses, regulatory filings, etc.). By tackling cross‑page layout breaks and cell‑level referencing, the system delivers structured, audit‑grade outputs that can be directly consumed by downstream analytics, compliance, and reporting tools.
Key Contributions
- Cross‑page Contents Consolidation – an algorithm that stitches fragmented tables and sections spread over multiple pages, restoring logical continuity.
- Document‑level Heading Hierarchy Reconstruction (DHR) – builds a global Table of Contents (TOC) tree, enabling structure‑aware search and retrieval across the whole document.
- Difficulty‑adaptive Curriculum Learning for table parsing – the model is trained progressively from easy to hard table layouts, boosting robustness on real‑world financial tables.
- CellBBoxRegressor – a decoder‑only module that predicts precise cell bounding boxes from hidden states using structural anchor tokens, eliminating the need for separate object detectors.
- FinDocBench – a new benchmark suite covering six financial document categories with expert‑verified annotations and novel metrics (TOC edit‑distance similarity, cross‑page TEDS, Cell IoU).
- Comprehensive evaluation of state‑of‑the‑art OCR and table‑parsing models on FinDocBench, highlighting gaps that Agentar‑Fin‑OCR closes.
Methodology
Pre‑processing & Layout Detection
- PDFs are rasterized page‑by‑page.
- A backbone vision encoder (e.g., Swin‑Transformer) extracts visual tokens and predicts coarse layout elements (paragraphs, tables, figures).
Cross‑page Consolidation
- Tables that are split across page boundaries are detected and linked by similarity‑based matching of header/footer cues and content embeddings.
- The matched fragments are merged into a single logical table representation before parsing.
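The linking step above can be illustrated with a minimal sketch. It is not the paper's algorithm: the `TableFragment` fields, the equal weighting of header overlap and embedding similarity, and the `threshold` value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class TableFragment:
    """One table piece detected on a single page (all fields are hypothetical)."""
    page: int
    header: tuple      # column header strings
    embedding: tuple   # content embedding as floats
    rows: list         # body rows, each a list of cell strings


def jaccard(a, b):
    """Set overlap of two header tuples, in [0, 1]."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def group_fragments(fragments, threshold=0.7):
    """Chain fragments on consecutive pages whose combined similarity is high.

    Simplification: each fragment is only compared with the last fragment of
    the most recent group, so at most one table may continue per page break.
    """
    groups = []
    for frag in sorted(fragments, key=lambda f: f.page):
        if groups:
            prev = groups[-1][-1]
            score = 0.5 * jaccard(prev.header, frag.header) \
                + 0.5 * cosine(prev.embedding, frag.embedding)
            if frag.page == prev.page + 1 and score >= threshold:
                groups[-1].append(frag)
                continue
        groups.append([frag])
    return groups


def merge_group(group):
    """Concatenate the rows of linked fragments into one logical table."""
    return {
        "pages": [f.page for f in group],
        "header": list(group[0].header),
        "rows": [row for f in group for row in f.rows],
    }


if __name__ == "__main__":
    frags = [
        TableFragment(1, ("Item", "2023", "2022"), (1.0, 0.0), [["Revenue", "10", "9"]]),
        TableFragment(2, ("Item", "2023", "2022"), (0.9, 0.1), [["Costs", "4", "3"]]),
        TableFragment(3, ("Name", "Role"), (0.0, 1.0), [["A. Smith", "CEO"]]),
    ]
    for g in group_fragments(frags):
        print(merge_group(g))
```

Merging before parsing, as the section describes, means the parser sees one table with a single header rather than several headerless continuations.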
Heading Hierarchy Reconstruction (DHR)
- Heading tokens are classified by level (H1‑H4) using a lightweight classifier.
- A tree‑building algorithm connects headings based on indentation, font cues, and positional hierarchy, producing a global TOC that can be queried like a JSON tree.
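As a rough illustration of the tree-building step, the sketch below nests headings with a stack, using only the predicted level of each heading. The paper also mentions indentation, font, and positional cues, which are omitted here; the `build_toc` name and the node fields are hypothetical.

```python
import json


def build_toc(headings):
    """Build a nested TOC from (level, title, page) tuples in reading order.

    Levels are assumed to be integers 1..4 (H1..H4). A heading becomes a
    child of the nearest preceding heading with a strictly smaller level.
    """
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]  # path from the root to the most recently added heading
    for level, title, page in headings:
        node = {"title": title, "level": level, "page": page, "children": []}
        # Pop siblings and deeper headings until a valid parent is on top.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


if __name__ == "__main__":
    headings = [
        (1, "Financial Statements", 10),
        (2, "Balance Sheet", 11),
        (3, "Assets", 11),
        (2, "Cash Flow Statement", 14),
        (1, "Notes", 20),
    ]
    print(json.dumps(build_toc(headings), indent=2))
```

Because the result is plain nested dictionaries, it serializes directly to JSON, which matches the "queried like a JSON tree" description above.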
Table Parsing with Curriculum Learning
- Training data is ordered from simple (single‑page, uniform grids) to complex (spanning pages, merged cells, multi‑column headers).
- The model learns to predict cell content sequences and, via the CellBBoxRegressor, directly regresses cell bounding boxes from decoder states, using anchor tokens (e.g., “<cell‑start>”) as reference points.
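The easy-to-hard ordering could look roughly like the sketch below. The `difficulty` score, its weights, and the sample fields (`num_pages`, `merged_cells`, `header_rows`) are assumptions made for illustration; the paper's difficulty-adaptive scheme may measure difficulty differently, for example from model loss.

```python
def difficulty(sample):
    """Hypothetical static difficulty score: higher means a harder table."""
    return (
        2.0 * (sample["num_pages"] - 1)      # tables spanning pages
        + 1.0 * sample["merged_cells"]       # merged (spanning) cells
        + 1.5 * (sample["header_rows"] - 1)  # multi-row headers
    )


def curriculum_stages(samples, num_stages=3):
    """Return cumulative training pools, from easiest-only to the full set.

    Stage k contains the easiest ceil(k / num_stages) share of the data, so
    later stages keep revisiting easy tables while adding harder ones.
    """
    ordered = sorted(samples, key=difficulty)
    stages = []
    for k in range(1, num_stages + 1):
        cutoff = (len(ordered) * k + num_stages - 1) // num_stages  # integer ceil
        stages.append(ordered[:cutoff])
    return stages


if __name__ == "__main__":
    data = [
        {"id": "hard", "num_pages": 3, "merged_cells": 5, "header_rows": 3},
        {"id": "simple", "num_pages": 1, "merged_cells": 0, "header_rows": 1},
        {"id": "medium", "num_pages": 1, "merged_cells": 3, "header_rows": 2},
    ]
    for i, stage in enumerate(curriculum_stages(data), start=1):
        print(f"stage {i}:", [s["id"] for s in stage])
```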
Post‑processing & Provenance
- Every extracted cell is tagged with source page numbers and coordinates, providing an audit trail required for compliance use‑cases.
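One way to carry this audit trail is a small immutable record per cell, sketched below. The `CellProvenance` class, its field names, and the `(x0, y0, x1, y1)` coordinate convention are hypothetical; the paper's actual output schema is not described here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CellProvenance:
    """An extracted cell plus where it came from (hypothetical schema)."""
    table_id: str
    row: int
    col: int
    text: str
    page: int    # source page number, 1-based
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def to_audit_record(cell):
    """Serialize a cell to a stable JSON string suitable for logging."""
    return json.dumps(asdict(cell), sort_keys=True)


if __name__ == "__main__":
    cell = CellProvenance(
        table_id="income_statement",
        row=4,
        col=2,
        text="1,234.5",
        page=12,
        bbox=(72.0, 310.5, 140.0, 324.0),
    )
    print(to_audit_record(cell))
```

Freezing the dataclass prevents downstream code from silently altering a value after extraction, which helps keep the recorded page and coordinates trustworthy.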
Results & Findings
| Metric | Agentar‑Fin‑OCR | Best Prior Model |
|---|---|---|
| Table‑level TEDS (cross‑page) | 0.92 | 0.78 |
| TOC Edit‑Distance Similarity (TocEDS) | 0.95 | 0.81 |
| Cell IoU (C‑IoU) | 0.88 | 0.71 |
| End‑to‑end processing time (10‑page PDF) | 3.2 s | 4.7 s |
- Accuracy gains are especially pronounced on multi‑page tables and deeply nested headings, where prior models often lose alignment.
- Curriculum learning contributed an absolute improvement of ~4 percentage points on the hardest table subset.
- The CellBBoxRegressor removed the need for an external detector, cutting inference latency by ~30%.
FinDocBench revealed that many off‑the‑shelf OCR solutions still struggle with cross‑page continuity and financial‑specific jargon, confirming the practical relevance of the proposed system.
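For readers unfamiliar with the metric families in the table, the sketch below shows the standard building blocks they are presumably based on: intersection-over-union for boxes and a length-normalized edit distance for heading sequences. These are generic definitions, not FinDocBench's exact formulas, which may add matching or weighting steps; the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def edit_similarity(pred, gold):
    """1 minus edit distance normalized by the longer sequence, in [0, 1]."""
    longest = max(len(pred), len(gold))
    return 1.0 if longest == 0 else 1.0 - edit_distance(pred, gold) / longest


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
    pred = ["1 Overview", "2 Balance Sheet", "2 Cash Flow"]
    gold = ["1 Overview", "2 Balance Sheet", "2 Notes"]
    print(edit_similarity(pred, gold))
```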
Practical Implications
- Compliance Automation – Auditors can ingest a regulator‑mandated filing and instantly retrieve a verified, page‑referenced table of financial statements, reducing manual cross‑checking.
- Data‑Lake Ingestion – Enterprises can stream massive batches of annual reports into a structured data lake (JSON/Parquet) without custom parsing scripts.
- Search & Retrieval – The global TOC tree enables semantic search (“find all cash‑flow tables in 2023 reports”) with sub‑second latency.
- Financial Modeling – Quant teams can pull cell‑level data (e.g., “Operating Income – Q4”) directly into models, preserving provenance for audit trails.
- Reduced Vendor Lock‑in – Because the system is end‑to‑end and does not rely on third‑party detectors, it can be packaged as a self‑contained microservice or integrated into existing document‑management pipelines.
Limitations & Future Work
- Domain Generalization – While tuned for finance, the current curriculum and anchor token set may need re‑training for other verticals (legal, medical).
- Handwritten Annotations – The pipeline assumes printed text; handwritten signatures or marginalia are not yet supported.
- Scalability to Gigabyte‑scale PDFs – Experiments were capped at ~50 pages; handling multi‑hundred‑page filings will require memory‑efficient tiling strategies.
- Explainability – The internal decision process for cross‑page merging is heuristic; future work aims to make it fully learnable and provide confidence scores.
Overall, Agentar‑Fin‑OCR and the accompanying FinDocBench set a new baseline for reliable, production‑grade financial document parsing, opening the door for more automated, audit‑ready data pipelines in the finance industry.
Paper Information
- arXiv ID: 2603.11044v1
- Categories: cs.CV
- Published: March 11, 2026