[Paper] POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

Published: June 8, 2026 at 01:43 PM EDT

Source: arXiv - 2606.09788v1

Overview

Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M-parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark, including frontier MLLMs, achieving a $\textrm{GriTS}_{\textrm{Con}}$ of 0.964 while running over 130$\times$ faster at roughly 300$\times$ lower cost. Further, POTATR’s output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.
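
To make "geometric text assignment" concrete, here is a minimal Python sketch of one common way to do it: each word box (from a PDF text layer or an external OCR engine) goes to the cell box that covers most of the word's area. This is an illustration under stated assumptions, not the authors' implementation. The `Box` class, the `assign_words_to_cells` function, the `min_overlap` threshold, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in page coordinates (hypothetical helper)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Box") -> float:
        # Overlap area of the two boxes; 0.0 when they do not overlap.
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(0.0, w) * max(0.0, h)


def assign_words_to_cells(cells, words, min_overlap=0.5):
    """Assign each word to the cell that covers the largest share of it.

    cells: dict mapping a cell id to its Box.
    words: list of (text, Box) pairs.
    min_overlap: assumed threshold; words covered less than this are dropped.
    Returns a dict mapping each cell id to its joined text.
    """
    assigned = {cell_id: [] for cell_id in cells}
    for text, word_box in words:
        word_area = word_box.area()
        if word_area == 0:
            continue  # skip degenerate boxes
        best_id, best_frac = None, 0.0
        for cell_id, cell_box in cells.items():
            frac = word_box.intersection(cell_box) / word_area
            if frac > best_frac:
                best_id, best_frac = cell_id, frac
        if best_id is not None and best_frac >= min_overlap:
            assigned[best_id].append((word_box.y0, word_box.x0, text))
    # Naive reading order: top-to-bottom, then left-to-right.
    # Real pipelines usually group words into lines first.
    return {
        cell_id: " ".join(text for _, _, text in sorted(items))
        for cell_id, items in assigned.items()
    }


# Made-up example: two cells in one row, plus a word outside the table.
cells = {
    (0, 0): Box(0, 0, 100, 20),
    (0, 1): Box(100, 0, 200, 20),
}
words = [
    ("Total", Box(5, 4, 40, 16)),
    ("revenue", Box(45, 4, 95, 16)),
    ("42", Box(120, 4, 140, 16)),
    ("stray", Box(300, 300, 340, 312)),
]
table_text = assign_words_to_cells(cells, words)
print(table_text)
```

Because the assignment depends only on box geometry, the same routine works whether the word boxes come from an embedded PDF text layer or from OCR on a scanned page.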

Key Contributions

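Based on the abstract, the paper's main contributions are:

POTATR, a lightweight 29M-parameter image-to-graph model that extends the Table Transformer (TATR) to contextualized page-level TE.
The best result among all models tested on the PubTables-v2 Single Pages benchmark, including frontier MLLMs: a $\textrm{GriTS}_{\textrm{Con}}$ of 0.964, while running over 130$\times$ faster at roughly 300$\times$ lower cost.
Spatially grounded output, in which every recognized element has a bounding box.
A planned release of code and models.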

Methodology

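Per the abstract, POTATR is an image-to-graph model that extends the Table Transformer (TATR) to page-level TE, and every element it recognizes comes with a bounding box. Please refer to the full paper for the detailed methodology.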

Practical Implications

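Because every recognized element has a bounding box, POTATR's output can be verified visually and text can be assigned to table elements geometrically. The abstract also notes that the model composes with other components: external OCR extends it to scanned documents, and techniques like cross-page merging extend it to full-document TE. The reported gains in speed (over 130$\times$ faster) and cost (roughly 300$\times$ lower) are aimed at large-scale document processing.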

Authors

Brandon Smock
Libin Liang
Max Sokolov
Amrit Ramesh
Valerie Faucon-Morin
Tayyibah Khanam
Maury Courtland

Paper Information

arXiv ID: 2606.09788v1
Categories: cs.CV
Published: June 8, 2026
