[Paper] ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Published: February 12, 2026
Source: arXiv - 2602.12203v1

Overview

The paper introduces ExStrucTiny, a new benchmark that pushes Vision‑Language Models (VLMs) to extract structured information from a wide variety of document images—think invoices, tax forms, medical reports, and more. By blending key‑entity extraction (KEE), relation extraction (RE), and visual question answering (VQA) into a single testbed, the authors expose how current “generalist” VLMs struggle when asked to adapt to changing schemas, ambiguous queries, and the need to point to exact locations in the document.

Key Contributions

  • A unified benchmark that combines KEE, RE, and VQA tasks on real‑world document images, covering many document types and flexible output schemas.
  • A hybrid data‑creation pipeline that mixes manually annotated samples with high‑quality synthetic data validated by humans, yielding a diverse yet reliable dataset.
  • Comprehensive evaluation of both open‑source and closed‑source VLMs, revealing concrete failure modes such as schema‑drift, under‑specified queries, and poor answer localization.
  • Diagnostic analysis tools (schema‑adaptation metrics, query‑specific difficulty scores) that help researchers pinpoint where models fall short.
  • Open‑source release of the dataset, evaluation scripts, and baseline results to foster reproducibility and community‑driven improvements.

Methodology

  1. Data Collection & Synthesis

    • Gathered a set of publicly available enterprise documents (forms, reports, receipts).
    • Designed a schema‑variable template language that can express arbitrary field hierarchies (e.g., Customer → Address → Zip).
    • Generated synthetic documents by programmatically populating these templates with realistic content, then had human annotators verify correctness.
  2. Task Unification

    • Each sample presents a query (natural‑language or schema‑driven) and expects a structured answer (key‑value pairs, relational triples, or a JSON‑like object).
    • The benchmark also asks the model to localize each answer element on the image (bounding box or polygon).
  3. Model Evaluation

    • Tested several VLMs (e.g., LayoutLMv3, Donut, GPT‑4V) in two modes:
      Closed‑book: model sees only the image and query.
      Open‑book: model receives an explicit schema definition as additional context.
    • Metrics include exact‑match on the structured output, schema‑adaptation score, and Intersection‑over‑Union (IoU) for localization.
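As a concrete illustration of steps 1 and 2, here is a minimal Python sketch of how a schema‑variable template might be expressed and populated, and what a unified benchmark sample could look like. All field names, filler values, and the record format are illustrative assumptions, not the paper's actual schema language.

```python
import json
import random

# Hypothetical schema-variable template: nested field hierarchies
# (e.g., Customer -> Address -> Zip) expressed as a dict whose leaves
# name a filler type used to populate synthetic content.
SCHEMA = {
    "Customer": {
        "Name": "name",
        "Address": {"Street": "street", "Zip": "zip"},
    },
    "InvoiceTotal": "amount",
}

# Placeholder content generators (an assumption; a real pipeline would
# draw from realistic entity pools).
FILLERS = {
    "name": lambda: random.choice(["Acme Corp", "Jane Doe"]),
    "street": lambda: f"{random.randint(1, 999)} Main St",
    "zip": lambda: f"{random.randint(10000, 99999)}",
    "amount": lambda: f"${random.uniform(10, 5000):.2f}",
}

def populate(schema: dict) -> dict:
    """Recursively fill a schema template with synthetic values."""
    return {
        key: populate(value) if isinstance(value, dict) else FILLERS[value]()
        for key, value in schema.items()
    }

# A unified sample pairs a query with a structured answer and a
# localization target (illustrative format, not the paper's).
answer = populate(SCHEMA)
sample = {
    "image": "synthetic_invoice_0001.png",
    "query": "Extract the customer's zip code.",
    "answer": {"Customer": {"Address": {"Zip": answer["Customer"]["Address"]["Zip"]}}},
    "localization": {"Customer.Address.Zip": [412, 188, 468, 204]},  # x0, y0, x1, y1
}
print(json.dumps(sample, indent=2))
```

In this sketch the same template drives both synthesis and evaluation: changing `SCHEMA` changes the expected output structure, which is exactly the schema variability the benchmark stresses.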

Results & Findings

  • Overall performance is modest: even the strongest closed‑book VLM achieved only ~38 % exact‑match, dropping to ~22 % when the schema deviated from the training distribution.
  • Open‑book models improve (up to ~45 % exact‑match) but still falter on under‑specified queries where the model must infer missing context.
  • Localization is the weakest link: average IoU hovers around 0.31, indicating that models can often name the right field but cannot reliably point to it on the page.
  • Schema variability hurts: models trained on a fixed set of fields see a steep accuracy decline (≈15 % absolute) when presented with a new schema layout.
  • Human‑validated synthetic data bridges gaps: adding synthetic samples boosts performance by ~6 % on unseen document types, showing the value of scalable data augmentation.
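The IoU figures above refer to the standard Intersection‑over‑Union between a predicted and a ground‑truth region; for axis‑aligned boxes in `(x0, y0, x1, y1)` format (an assumption; the benchmark also supports polygons), it can be computed as:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Intersection rectangle; clamp to zero when the boxes are disjoint.
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

An average IoU of 0.31 means predicted boxes typically overlap well under half of the union area with the ground truth, so they often only clip the correct field.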

Practical Implications

  • Enterprise automation pipelines (e.g., invoice processing, claim triage) need models that can flexibly adapt to new form layouts without re‑training; ExStrucTiny highlights the current gap and offers a testbed for iterative improvement.
  • Low‑code/no‑code document AI platforms can use ExStrucTiny to evaluate their “schema‑as‑prompt” features, ensuring that end‑users can define custom extraction schemas on the fly.
  • Developers building VLM‑powered bots (e.g., chat assistants that answer questions about PDFs) should be aware that answer localization is still unreliable—additional post‑processing (OCR + rule‑based anchoring) may be required.
  • Data‑centric AI teams can leverage the synthetic‑plus‑human pipeline to cheaply expand their training corpora for niche document types (e.g., customs declarations) while maintaining quality.
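The OCR‑plus‑rule‑based anchoring mentioned above can be sketched as follows: snap the model's textual answer to matching word‑level OCR boxes and return their union. The token format (word text plus `(x0, y0, x1, y1)` box) is an assumption; any OCR engine that emits word‑level boxes would work.

```python
def anchor_answer(answer, ocr_tokens):
    """Return the union box of consecutive OCR tokens matching the answer,
    or None if the answer text is not found verbatim in the OCR output."""
    words = answer.lower().split()
    texts = [t["text"].lower() for t in ocr_tokens]
    # Slide over the OCR word sequence looking for an exact run match.
    for i in range(len(texts) - len(words) + 1):
        if texts[i:i + len(words)] == words:
            boxes = [t["box"] for t in ocr_tokens[i:i + len(words)]]
            return [
                min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes),
            ]
    return None

# Hypothetical OCR output for a one-line form field.
ocr = [
    {"text": "Name:", "box": [0, 0, 40, 10]},
    {"text": "John", "box": [45, 0, 75, 10]},
    {"text": "Doe", "box": [80, 0, 105, 10]},
]
print(anchor_answer("John Doe", ocr))
```

A production version would need fuzzy matching (OCR errors, hyphenation, normalization of currency and dates), but even this exact‑match rule gives a deterministic fallback when the VLM's own box is unreliable.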

Limitations & Future Work

  • The benchmark focuses on English‑language documents; multilingual or right‑to‑left scripts remain untested.
  • Synthetic data, despite human validation, may not capture all the visual noise (stamps, handwritten notes) found in legacy scanned archives.
  • The current evaluation treats each query independently; real‑world workflows often involve multiple interdependent queries, an aspect the authors plan to explore.
  • Future work includes extending the schema language to support nested relations (e.g., line‑item tables) and integrating a few‑shot adaptation protocol to better simulate on‑the‑fly schema changes.

Authors

  • Mathieu Sibue
  • Andres Muñoz Garza
  • Samuel Mensah
  • Pranav Shetty
  • Zhiqiang Ma
  • Xiaomo Liu
  • Manuela Veloso

Paper Information

  • arXiv ID: 2602.12203v1
  • Categories: cs.CL
  • Published: February 12, 2026
