[Paper] ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Published: February 12, 2026
Source: arXiv - 2602.12203v1

Overview

The paper introduces ExStrucTiny, a new benchmark that pushes Vision‑Language Models (VLMs) to extract structured information from a wide variety of document images—think invoices, tax forms, medical reports, and more. By blending key‑entity extraction (KEE), relation extraction (RE), and visual question answering (VQA) into a single testbed, the authors expose how current “generalist” VLMs struggle when asked to adapt to changing schemas, ambiguous queries, and the need to point to exact locations in the document.

Key Contributions

  • A unified benchmark that combines KEE, RE, and VQA tasks on real‑world document images, covering many document types and flexible output schemas.
  • A hybrid data‑creation pipeline that mixes manually annotated samples with high‑quality synthetic data validated by humans, yielding a diverse yet reliable dataset.
  • Comprehensive evaluation of both open‑source and closed‑source VLMs, revealing concrete failure modes such as schema‑drift, under‑specified queries, and poor answer localization.
  • Diagnostic analysis tools (schema‑adaptation metrics, query‑specific difficulty scores) that help researchers pinpoint where models fall short.
  • Open‑source release of the dataset, evaluation scripts, and baseline results to foster reproducibility and community‑driven improvements.

Methodology

  1. Data Collection & Synthesis

    • Gathered a set of publicly available enterprise documents (forms, reports, receipts).
    • Designed a schema‑variable template language that can express arbitrary field hierarchies (e.g., Customer → Address → Zip).
    • Generated synthetic documents by programmatically populating these templates with realistic content, then had human annotators verify correctness.
  2. Task Unification

    • Each sample presents a query (natural‑language or schema‑driven) and expects a structured answer (key‑value pairs, relational triples, or a JSON‑like object).
    • The benchmark also asks the model to localize each answer element on the image (bounding box or polygon).
  3. Model Evaluation

    • Tested several VLMs (e.g., LayoutLMv3, Donut, GPT‑4V) in two modes:
      Closed‑book: model sees only the image and query.
      Open‑book: model receives an explicit schema definition as additional context.
    • Metrics include exact‑match on the structured output, schema‑adaptation score, and Intersection‑over‑Union (IoU) for localization.
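As a concrete illustration of steps 1 and 2, here is a minimal Python sketch of how a schema‑variable template might be expressed and populated, and what a unified benchmark sample could look like. All field names, filler values, and the record format are illustrative assumptions, not the paper's actual schema language.

```python
import json
import random

# Hypothetical schema-variable template: nested field hierarchies
# (e.g., Customer -> Address -> Zip) expressed as a dict whose leaves
# name a filler type used to populate synthetic content.
SCHEMA = {
    "Customer": {
        "Name": "name",
        "Address": {"Street": "street", "Zip": "zip"},
    },
    "InvoiceTotal": "amount",
}

# Placeholder content generators (an assumption; a real pipeline would
# draw from realistic entity pools).
FILLERS = {
    "name": lambda: random.choice(["Acme Corp", "Jane Doe"]),
    "street": lambda: f"{random.randint(1, 999)} Main St",
    "zip": lambda: f"{random.randint(10000, 99999)}",
    "amount": lambda: f"${random.uniform(10, 5000):.2f}",
}

def populate(schema: dict) -> dict:
    """Recursively fill a schema template with synthetic values."""
    return {
        key: populate(value) if isinstance(value, dict) else FILLERS[value]()
        for key, value in schema.items()
    }

# A unified sample pairs a query with a structured answer and a
# localization target (illustrative format, not the paper's).
answer = populate(SCHEMA)
sample = {
    "image": "synthetic_invoice_0001.png",
    "query": "Extract the customer's zip code.",
    "answer": {"Customer": {"Address": {"Zip": answer["Customer"]["Address"]["Zip"]}}},
    "localization": {"Customer.Address.Zip": [412, 188, 468, 204]},  # x0, y0, x1, y1
}
print(json.dumps(sample, indent=2))
```

In this sketch the same template drives both synthesis and evaluation: changing `SCHEMA` changes the expected output structure, which is exactly the schema variability the benchmark stresses.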

Results & Findings

  • Overall performance is modest: even the strongest closed‑book VLM achieved only ~38 % exact‑match, dropping to ~22 % when the schema deviated from the training distribution.
  • Open‑book models improve (up to ~45 % exact‑match) but still falter on under‑specified queries where the model must infer missing context.
  • Localization is the weakest link: average IoU hovers around 0.31, indicating that models can often name the right field but cannot reliably point to it on the page.
  • Schema variability hurts: models trained on a fixed set of fields see a steep accuracy decline (≈15 % absolute) when presented with a new schema layout.
  • Human‑validated synthetic data bridges gaps: adding synthetic samples boosts performance by ~6 % on unseen document types, showing the value of scalable data augmentation.
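The IoU figures above refer to the standard Intersection‑over‑Union between a predicted and a ground‑truth region; for axis‑aligned boxes in `(x0, y0, x1, y1)` format (an assumption; the benchmark also supports polygons), it can be computed as:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Intersection rectangle; clamp to zero when the boxes are disjoint.
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

An average IoU of 0.31 means predicted boxes typically overlap well under half of the union area with the ground truth, so they often only clip the correct field.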

Practical Implications

  • Enterprise automation pipelines (e.g., invoice processing, claim triage) need models that can flexibly adapt to new form layouts without re‑training; ExStrucTiny highlights the current gap and offers a testbed for iterative improvement.
  • Low‑code/no‑code document AI platforms can use ExStrucTiny to evaluate their “schema‑as‑prompt” features, ensuring that end‑users can define custom extraction schemas on the fly.
  • Developers building VLM‑powered bots (e.g., chat assistants that answer questions about PDFs) should be aware that answer localization is still unreliable—additional post‑processing (OCR + rule‑based anchoring) may be required.
  • Data‑centric AI teams can leverage the synthetic‑plus‑human pipeline to cheaply expand their training corpora for niche document types (e.g., customs declarations) while maintaining quality.
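The OCR‑plus‑rule‑based anchoring mentioned above can be sketched as follows: snap the model's textual answer to matching word‑level OCR boxes and return their union. The token format (word text plus `(x0, y0, x1, y1)` box) is an assumption; any OCR engine that emits word‑level boxes would work.

```python
def anchor_answer(answer, ocr_tokens):
    """Return the union box of consecutive OCR tokens matching the answer,
    or None if the answer text is not found verbatim in the OCR output."""
    words = answer.lower().split()
    texts = [t["text"].lower() for t in ocr_tokens]
    # Slide over the OCR word sequence looking for an exact run match.
    for i in range(len(texts) - len(words) + 1):
        if texts[i:i + len(words)] == words:
            boxes = [t["box"] for t in ocr_tokens[i:i + len(words)]]
            return [
                min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes),
            ]
    return None

# Hypothetical OCR output for a one-line form field.
ocr = [
    {"text": "Name:", "box": [0, 0, 40, 10]},
    {"text": "John", "box": [45, 0, 75, 10]},
    {"text": "Doe", "box": [80, 0, 105, 10]},
]
print(anchor_answer("John Doe", ocr))
```

A production version would need fuzzy matching (OCR errors, hyphenation, normalization of currency and dates), but even this exact‑match rule gives a deterministic fallback when the VLM's own box is unreliable.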

Limitations & Future Work

  • The benchmark focuses on English‑language documents; multilingual or right‑to‑left scripts remain untested.
  • Synthetic data, despite human validation, may not capture all the visual noise (stamps, handwritten notes) found in legacy scanned archives.
  • The current evaluation treats each query independently; real‑world workflows often involve multiple interdependent queries, an aspect the authors plan to explore.
  • Future work includes extending the schema language to support nested relations (e.g., line‑item tables) and integrating a few‑shot adaptation protocol to better simulate on‑the‑fly schema changes.

Authors

  • Mathieu Sibue
  • Andres Muñoz Garza
  • Samuel Mensah
  • Pranav Shetty
  • Zhiqiang Ma
  • Xiaomo Liu
  • Manuela Veloso

Paper Information

  • arXiv ID: 2602.12203v1
  • Categories: cs.CL
  • Published: February 12, 2026
