[Paper] ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction
Source: arXiv - 2602.12247v1
Overview
The paper introduces ExtractBench, an open‑source benchmark and evaluation framework that measures how well large language models (LLMs) can turn PDF documents into structured JSON data. By pairing real‑world PDFs with detailed JSON schemas and human‑verified ground truth, the authors expose a critical weakness: even the most advanced LLMs (GPT‑5, Gemini‑3, Claude 4.5) struggle to reliably produce correct, schema‑conformant output when the target schema grows beyond a few dozen fields.
Key Contributions
- End‑to‑end PDF‑to‑JSON benchmark covering 35 PDFs from high‑value domains (finance, legal, medical, etc.) with 12,867 annotated fields.
- Enterprise‑scale schemas ranging from ~20 to 369 fields, reflecting the complexity of real production pipelines.
- Executable schema specifications: each field declares its own scoring metric (exact match, tolerance‑based, semantic equivalence, array alignment, omission vs. hallucination).
- Comprehensive evaluation suite for frontier LLMs (GPT‑5/5.2, Gemini‑3 Flash/Pro, Claude 4.5 Opus/Sonnet).
- Open‑source release (code, data, and evaluation scripts) at https://github.com/ContextualAI/extract-bench, enabling reproducibility and community extensions.
Methodology
- Document & Schema Curation – The team collected 35 PDFs typical of enterprise workflows (e.g., quarterly financial reports, legal contracts). For each PDF they authored a JSON Schema capturing every piece of information a downstream system would need.
- Human Annotation – Expert annotators extracted the ground‑truth values for every schema field, producing a gold‑standard JSON file.
- Scoring Specification – Each schema field includes a metric descriptor:
- Exact for identifiers (e.g., invoice numbers).
- Tolerance for numeric quantities (e.g., “$1,234.56 ± $0.01”).
- Semantic for names or categories (e.g., “Acme Corp” ≈ “Acme Corporation”).
- Array alignment for ordered lists (e.g., line‑item tables).
- Omission vs. hallucination flags to separate missing data from fabricated values.
- Model Prompting – Standardized prompts ask the LLM to read the PDF (via OCR or built‑in PDF support) and output a JSON object that conforms to the supplied schema.
- Evaluation Engine – The framework parses the model’s JSON, validates it against the schema, and applies the per‑field metric to compute precision, recall, and an overall valid‑output rate (the percentage of runs that satisfy all field constraints).
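The per-field metric descriptors above can be pictured as a small dispatch function. The sketch below is illustrative only (the function name, descriptor keys, and thresholds are assumptions, not the ExtractBench API); array alignment is omitted for brevity, and a simple string-similarity ratio stands in for true semantic equivalence:

```python
# Hypothetical sketch of per-field metric dispatch; names and thresholds
# are illustrative, not the actual ExtractBench implementation.
from difflib import SequenceMatcher

def score_field(spec, expected, predicted):
    """Score one field according to its declared metric descriptor."""
    # Omissions (missing data) and hallucinations (fabricated values)
    # are flagged separately from plainly incorrect values.
    if expected is not None and predicted is None:
        return {"status": "omission", "score": 0.0}
    if expected is None and predicted is not None:
        return {"status": "hallucination", "score": 0.0}
    if expected is None and predicted is None:
        return {"status": "correct", "score": 1.0}

    metric = spec.get("metric", "exact")
    if metric == "exact":        # identifiers, e.g. invoice numbers
        ok = str(expected) == str(predicted)
    elif metric == "tolerance":  # numeric quantities, e.g. $1,234.56 ± $0.01
        ok = abs(float(expected) - float(predicted)) <= spec.get("tolerance", 0.01)
    elif metric == "semantic":   # crude stand-in for semantic equivalence
        ok = SequenceMatcher(None, str(expected).lower(),
                             str(predicted).lower()).ratio() >= 0.7
    else:
        raise ValueError(f"unknown metric: {metric}")
    return {"status": "correct" if ok else "incorrect", "score": float(ok)}
```

With a descriptor like `{"metric": "tolerance", "tolerance": 0.01}`, the values `1234.56` and `1234.565` would score as a match, while an exact-match identifier field would not forgive any difference.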
Results & Findings
| Model | Avg. Field Accuracy | Valid‑Output Rate (≤50‑field schema) | Valid‑Output Rate (369‑field schema) |
|---|---|---|---|
| GPT‑5 / 5.2 | 71 % | 38 % | 0 % |
| Gemini‑3 Flash | 68 % | 34 % | 0 % |
| Gemini‑3 Pro | 73 % | 41 % | 0 % |
| Claude 4.5 Opus | 69 % | 36 % | 0 % |
| Claude 4.5 Sonnet | 65 % | 32 % | 0 % |
- Sharp degradation with schema breadth – Accuracy drops roughly 0.2 % per additional field, and the chance of producing a fully valid JSON object collapses once the schema exceeds ~200 fields.
- Error patterns – Models frequently hallucinate fields that are not present, mis‑align array items, and mishandle tolerance‑based numeric comparisons.
- No model achieved reliable end‑to‑end extraction for the largest financial reporting schema (369 fields), highlighting a gap between research‑grade LLM capabilities and production‑grade data pipelines.
Practical Implications
- Enterprise Automation Caution – Companies looking to replace custom parsers with LLM‑based extraction should not assume “out‑of‑the‑box” reliability, especially for large, nested schemas common in finance, compliance, and healthcare.
- Prompt Engineering & Post‑Processing – The findings suggest that robust pipelines will need layered validation, fallback parsers, or human‑in‑the‑loop checks to catch omissions and hallucinations.
- Benchmark‑Driven Development – ExtractBench gives product teams a concrete yardstick to measure improvements when fine‑tuning models, adding domain‑specific prompts, or integrating OCR enhancements.
- Standardized Schema Contracts – By treating the schema as an executable contract, developers can embed the same validation logic directly into their services, turning schema errors into actionable alerts rather than silent data corruption.
- Open‑source Community – The released benchmark can serve as a shared testbed for new LLMs, retrieval‑augmented generation (RAG) pipelines, or specialized extraction tools, accelerating convergence on reliable PDF‑to‑JSON solutions.
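The "schema as executable contract" idea can be sketched with a minimal validator that turns schema violations into readable error messages rather than silent data corruption. This is a stdlib-only illustration under assumed field names (a production service would use a full JSON Schema validator library); it checks only `required` fields and basic `type` declarations:

```python
import json

# Minimal illustration of treating a JSON Schema as an executable contract.
# Checks only "required" and basic "type" keywords; a real pipeline would
# delegate to a complete JSON Schema validator.
def validate_contract(schema, instance, path="$"):
    """Yield human-readable violations of required fields and basic types."""
    type_map = {"object": dict, "array": list, "string": str,
                "number": (int, float), "integer": int, "boolean": bool}
    expected = schema.get("type")
    if expected and not isinstance(instance, type_map[expected]):
        yield f"{path}: expected {expected}, got {type(instance).__name__}"
        return
    if expected == "object":
        for field in schema.get("required", []):
            if field not in instance:
                yield f"{path}.{field}: required field missing"
        for field, subschema in schema.get("properties", {}).items():
            if field in instance:
                yield from validate_contract(subschema, instance[field],
                                             f"{path}.{field}")

# Hypothetical invoice-style schema for demonstration.
schema = {
    "type": "object",
    "required": ["invoice_number", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}
model_output = json.loads('{"invoice_number": "INV-7", "total": "oops"}')
errors = list(validate_contract(schema, model_output))
```

Here `errors` pinpoints that `total` arrived as a string instead of a number, the kind of actionable alert the paper argues should replace silently corrupted downstream data.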
Limitations & Future Work
- Domain Coverage – While the benchmark spans several high‑value sectors, it still omits niche domains (e.g., scientific literature, engineering drawings) that may exhibit different extraction challenges.
- OCR Dependency – The study assumes a reasonably accurate OCR front‑end; errors introduced before the LLM sees the text are not isolated in the reported metrics.
- Static Prompts – Only a single prompting style was evaluated per model; more sophisticated chain‑of‑thought or tool‑use prompts could improve performance but were not explored.
- Scalability of Human Annotation – Scaling the gold‑standard creation to thousands of PDFs would be costly; future work could investigate semi‑automated labeling or weak supervision to expand the benchmark.
- Model‑Specific Fine‑Tuning – The paper evaluates zero‑shot LLMs; fine‑tuning or instruction‑tuning on extraction tasks may close the performance gap, a promising direction for follow‑up research.
Authors
- Nick Ferguson
- Josh Pennington
- Narek Beghian
- Aravind Mohan
- Douwe Kiela
- Sheshansh Agrawal
- Thien Hang Nguyen
Paper Information
- arXiv ID: 2602.12247v1
- Categories: cs.LG, cs.AI
- Published: February 12, 2026