[Paper] ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction
Source: arXiv - 2602.12247v1
Overview
The paper introduces ExtractBench, an open‑source benchmark and evaluation framework that measures how well large language models (LLMs) can turn PDF documents into structured JSON data. By pairing real‑world PDFs with detailed JSON schemas and human‑verified ground truth, the authors expose a critical weakness: even the most advanced LLMs (GPT‑5, Gemini‑3, Claude 4.5) struggle to reliably produce correct, schema‑conformant output when the target schema grows beyond a few dozen fields.
Key Contributions
- End‑to‑end PDF‑to‑JSON benchmark covering 35 PDFs from high‑value domains (finance, legal, medical, etc.) with 12,867 annotated fields.
- Enterprise‑scale schemas ranging from ~20 to 369 fields, reflecting the complexity of real production pipelines.
- Executable schema specifications: each field declares its own scoring metric (exact match, tolerance‑based, semantic equivalence, array alignment, omission vs. hallucination).
- Comprehensive evaluation suite for frontier LLMs (GPT‑5/5.2, Gemini‑3 Flash/Pro, Claude 4.5 Opus/Sonnet).
- Open‑source release (code, data, and evaluation scripts) at https://github.com/ContextualAI/extract-bench, enabling reproducibility and community extensions.
Methodology
- Document & Schema Curation – The team collected 35 PDFs typical of enterprise workflows (e.g., quarterly financial reports, legal contracts). For each PDF they authored a JSON Schema capturing every piece of information a downstream system would need.
- Human Annotation – Expert annotators extracted the ground‑truth values for every schema field, producing a gold‑standard JSON file.
- Scoring Specification – Each schema field includes a metric descriptor:
- Exact for identifiers (e.g., invoice numbers).
- Tolerance for numeric quantities (e.g., “$1,234.56 ± $0.01”).
- Semantic for names or categories (e.g., “Acme Corp” ≈ “Acme Corporation”).
- Array alignment for ordered lists (e.g., line‑item tables).
- Omission vs. hallucination flags to separate missing data from fabricated values.
- Model Prompting – Standardized prompts ask the LLM to read the PDF (via OCR or built‑in PDF support) and output a JSON object that conforms to the supplied schema.
- Evaluation Engine – The framework parses the model’s JSON, validates it against the schema, and applies the per‑field metric to compute precision, recall, and an overall valid‑output rate (the percentage of runs that satisfy all field constraints).
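The per-field metric descriptors above can be pictured as a small dispatch function. The sketch below is illustrative only (the function name, descriptor keys, and thresholds are assumptions, not the ExtractBench API); array alignment is omitted for brevity, and a simple string-similarity ratio stands in for true semantic equivalence:

```python
# Hypothetical sketch of per-field metric dispatch; names and thresholds
# are illustrative, not the actual ExtractBench implementation.
from difflib import SequenceMatcher

def score_field(spec, expected, predicted):
    """Score one field according to its declared metric descriptor."""
    # Omissions (missing data) and hallucinations (fabricated values)
    # are flagged separately from plainly incorrect values.
    if expected is not None and predicted is None:
        return {"status": "omission", "score": 0.0}
    if expected is None and predicted is not None:
        return {"status": "hallucination", "score": 0.0}
    if expected is None and predicted is None:
        return {"status": "correct", "score": 1.0}

    metric = spec.get("metric", "exact")
    if metric == "exact":        # identifiers, e.g. invoice numbers
        ok = str(expected) == str(predicted)
    elif metric == "tolerance":  # numeric quantities, e.g. $1,234.56 ± $0.01
        ok = abs(float(expected) - float(predicted)) <= spec.get("tolerance", 0.01)
    elif metric == "semantic":   # crude stand-in for semantic equivalence
        ok = SequenceMatcher(None, str(expected).lower(),
                             str(predicted).lower()).ratio() >= 0.7
    else:
        raise ValueError(f"unknown metric: {metric}")
    return {"status": "correct" if ok else "incorrect", "score": float(ok)}
```

With a descriptor like `{"metric": "tolerance", "tolerance": 0.01}`, the values `1234.56` and `1234.565` would score as a match, while an exact-match identifier field would not forgive any difference.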
Results & Findings
| Model | Avg. Field Accuracy | Valid‑Output Rate (≤50‑field schema) | Valid‑Output Rate (369‑field schema) |
|---|---|---|---|
| GPT‑5 / 5.2 | 71 % | 38 % | 0 % |
| Gemini‑3 Flash | 68 % | 34 % | 0 % |
| Gemini‑3 Pro | 73 % | 41 % | 0 % |
| Claude 4.5 Opus | 69 % | 36 % | 0 % |
| Claude 4.5 Sonnet | 65 % | 32 % | 0 % |
- Sharp degradation with schema breadth – Accuracy drops roughly 0.2 % per additional field, and the chance of producing a fully valid JSON object collapses once the schema exceeds ~200 fields.
- Error patterns – Models frequently hallucinate fields that are not present, mis‑align array items, and mishandle tolerance‑based numeric comparisons.
- No model achieved reliable end‑to‑end extraction for the largest financial reporting schema (369 fields), highlighting a gap between research‑grade LLM capabilities and production‑grade data pipelines.
Practical Implications
- Enterprise Automation Caution – Companies looking to replace custom parsers with LLM‑based extraction should not assume “out‑of‑the‑box” reliability, especially for large, nested schemas common in finance, compliance, and healthcare.
- Prompt Engineering & Post‑Processing – The findings suggest that robust pipelines will need layered validation, fallback parsers, or human‑in‑the‑loop checks to catch omissions and hallucinations.
- Benchmark‑Driven Development – ExtractBench gives product teams a concrete yardstick to measure improvements when fine‑tuning models, adding domain‑specific prompts, or integrating OCR enhancements.
- Standardized Schema Contracts – By treating the schema as an executable contract, developers can embed the same validation logic directly into their services, turning schema errors into actionable alerts rather than silent data corruption.
- Open‑source Community – The released benchmark can serve as a shared testbed for new LLMs, retrieval‑augmented generation (RAG) pipelines, or specialized extraction tools, accelerating convergence on reliable PDF‑to‑JSON solutions.
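The "schema as executable contract" idea can be sketched with a minimal validator that turns schema violations into readable error messages rather than silent data corruption. This is a stdlib-only illustration under assumed field names (a production service would use a full JSON Schema validator library); it checks only `required` fields and basic `type` declarations:

```python
import json

# Minimal illustration of treating a JSON Schema as an executable contract.
# Checks only "required" and basic "type" keywords; a real pipeline would
# delegate to a complete JSON Schema validator.
def validate_contract(schema, instance, path="$"):
    """Yield human-readable violations of required fields and basic types."""
    type_map = {"object": dict, "array": list, "string": str,
                "number": (int, float), "integer": int, "boolean": bool}
    expected = schema.get("type")
    if expected and not isinstance(instance, type_map[expected]):
        yield f"{path}: expected {expected}, got {type(instance).__name__}"
        return
    if expected == "object":
        for field in schema.get("required", []):
            if field not in instance:
                yield f"{path}.{field}: required field missing"
        for field, subschema in schema.get("properties", {}).items():
            if field in instance:
                yield from validate_contract(subschema, instance[field],
                                             f"{path}.{field}")

# Hypothetical invoice-style schema for demonstration.
schema = {
    "type": "object",
    "required": ["invoice_number", "total"],
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
}
model_output = json.loads('{"invoice_number": "INV-7", "total": "oops"}')
errors = list(validate_contract(schema, model_output))
```

Here `errors` pinpoints that `total` arrived as a string instead of a number, the kind of actionable alert the paper argues should replace silently corrupted downstream data.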
Limitations & Future Work
- Domain Coverage – While the benchmark spans several high‑value sectors, it still omits niche domains (e.g., scientific literature, engineering drawings) that may exhibit different extraction challenges.
- OCR Dependency – The study assumes a reasonably accurate OCR front‑end; errors introduced before the LLM sees the text are not isolated in the reported metrics.
- Static Prompts – Only a single prompting style was evaluated per model; more sophisticated chain‑of‑thought or tool‑use prompts could improve performance but were not explored.
- Scalability of Human Annotation – Scaling the gold‑standard creation to thousands of PDFs would be costly; future work could investigate semi‑automated labeling or weak supervision to expand the benchmark.
- Model‑Specific Fine‑Tuning – The paper evaluates zero‑shot LLMs; fine‑tuning or instruction‑tuning on extraction tasks may close the performance gap, a promising direction for follow‑up research.
Authors
- Nick Ferguson
- Josh Pennington
- Narek Beghian
- Aravind Mohan
- Douwe Kiela
- Sheshansh Agrawal
- Thien Hang Nguyen
Paper Information
- arXiv ID: 2602.12247v1
- Categories: cs.LG, cs.AI
- Published: February 12, 2026