[Paper] pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs

Published: January 5, 2026

Source: arXiv - 2601.02285v1

Overview

PDF documents are everywhere—from research papers to product manuals—yet most question‑answering (QA) datasets are built on plain text or narrowly scoped sources. The pdfQA paper introduces a new, large‑scale benchmark that captures the messiness of real‑world PDFs, offering both human‑annotated and synthetically generated QA pairs across ten difficulty dimensions. By doing so, it gives developers a realistic testbed for end‑to‑end PDF‑QA pipelines.

Key Contributions

  • Dual‑mode dataset: 2 K human‑annotated (real‑pdfQA) and 2 K synthetic (syn‑pdfQA) QA pairs, covering a wide variety of document types and domains.
  • Ten complexity dimensions (e.g., file format quirks, source modality, answer type, location in the file) that let you slice the data by difficulty; a filtering sketch follows this list.
  • Quality‑and‑difficulty filtering pipeline that automatically discards low‑quality or trivially easy pairs, ensuring a challenging benchmark.
  • Comprehensive evaluation of several open‑source large language models (LLMs) on the dataset, exposing concrete failure modes tied to the defined dimensions.
  • Open‑source release of the data, annotation guidelines, and evaluation scripts, enabling reproducible research and rapid integration into existing QA systems.
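As a concrete illustration of this kind of slicing, here is a minimal sketch that filters a hypothetical JSON export of real‑pdfQA down to a hard subset. The file name and field names (`complexity`, `file_type`, `source_modality`) are assumptions for illustration, not the released schema.

```python
import json

# Minimal sketch: the file name and field names below ("complexity",
# "file_type", "source_modality") are illustrative assumptions, not the
# actual schema of the released dataset.
with open("real_pdfqa.json") as f:
    examples = json.load(f)

# Keep only a notoriously hard slice: scanned PDFs whose answers sit in tables.
hard_slice = [
    ex for ex in examples
    if ex["complexity"]["file_type"] == "scanned"
    and ex["complexity"]["source_modality"] == "table"
]
print(f"{len(hard_slice)} of {len(examples)} examples fall in the hard slice")
```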

Methodology

  1. Data Collection

    • Real PDFs: Curated from 10 public domains (academic papers, product datasheets, legal contracts, etc.). Human annotators read each PDF and wrote natural‑language questions plus exact answer spans.
    • Synthetic PDFs: Generated by programmatically converting diverse source formats (HTML, Markdown, LaTeX) into PDFs, then automatically extracting text and creating QA pairs with a language model, followed by human verification.
  2. Complexity Annotation
    For every QA pair, annotators tagged ten attributes such as:

    • File type (vector vs. scanned image)
    • Source modality (text, table, figure caption)
    • Source position (header, footnote, body)
    • Answer type (numeric, boolean, span, multi‑span)
    This structured labeling lets researchers filter by specific challenges.
  3. Filtering Pipeline

    • Quality filter: Checks for answer‑question relevance, correct span alignment, and OCR confidence (for scanned PDFs).
    • Difficulty filter: Uses heuristic scores (e.g., length of answer, presence of tables/figures) to keep only pairs that are non‑trivial for current models; an illustrative scoring sketch follows this list.
  4. Model Evaluation
    Open‑source LLMs (e.g., Llama‑2‑13B, Mistral‑7B) were fine‑tuned on a generic QA corpus and then tested on pdfQA without any PDF‑specific preprocessing. Retrieval was performed using BM25 over extracted text, and the final answer was generated by the LLM. Performance was broken down by each complexity dimension. A minimal sketch of this retrieval‑then‑generate setup appears below.
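To make step 3 concrete, here is an illustrative sketch of a heuristic difficulty filter. The features, weights, and threshold are assumptions for illustration, not the paper's actual scoring function.

```python
# Illustrative difficulty heuristic: the features, weights, and threshold are
# assumptions, not the paper's actual scoring function.
def difficulty_score(answer: str, in_table: bool, in_figure: bool, scanned: bool) -> float:
    score = 0.0
    score += min(len(answer.split()), 20) / 20   # longer answers tend to be harder
    score += 1.0 if in_table else 0.0            # answers inside tables are harder
    score += 0.5 if in_figure else 0.0           # figure captions add difficulty
    score += 1.0 if scanned else 0.0             # OCR noise adds difficulty
    return score

def keep_pair(answer: str, in_table: bool, in_figure: bool, scanned: bool,
              threshold: float = 1.0) -> bool:
    """Drop QA pairs that are likely trivial for current models."""
    return difficulty_score(answer, in_table, in_figure, scanned) >= threshold
```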
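And a minimal sketch of the step‑4 retrieval‑then‑generate evaluation, using the `rank_bm25` package for BM25 scoring. The chunking strategy, prompt template, and the `llm` callable are placeholders, not the paper's exact configuration.

```python
from rank_bm25 import BM25Okapi

def chunk(text: str, size: int = 200) -> list[str]:
    """Split extracted PDF text into naive fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def answer_question(pdf_text: str, question: str, llm) -> str:
    passages = chunk(pdf_text)
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    scores = bm25.get_scores(question.lower().split())
    # Keep the top-3 passages as context for the LLM.
    top = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)[:3]
    context = "\n\n".join(passages[i] for i in top)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # `llm` is a placeholder callable (e.g., a wrapper around Llama-2 or Mistral).
    return llm(prompt)
```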

Results & Findings

| Model | Exact Match (EM) | F1 | Drop vs. plain‑text QA |
| --- | --- | --- | --- |
| Llama‑2‑13B | 31.2 % | 44.8 % | –12 pp |
| Mistral‑7B | 28.9 % | 42.1 % | –15 pp |
  • Hardest dimensions: Scanned image PDFs, answers embedded in tables, and multi‑span answers caused the steepest performance drops.
  • Retrieval bottleneck: BM25 struggled with layout‑aware queries (e.g., “What is the value in the second column of Table 3?”), leading to low recall.
  • Parsing errors: OCR mis‑recognitions in scanned PDFs accounted for ~30 % of the failures, even before the LLM saw the text.
  • Model scale: Larger models showed modest gains on complex answer types but still fell short of human performance (≈78 % EM on the human‑annotated set).
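For reference, the EM and F1 numbers in the table above follow the standard SQuAD‑style span metrics; a minimal sketch (omitting the article stripping used in the official normalization):

```python
from collections import Counter

def normalize(s: str) -> list[str]:
    # Lowercase and strip punctuation before tokenizing (SQuAD-style).
    return "".join(c if c.isalnum() or c.isspace() else " " for c in s.lower()).split()

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```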

Practical Implications

  • End‑to‑end pipeline testing: pdfQA lets engineers benchmark every stage—OCR, layout parsing, retrieval, and LLM inference—under realistic conditions.
  • Targeted improvements: By slicing the benchmark along the ten dimensions, teams can prioritize fixes (e.g., better table extraction or OCR post‑processing) that yield the biggest accuracy gains; a breakdown sketch follows this list.
  • Productization: Companies building AI assistants for technical documentation, legal contracts, or scientific literature can use pdfQA to validate that their system works not just on clean HTML but on the messy PDFs their users actually upload.
  • Fine‑tuning data: The synthetic portion provides a scalable source of diverse PDF QA pairs for domain‑specific model adaptation without the cost of massive human annotation.
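Here is a sketch of the dimension‑wise breakdown mentioned under "Targeted improvements", assuming each model prediction has already been paired with its complexity tags; the field names are illustrative assumptions.

```python
from collections import defaultdict

# results: list of dicts such as
#   {"dimension": {"source_modality": "table", "file_type": "scanned", ...},
#    "correct": True}
# The field names are illustrative assumptions about the data layout.
def accuracy_by(results: list[dict], dimension: str) -> dict[str, float]:
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        value = r["dimension"][dimension]
        totals[value] += 1
        hits[value] += int(r["correct"])
    return {v: hits[v] / totals[v] for v in totals}

# Example: find the weakest source modality to prioritize engineering work.
# worst = min(accuracy_by(results, "source_modality").items(), key=lambda kv: kv[1])
```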

Limitations & Future Work

  • Scale: At ~4 K QA pairs, pdfQA is still modest compared to massive web‑scale QA corpora; larger datasets would better capture long‑tail PDF quirks.
  • Domain coverage: Although multi‑domain, some high‑risk sectors (e.g., medical records, financial statements) are under‑represented.
  • Retrieval baseline: The study used a simple BM25 retriever; future work could explore neural dense retrieval or multimodal indexing that respects visual layout.
  • Dynamic PDFs: Interactive or encrypted PDFs were excluded; handling these formats remains an open challenge.

By exposing the hidden complexities of PDF‑based question answering, pdfQA offers a practical roadmap for developers aiming to build robust, real‑world document AI systems.

Authors

  • Tobias Schimanski
  • Imene Kolli
  • Jingwei Ni
  • Yu Fan
  • Ario Saeid Vaghefi
  • Elliott Ash
  • Markus Leippold

Paper Information

  • arXiv ID: 2601.02285v1
  • Categories: cs.CL, cs.AI
  • Published: January 5, 2026