[Paper] pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs

Published: January 5, 2026

Source: arXiv - 2601.02285v1

Overview

PDF documents are everywhere—from research papers to product manuals—yet most question‑answering (QA) datasets are built on plain text or narrowly scoped sources. The pdfQA paper introduces a new, large‑scale benchmark that captures the messiness of real‑world PDFs, offering both human‑annotated and synthetically generated QA pairs across ten difficulty dimensions. By doing so, it gives developers a realistic testbed for end‑to‑end PDF‑QA pipelines.

Key Contributions

  • Dual‑mode dataset: 2 K human‑annotated (real‑pdfQA) and 2 K synthetic (syn‑pdfQA) QA pairs, covering a wide variety of document types and domains.
  • Ten complexity dimensions (e.g., file format quirks, source modality, answer type, location in the file) that let you slice the data by difficulty; a filtering sketch follows this list.
  • Quality‑and‑difficulty filtering pipeline that automatically discards low‑quality or trivially easy pairs, ensuring a challenging benchmark.
  • Comprehensive evaluation of several open‑source large language models (LLMs) on the dataset, exposing concrete failure modes tied to the defined dimensions.
  • Open‑source release of the data, annotation guidelines, and evaluation scripts, enabling reproducible research and rapid integration into existing QA systems.
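As a concrete illustration of this kind of slicing, here is a minimal sketch that filters a hypothetical JSON export of real‑pdfQA down to a hard subset. The file name and field names (`complexity`, `file_type`, `source_modality`) are assumptions for illustration, not the released schema.

```python
import json

# Minimal sketch: the file name and field names below ("complexity",
# "file_type", "source_modality") are illustrative assumptions, not the
# actual schema of the released dataset.
with open("real_pdfqa.json") as f:
    examples = json.load(f)

# Keep only a notoriously hard slice: scanned PDFs whose answers sit in tables.
hard_slice = [
    ex for ex in examples
    if ex["complexity"]["file_type"] == "scanned"
    and ex["complexity"]["source_modality"] == "table"
]
print(f"{len(hard_slice)} of {len(examples)} examples fall in the hard slice")
```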

Methodology

  1. Data Collection

    • Real PDFs: Curated from 10 public domains (academic papers, product datasheets, legal contracts, etc.). Human annotators read each PDF and wrote natural‑language questions plus exact answer spans.
    • Synthetic PDFs: Generated by programmatically converting diverse source formats (HTML, Markdown, LaTeX) into PDFs, then automatically extracting text and creating QA pairs with a language model, followed by human verification.
  2. Complexity Annotation
    For every QA pair, annotators tagged ten attributes such as:

    • File type (vector vs. scanned image)
    • Source modality (text, table, figure caption)
    • Source position (header, footnote, body)
    • Answer type (numeric, boolean, span, multi‑span)
    This structured labeling lets researchers filter by specific challenges.
  3. Filtering Pipeline

    • Quality filter: Checks for answer‑question relevance, correct span alignment, and OCR confidence (for scanned PDFs).
    • Difficulty filter: Uses heuristic scores (e.g., length of answer, presence of tables/figures) to keep only pairs that are non‑trivial for current models; an illustrative scoring sketch follows this list.
  4. Model Evaluation
    Open‑source LLMs (e.g., Llama‑2‑13B, Mistral‑7B) were fine‑tuned on a generic QA corpus and then tested on pdfQA without any PDF‑specific preprocessing. Retrieval was performed using BM25 over extracted text, and the final answer was generated by the LLM. Performance was broken down by each complexity dimension. A minimal sketch of this retrieval‑then‑generate setup appears below.
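To make step 3 concrete, here is an illustrative sketch of a heuristic difficulty filter. The features, weights, and threshold are assumptions for illustration, not the paper's actual scoring function.

```python
# Illustrative difficulty heuristic: the features, weights, and threshold are
# assumptions, not the paper's actual scoring function.
def difficulty_score(answer: str, in_table: bool, in_figure: bool, scanned: bool) -> float:
    score = 0.0
    score += min(len(answer.split()), 20) / 20   # longer answers tend to be harder
    score += 1.0 if in_table else 0.0            # answers inside tables are harder
    score += 0.5 if in_figure else 0.0           # figure captions add difficulty
    score += 1.0 if scanned else 0.0             # OCR noise adds difficulty
    return score

def keep_pair(answer: str, in_table: bool, in_figure: bool, scanned: bool,
              threshold: float = 1.0) -> bool:
    """Drop QA pairs that are likely trivial for current models."""
    return difficulty_score(answer, in_table, in_figure, scanned) >= threshold
```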
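And a minimal sketch of the step‑4 retrieval‑then‑generate evaluation, using the `rank_bm25` package for BM25 scoring. The chunking strategy, prompt template, and the `llm` callable are placeholders, not the paper's exact configuration.

```python
from rank_bm25 import BM25Okapi

def chunk(text: str, size: int = 200) -> list[str]:
    """Split extracted PDF text into naive fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def answer_question(pdf_text: str, question: str, llm) -> str:
    passages = chunk(pdf_text)
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    scores = bm25.get_scores(question.lower().split())
    # Keep the top-3 passages as context for the LLM.
    top = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)[:3]
    context = "\n\n".join(passages[i] for i in top)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # `llm` is a placeholder callable (e.g., a wrapper around Llama-2 or Mistral).
    return llm(prompt)
```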

Results & Findings

| Model | Exact Match (EM) | F1 | Drop vs. plain‑text QA |
| --- | --- | --- | --- |
| Llama‑2‑13B | 31.2 % | 44.8 % | –12 pp |
| Mistral‑7B | 28.9 % | 42.1 % | –15 pp |
  • Hardest dimensions: Scanned image PDFs, answers embedded in tables, and multi‑span answers caused the steepest performance drops.
  • Retrieval bottleneck: BM25 struggled with layout‑aware queries (e.g., “What is the value in the second column of Table 3?”), leading to low recall.
  • Parsing errors: OCR mis‑recognitions in scanned PDFs accounted for ~30 % of the failures, even before the LLM saw the text.
  • Model scale: Larger models showed modest gains on complex answer types but still fell short of human performance (≈78 % EM on the human‑annotated set).
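For reference, the EM and F1 numbers in the table above follow the standard SQuAD‑style span metrics; a minimal sketch (omitting the article stripping used in the official normalization):

```python
from collections import Counter

def normalize(s: str) -> list[str]:
    # Lowercase and strip punctuation before tokenizing (SQuAD-style).
    return "".join(c if c.isalnum() or c.isspace() else " " for c in s.lower()).split()

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1(pred: str, gold: str) -> float:
    p, g = normalize(pred), normalize(gold)
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```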

Practical Implications

  • End‑to‑end pipeline testing: pdfQA lets engineers benchmark every stage—OCR, layout parsing, retrieval, and LLM inference—under realistic conditions.
  • Targeted improvements: By slicing the benchmark along the ten dimensions, teams can prioritize fixes (e.g., better table extraction or OCR post‑processing) that yield the biggest accuracy gains; a breakdown sketch follows this list.
  • Productization: Companies building AI assistants for technical documentation, legal contracts, or scientific literature can use pdfQA to validate that their system works not just on clean HTML but on the messy PDFs their users actually upload.
  • Fine‑tuning data: The synthetic portion provides a scalable source of diverse PDF QA pairs for domain‑specific model adaptation without the cost of massive human annotation.
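Here is a sketch of the dimension‑wise breakdown mentioned under "Targeted improvements", assuming each model prediction has already been paired with its complexity tags; the field names are illustrative assumptions.

```python
from collections import defaultdict

# results: list of dicts such as
#   {"dimension": {"source_modality": "table", "file_type": "scanned", ...},
#    "correct": True}
# The field names are illustrative assumptions about the data layout.
def accuracy_by(results: list[dict], dimension: str) -> dict[str, float]:
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        value = r["dimension"][dimension]
        totals[value] += 1
        hits[value] += int(r["correct"])
    return {v: hits[v] / totals[v] for v in totals}

# Example: find the weakest source modality to prioritize engineering work.
# worst = min(accuracy_by(results, "source_modality").items(), key=lambda kv: kv[1])
```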

Limitations & Future Work

  • Scale: At ~4 K QA pairs, pdfQA is still modest compared to massive web‑scale QA corpora; larger datasets would better capture long‑tail PDF quirks.
  • Domain coverage: Although multi‑domain, some high‑risk sectors (e.g., medical records, financial statements) are under‑represented.
  • Retrieval baseline: The study used a simple BM25 retriever; future work could explore neural dense retrieval or multimodal indexing that respects visual layout.
  • Dynamic PDFs: Interactive or encrypted PDFs were excluded; handling these formats remains an open challenge.

By exposing the hidden complexities of PDF‑based question answering, pdfQA offers a practical roadmap for developers aiming to build robust, real‑world document AI systems.

Authors

  • Tobias Schimanski
  • Imene Kolli
  • Jingwei Ni
  • Yu Fan
  • Ario Saeid Vaghefi
  • Elliott Ash
  • Markus Leippold

Paper Information

  • arXiv ID: 2601.02285v1
  • Categories: cs.CL, cs.AI
  • Published: January 5, 2026