[Paper] Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs

Published: December 10, 2025 at 01:01 PM EST
3 min read
Source: arXiv - 2512.09874v1

Overview

Parsing mathematical formulas from PDFs is a hidden bottleneck for anyone building scientific search engines, knowledge graphs, or training large language models on scholarly text. This paper presents a new, reproducible benchmark that evaluates how well modern PDF parsers can extract formulas, and it introduces a clever “LLM‑as‑judge” approach to assess the semantic correctness of the extracted LaTeX.

Key Contributions

  • Synthetic PDF benchmark – Generates PDFs with fully known LaTeX ground truth, allowing fine‑grained control over layout, font, and formula complexity (a minimal generation sketch follows this list).
  • LLM‑as‑judge evaluation – Uses large language models to score semantic similarity between extracted and reference formulas, validated against human judgments.
  • Two‑stage matching pipeline – Aligns parser outputs with ground‑truth formulas despite ordering and tokenization mismatches.
  • Comprehensive empirical study – Benchmarks 20+ state‑of‑the‑art PDF parsers (OCR‑based, vision‑language, rule‑based) on >2,000 formulas across 100 synthetic documents.
  • Open‑source release – All code, data, and evaluation scripts are publicly available (GitHub link in the paper).
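
The paper's generator is not reproduced in this summary; the sketch below only illustrates the idea in Python, assuming pdflatex is on the PATH and using a hypothetical pair of ground‑truth formulas (the actual benchmark also varies layouts, fonts, and surrounding text).

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical ground-truth formulas; the real corpus is far larger and
# systematically varies formula complexity.
FORMULAS = [
    r"E = mc^2",
    r"\int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}",
]

TEMPLATE = r"""\documentclass[twocolumn]{article}
\begin{document}
%s
\end{document}
"""


def build_synthetic_pdf(formulas, out_dir):
    """Compile a LaTeX document whose formulas are known exactly."""
    body = "\n\n".join(
        "Some surrounding text.\n\\begin{equation}\n" + f + "\n\\end{equation}"
        for f in formulas
    )
    src = Path(out_dir) / "doc.tex"
    src.write_text(TEMPLATE % body)
    # The compiled PDF is the benchmark document; `formulas` is its ground truth.
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", src.name],
        cwd=out_dir, check=True, capture_output=True,
    )
    return Path(out_dir) / "doc.pdf", formulas


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pdf_path, ground_truth = build_synthetic_pdf(FORMULAS, tmp)
        print(pdf_path, "with", len(ground_truth), "ground-truth formulas")
```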

Methodology

  1. Synthetic Document Generation – The authors programmatically create PDFs from LaTeX sources, varying column layouts, font sizes, and surrounding text. Because the source LaTeX is known, each formula has an exact ground‑truth representation.
  2. Parser Ingestion – Each PDF is fed to a suite of parsers. The parsers output either raw text, LaTeX snippets, or bounding‑box annotations.
  3. Two‑Stage Matching (a simplified matching sketch follows this list)
    • Stage 1: Rough alignment based on spatial proximity and token overlap.
    • Stage 2: Refined matching using edit‑distance and structural heuristics to handle reordered or split formulas.
  4. Semantic Scoring – An LLM (e.g., GPT‑4) receives a pair of formulas (extracted vs. ground truth) and returns a similarity score (0–1). The authors calibrated this scoring against a human study of 250 formula pairs (30 evaluators, 750 ratings). A minimal judging sketch appears below, after the pipeline note.
  5. Baseline Metrics – For comparison, they also compute CDM (character‑level distance metric) and plain text similarity (BLEU/ROUGE).
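
The paper's exact matching heuristics are not spelled out here; the following is a simplified sketch of the two‑stage idea, using token overlap as the coarse Stage‑1 signal (the spatial‑proximity cue is omitted) and character‑level edit similarity for Stage‑2 refinement.

```python
from difflib import SequenceMatcher


def tokenize(latex):
    """Very rough tokenizer: split on whitespace after dropping braces."""
    return latex.replace("{", " ").replace("}", " ").split()


def token_overlap(a, b):
    """Stage 1 signal: Jaccard overlap of token sets."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    return len(ta & tb) / max(len(ta | tb), 1)


def edit_similarity(a, b):
    """Stage 2 signal: normalized character-level similarity."""
    return SequenceMatcher(None, a, b).ratio()


def match_formulas(extracted, ground_truth, coarse_k=3):
    """Greedily match each extracted formula to its best reference."""
    matches = []
    for ext in extracted:
        # Stage 1: keep only the top-k references by token overlap.
        candidates = sorted(ground_truth,
                            key=lambda gt: token_overlap(ext, gt),
                            reverse=True)[:coarse_k]
        # Stage 2: refine with edit similarity to pick the final match.
        best = max(candidates, key=lambda gt: edit_similarity(ext, gt))
        matches.append((ext, best, edit_similarity(ext, best)))
    return matches


pairs = match_formulas(
    extracted=[r"E=mc^{2}", r"\int_0^\infty e^{-x^2} dx"],
    ground_truth=[r"E = mc^2",
                  r"\int_0^\infty e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2}"],
)
for ext, ref, score in pairs:
    print(f"{score:.2f}  {ext}  ->  {ref}")
```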

The pipeline is fully automated, making it easy to plug in new parsers or extend the synthetic corpus.
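
To make the judging step (step 4 above) concrete, here is a minimal sketch using the OpenAI Python client; the prompt wording, the model name, and the single‑number reply format are assumptions for illustration, not the authors' actual setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You compare two LaTeX formulas for semantic equivalence.
Reference: {ref}
Extracted: {ext}
Reply with a single number between 0 and 1, where 1 means mathematically identical."""


def judge_pair(ref: str, ext: str, model: str = "gpt-4o") -> float:
    """Ask an LLM to score the semantic similarity of an extracted formula."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(ref=ref, ext=ext)}],
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat an unparsable reply as a failed extraction


print(judge_pair(r"\frac{a}{b}", r"a/b"))
```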

Results & Findings

Correlation with human scores (Pearson r):

  • CDM: 0.34
  • Text similarity: ~0.00
  • LLM‑as‑judge: 0.78
  • Performance spread: The best specialized OCR model achieved ~68 % formula‑level accuracy, while generic vision‑language models hovered around 30 %. Classic rule‑based tools performed worst (<15 %).
  • Error patterns: Most failures stem from mis‑recognizing superscripts/subscripts, mishandling multi‑line equations, and breaking long formulas across columns.
  • Scalability: The LLM‑as‑judge approach scales linearly with the number of formulas and requires only a few API calls per pair, making it practical for large‑scale evaluations.
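
For readers who want to run the same kind of validation against their own human ratings, the correlation reported above is the ordinary Pearson coefficient; a minimal sketch with hypothetical per‑pair scores:

```python
from scipy.stats import pearsonr

# Hypothetical per-formula-pair scores: a metric's score and the mean
# human rating for the same extracted/reference pair.
judge_scores = [0.90, 0.20, 0.75, 1.00, 0.40]
human_scores = [0.85, 0.30, 0.80, 1.00, 0.35]

r, p_value = pearsonr(judge_scores, human_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```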

Practical Implications

  • Tool selection: Developers building pipelines for scientific document ingestion now have a data‑driven way to pick a parser that meets their accuracy‑vs‑speed trade‑off.
  • Training data pipelines: When curating corpora for LLM pre‑training, the benchmark's scores can be used to filter out low‑quality formula extractions, improving downstream math‑reasoning capabilities (a minimal filtering sketch follows this list).
  • Knowledge‑base construction: Accurate LaTeX extraction enables reliable indexing of equations for search, citation analysis, and automated theorem‑proving assistants.
  • Benchmark as a service: Because the synthetic generator and evaluation scripts are open, teams can continuously benchmark internal OCR improvements without needing costly human annotation.
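
As referenced above, a minimal filtering sketch, assuming each extraction record already carries a judge score under a hypothetical `score` key:

```python
# Hypothetical extraction records: parser output plus its LLM-judge score.
extractions = [
    {"latex": r"E = mc^2", "score": 0.97},
    {"latex": r"E = rnc^2", "score": 0.31},   # typical OCR confusion: m -> rn
    {"latex": r"\frac{a}{b}", "score": 0.88},
]

SCORE_THRESHOLD = 0.8  # assumption; tune for your quality/recall trade-off

clean_corpus = [rec["latex"] for rec in extractions
                if rec["score"] >= SCORE_THRESHOLD]
print(f"kept {len(clean_corpus)} of {len(extractions)} formulas")
```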

Limitations & Future Work

  • Synthetic vs. real PDFs: While synthetic PDFs give perfect ground truth, they may not capture all quirks of scanned legacy documents (e.g., noise, compression artifacts).
  • LLM dependence: The semantic scoring relies on proprietary LLM APIs; variations in model versions could affect reproducibility.
  • Formula complexity ceiling: Extremely long or highly nested expressions (>30 tokens) still see degraded LLM scoring reliability.
  • Future directions suggested by the authors include extending the benchmark to real‑world PDFs with partial human annotation, exploring open‑source LLMs for the judge role, and integrating end‑to‑end pipelines that combine OCR, layout analysis, and formula reconstruction.

Authors

  • Pius Horn
  • Janis Keuper

Paper Information

  • arXiv ID: 2512.09874v1
  • Categories: cs.CV
  • Published: December 10, 2025