[Paper] PubMed-OCR: PMC Open Access OCR Annotations
Source: arXiv - 2601.11425v1
Overview
The PubMed-OCR dataset turns the massive collection of open‑access biomedical PDFs in PubMed Central into a machine‑readable, layout‑aware resource. By running Google Cloud Vision OCR over 1.5 million pages and packaging the results in a lightweight JSON schema, the authors give developers ready‑to‑use annotations for tasks that need both text and its visual coordinates (e.g., document layout analysis, OCR‑aware question answering, and end‑to‑end scientific‑paper pipelines).
Key Contributions
- Largest OCR‑annotated scientific‑paper corpus to date: ~209 K articles, 1.5 M page images, and ~1.3 B word tokens.
- Rich hierarchical annotations (word, line, paragraph) with precise bounding boxes, all stored in a compact, query‑friendly JSON format.
- Open‑access release under a permissive license, enabling reproducible research and easy integration into existing pipelines.
- Baseline analyses of journal coverage, layout diversity (tables, figures, multi‑column text), and OCR quality metrics.
- Discussion of practical constraints (single OCR engine, heuristic line reconstruction) to guide future extensions.
Methodology
- Corpus selection – Harvested all Open Access PDFs from PubMed Central (PMC) that are freely downloadable and legally reusable.
- Image extraction – Rasterized each PDF page into a high‑resolution PNG for OCR processing.
- OCR processing – Used Google Cloud Vision (GCV) as the sole OCR backend; GCV returns word‑level text together with x‑y coordinates.
- Post‑processing (see the sketch after this list):
  - Line reconstruction – Merged words whose bounding boxes aligned horizontally and were within a distance threshold into lines.
  - Paragraph grouping – Clustered consecutive lines with similar indentation and vertical spacing into paragraphs.
- Schema design – Stored annotations per page in a JSON object containing three top‑level arrays (words, lines, paragraphs). Each entry holds the text string and a list of four corner coordinates, making it trivial to overlay the data on the original image.
- Quality checks – Computed basic OCR metrics (character error rate on a small hand‑annotated subset) and inspected layout statistics (column count, presence of figures/tables) to verify coverage and spot systematic errors.
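A minimal sketch of the line‑reconstruction heuristic over the described schema. The record below mirrors the three top‑level arrays, with a text string and four corner coordinates per entry, but the key names (`text`, `box`) and the merge thresholds are assumptions, not the released format:

```python
# Sketch: heuristic line reconstruction over word-level OCR output.
# The record mirrors the described schema (three top-level arrays, each
# entry a text string plus four corner coordinates); the key names
# ("text", "box") are illustrative, not the dataset's actual keys.
page = {
    "words": [
        {"text": "PubMed",   "box": [[10, 20], [92, 20], [92, 38], [10, 38]]},
        {"text": "Central",  "box": [[98, 21], [170, 21], [170, 39], [98, 39]]},
        {"text": "Abstract", "box": [[10, 60], [88, 60], [88, 78], [10, 78]]},
    ],
    "lines": [],       # filled by line reconstruction
    "paragraphs": [],  # filled by a later grouping pass
}

def reconstruct_lines(words, y_tol=5, x_gap=15):
    """Greedily merge words into lines: a word joins the current line when
    its vertical center is within y_tol of the previous word's and the
    horizontal gap is at most x_gap. Both thresholds are assumptions; a
    real pipeline would likely cluster rows before sorting by x."""
    def center_y(w):
        return sum(p[1] for p in w["box"]) / 4
    def left_x(w):
        return min(p[0] for p in w["box"])
    def right_x(w):
        return max(p[0] for p in w["box"])

    lines, current = [], []
    for w in sorted(words, key=lambda w: (center_y(w), left_x(w))):
        same_row = current and abs(center_y(w) - center_y(current[-1])) <= y_tol
        close = current and 0 <= left_x(w) - right_x(current[-1]) <= x_gap
        if same_row and close:
            current.append(w)
        else:
            if current:
                lines.append(current)
            current = [w]
    if current:
        lines.append(current)
    return [" ".join(w["text"] for w in line) for line in lines]

print(reconstruct_lines(page["words"]))  # ['PubMed Central', 'Abstract']
```

Overlaying the boxes on a rasterized page is then a few lines with Pillow (the filename here is hypothetical):

```python
# Draw word boxes onto the rasterized page image with Pillow.
from PIL import Image, ImageDraw

img = Image.open("page_0001.png").convert("RGB")  # hypothetical filename
draw = ImageDraw.Draw(img)
for w in page["words"]:
    draw.polygon([tuple(p) for p in w["box"]], outline=(255, 0, 0))
img.save("page_0001_overlay.png")
```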
Results & Findings
- Coverage: The dataset spans a broad spectrum of biomedical journals, with >90 % of PMC’s Open Access titles represented.
- Layout diversity: Approximately 45 % of pages are multi‑column; 12 % contain embedded figures or tables, confirming that the corpus captures realistic scientific‑paper layouts.
- OCR accuracy: On a 5 K‑word validation set, the GCV engine achieved a character error rate (CER) of ~2.8 % and a word error rate (WER) of ~5.4 %, comparable to other large‑scale OCR benchmarks (a sketch of how these metrics are computed follows this list).
- Data compactness: The JSON representation reduces storage to ~150 GB (≈ 0.1 GB per 1 M words), far smaller than raw image + OCR text dumps, facilitating fast loading in training loops.
- Baseline tasks: Demonstrated two downstream use‑cases—(a) a layout‑aware named‑entity recognizer that leverages paragraph coordinates, and (b) a coordinate‑grounded question‑answering model that can point to the exact region on a page where an answer appears.
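For readers who want to reproduce this style of evaluation, here is a minimal CER/WER sketch based on Levenshtein edit distance. The example strings are invented, and the paper's exact normalization (casing, whitespace, punctuation) is not specified here:

```python
# Minimal CER/WER computation via Levenshtein edit distance.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)

ref = "the mitochondrion is the powerhouse of the cell"
hyp = "the mitochondrian is the powerhouse of the ce1l"
print(f"CER = {cer(ref, hyp):.3f}, WER = {wer(ref, hyp):.3f}")
```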
Practical Implications
- Accelerated OCR‑dependent pipelines – Developers building literature‑mining tools can skip the costly OCR step and directly ingest high‑quality, spatially indexed text.
- Layout‑aware NLP models – By feeding coordinate information, models can learn to differentiate headings, captions, and body text, improving entity extraction and summarization for scientific documents.
- Document AI research – The dataset serves as a benchmark for multi‑modal tasks such as visual document understanding, table extraction, and figure‑caption linking, all active problems with direct industry analogues (e.g., automated contract analysis, invoice processing).
- Fine‑grained QA and retrieval – Enables “search‑by‑region” in digital libraries, or chat‑bots that point users to the exact page snippet via coordinate‑grounded answers (see the sketch after this list).
- Open‑source ecosystem – Because the schema is JSON‑first and the data is openly licensed, it can be seamlessly integrated with popular ML frameworks (PyTorch, TensorFlow) and data‑processing tools (Apache Arrow, Dask).
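As a sketch of the “search‑by‑region” idea, the helper below returns the words whose boxes fall inside a query rectangle; it assumes the same illustrative record layout as the methodology sketch above:

```python
# Sketch of "search-by-region": return the words whose bounding boxes fall
# inside a query rectangle on the page. Assumes the illustrative layout
# (word dicts with "text" and a four-corner "box") used earlier.
def words_in_region(words, x0, y0, x1, y1):
    """Return the text of every word whose box lies fully inside
    the axis-aligned rectangle (x0, y0)-(x1, y1)."""
    hits = []
    for w in words:
        xs = [p[0] for p in w["box"]]
        ys = [p[1] for p in w["box"]]
        if min(xs) >= x0 and max(xs) <= x1 and min(ys) >= y0 and max(ys) <= y1:
            hits.append(w["text"])
    return hits

# e.g. everything in the top-left quadrant of a 612x792-point page:
# words_in_region(page["words"], 0, 0, 306, 396)
```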
Limitations & Future Work
- Single OCR engine – Relying exclusively on Google Cloud Vision inherits its systematic biases (e.g., struggles with certain fonts or low‑contrast figures). A multi‑engine ensemble could improve robustness.
- Heuristic line reconstruction – Rule‑based merging of words into lines may mis‑group words in dense tables or heavily formatted sections; a learned line‑segmentation model could replace this step.
- Domain focus – While biomedical literature is vast, the dataset does not cover other scientific domains (physics, computer science) where layout conventions differ. Extending the pipeline to other corpora would broaden applicability.
- Ground‑truth validation – Only a small subset was manually verified for OCR errors; larger human‑in‑the‑loop evaluations would provide stronger confidence for high‑stakes applications.
The authors invite the community to contribute additional OCR back‑ends, improve line/paragraph heuristics, and expand the corpus beyond PubMed Central, turning PubMed‑OCR into a living benchmark for document‑centric AI.
Authors
- Hunter Heidenreich
- Yosheb Getachew
- Olivia Dinica
- Ben Elliott
Paper Information
- arXiv ID: 2601.11425v1
- Categories: cs.CV, cs.CL, cs.DL, cs.LG
- Published: January 16, 2026