Why Converting PDFs to Markdown Is Harder Than It Looks
Introduction
When people hear “PDF to Markdown,” it often sounds like a simple text conversion task.
In reality, working with PDFs—especially if you care about structure—is one of the trickiest parsing problems any developer tool can encounter.
I ran into this repeatedly in documentation and LLM workflows, so I built a tool to tackle it. In this post, I’ll dig into why this problem is hard, what usually goes wrong, and how a structure‑aware pipeline can make Markdown outputs much more usable.
PDF Structure Basics
A PDF file does not encode paragraphs, headers, or tables as high‑level concepts the way HTML or Markdown does. Instead it contains:
- Instructions to draw text at specific (x, y) coordinates
- Drawing commands for images, shapes, paths
- Transform matrices
- Optional metadata
There is no “paragraph” object in the format. All structure must be inferred from:
- Geometric proximity
- Font size and style
- Alignment and grouping
This makes the transition from PDF → Markdown fundamentally different from simple “text extraction.”
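To make that concrete, here is roughly what the content stream behind a single line of text looks like (simplified): bare drawing operators, with no paragraph or heading object anywhere in sight.

```
BT                  % begin a text object
/F1 12 Tf           % select font F1 at 12 pt
72 720 Td           % move the text cursor to (x=72, y=720)
(Hello, world) Tj   % paint the glyphs
ET                  % end the text object
```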
Types of PDFs
Native PDFs (text objects)
Many PDFs contain real text objects that can be read natively:
- Extracted via libraries such as PyMuPDF or pdf.js
- Include per‑span positions (bounding boxes)
- Preserve font, glyph, and layout ordering
This is the best case for structural analysis.
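As a minimal sketch of what that metadata looks like, PyMuPDF's `get_text("dict")` exposes per-span bounding boxes, fonts, and sizes (the file name here is hypothetical):

```python
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")  # hypothetical input file
for page_num, page in enumerate(doc):
    # "dict" mode returns blocks -> lines -> spans, with geometry attached
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                x0, y0, x1, y1 = span["bbox"]
                print(page_num, span["font"], span["size"], (x0, y0), repr(span["text"]))
```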
Scanned PDFs (raster images)
Some PDFs are nothing but a stack of raster images (e.g., scans):
- No text objects at all → everything must come from OCR
- No layout metadata remains
These lack block information, so document structure must be reconstructed from visual cues. Detecting which path to take is an essential first step; treating scanned and native documents identically leads to poor outputs.
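A minimal detection heuristic, assuming PyMuPDF and an arbitrary character threshold, might look like this:

```python
import fitz  # PyMuPDF

def is_scanned(path: str, min_chars: int = 50) -> bool:
    """Heuristic: if a document yields almost no native text, treat it as a scan."""
    doc = fitz.open(path)
    total = sum(len(page.get_text().strip()) for page in doc)
    return total < min_chars  # the threshold is a judgment call, not a standard
```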
Common Failure Modes
- Flat text dumps – Many PDF → Markdown tools simply dump text in reading order, resulting in:
  - Line breaks in the wrong places
  - Lost paragraph boundaries
  - Broken lists
  - Missing semantic grouping

  The output may be Markdown, but it is rarely easy to work with.
- Unnecessary OCR on native PDFs – Applying OCR to PDFs that already contain text:
  - Introduces noise
  - Loses formatting
  - Adds unnecessary preprocessing
- Orphaned images – Extracting images without knowing where they belong in the flow loses meaning, because image placement matters in Markdown.
Layout‑Aware Pipeline
The key insight is to treat a PDF as a set of layout blocks, each with:
- Bounding box
- Page number
- Content type (text / image / table / code)
The pipeline then:
- Sort all blocks by ascending (page, y, x).
- Merge spans into paragraphs, and paragraphs into higher-level structures (both steps are sketched in code after this list).
- Reconstruct lists and tables based on geometric heuristics.
- Insert images where they best fit relative to text blocks.
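A minimal sketch of the first two steps, with an assumed Block shape and an assumed gap tolerance (neither is a fixed API):

```python
from dataclasses import dataclass

@dataclass
class Block:
    page: int
    x0: float
    y0: float   # top edge
    y1: float   # bottom edge
    kind: str   # "text" | "image" | "table" | "code"
    text: str = ""

def merge_into_paragraphs(blocks: list[Block], max_gap: float = 4.0) -> list[Block]:
    """Sort by (page, y, x), then fuse vertically adjacent text blocks into paragraphs."""
    blocks = sorted(blocks, key=lambda b: (b.page, b.y0, b.x0))
    merged: list[Block] = []
    for b in blocks:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.kind == b.kind == "text"
                and prev.page == b.page
                and b.y0 - prev.y1 <= max_gap):   # small vertical gap -> same paragraph
            prev.text = f"{prev.text} {b.text}"
            prev.y1 = b.y1                        # extend the paragraph's box downward
        else:
            merged.append(b)
    return merged
```

The gap tolerance is the fragile part: too small and paragraphs fragment, too large and unrelated blocks fuse.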
This approach doesn’t magically discover hidden semantics, but it creates Markdown that:
- Is readable
- Doesn’t require hours of cleanup
- Respects structural relationships better than flat extraction
Handling Scanned PDFs
When native text blocks are absent, all blocks must be derived from visual content:
- Layout info is lost → OCR provides the text.
- Blocks are built from visual region detection.
This is a fundamentally different process from native parsing and must be treated as such. The tool I built detects scanned PDFs automatically and routes them to OCR‑based extraction. While OCR results are inherently noisier than native text extraction, they still provide usable Markdown where naive parsing would fail.
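A sketch of that routing, using pytesseract and a 300 DPI render (both are assumptions on my part, not the only reasonable choices):

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_page_texts(path: str) -> list[str]:
    """Route native pages to direct extraction and image-only pages to OCR."""
    doc = fitz.open(path)
    pages = []
    for page in doc:
        text = page.get_text().strip()
        if not text:                                 # no text objects -> treat as a scan
            pix = page.get_pixmap(dpi=300)           # rasterize the page
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            text = pytesseract.image_to_string(img)  # OCR the rendered image
        pages.append(text)
    return pages
```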
Tables
PDFs don’t represent tables explicitly. You infer structure from:
- Column alignment
- Row proximity
- Grid lines (if present)
Standard Markdown tables cannot express rowspan/colspan. For complex layouts, an HTML table fallback is often preferable.
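As a rough sketch of alignment-based inference, here is the idea reduced to word boxes, assuming each word is already its own cell (a big simplification) and an arbitrary row tolerance:

```python
def words_to_markdown_table(words: list[tuple[float, float, str]],
                            row_tol: float = 3.0) -> str:
    """words = (x, y, text) boxes. Cluster by y into rows, order cells by x, emit Markdown."""
    rows: list[dict] = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1]["y"]) <= row_tol:  # same baseline -> same row
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    lines = []
    for i, row in enumerate(rows):
        cells = [t for _, t in sorted(row["cells"])]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0:                                      # header separator after first row
            lines.append("|" + " --- |" * len(cells))
    return "\n".join(lines)
```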
Lists
Bullets and indentation are visual cues only. Reconstructing nested lists requires:
- Bullet pattern detection
- Relative indentation comparison
- Grouping across lines
These heuristics work reasonably well when implemented carefully.
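A sketch of those heuristics, with an assumed bullet pattern and indent step:

```python
import re

BULLET = re.compile(r"^\s*([-*•]|\d+[.)])\s+")  # common bullet and number markers

def rebuild_list(lines: list[tuple[float, str]], indent_step: float = 18.0) -> list[str]:
    """lines = (x_indent, text) pairs. Map relative indentation to nesting depth."""
    out: list[str] = []
    base = min(x for x, _ in lines)                # leftmost indent = nesting level 0
    for x, text in lines:
        m = BULLET.match(text)
        if not m:
            out.append(text)                       # not a list line; pass through
            continue
        level = round((x - base) / indent_step)    # wider indent -> deeper nesting
        out.append("  " * level + "- " + text[m.end():])
    return out
```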
Code Blocks
Code is often recognizable by:
- Monospaced fonts
- Consistent vertical spacing
- Absence of list/table markers
Distinguishing code accurately improves readability of outputs for technical documentation.
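A sketch of the font-based check, over spans like those returned by the PyMuPDF extraction above (the font-name hints and the 80% threshold are assumptions):

```python
MONO_HINTS = ("mono", "courier", "consolas", "menlo")  # common monospace font names

def looks_like_code(spans: list[dict]) -> bool:
    """Treat a block as code if most of its spans use a monospaced font."""
    if not spans:
        return False
    mono = sum(1 for s in spans if any(h in s["font"].lower() for h in MONO_HINTS))
    return mono / len(spans) > 0.8
```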
Limitations
A perfect round‑trip from PDF to Markdown is impossible in the strict sense:
- PDF has no semantic document model.
- OCR has inherent error rates.
- Layout inference is heuristic.
Nevertheless, a “good enough” solution is one where:
- The Markdown is readable.
- Structural elements aren’t mangled.
- Images and tables aren’t orphaned.
- Minimal manual cleanup is needed.
For documentation, note‑taking, or LLM workflows, this is far more important than pixel‑perfect fidelity.
Conclusion
PDF was designed for printing and visual fidelity, not semantic reuse. Converting it to Markdown is inherently a translation problem—from geometry to structure. A structure‑aware pipeline makes this translation far more reliable than naive extraction, and handling both native and scanned PDFs robustly is essential for real‑world use.
If you’d like to see a practical implementation of these ideas in action, check out the tool I built. Feedback and edge‑case examples are always welcome.