Why Converting PDFs to Markdown Is Harder Than It Looks
Introduction
When people hear “PDF to Markdown,” it often sounds like a simple text conversion task.
In reality, working with PDFs—especially if you care about structure—is one of the trickiest parsing problems any developer tool can encounter.
I ran into this repeatedly in documentation and LLM workflows, so I built a tool to tackle it. In this post, I’ll dig into why this problem is hard, what usually goes wrong, and how a structure‑aware pipeline can make Markdown outputs much more usable.
PDF Structure Basics
A PDF file does not encode paragraphs, headers, or tables as high‑level concepts the way HTML or Markdown does. Instead it contains:
- Instructions to draw text at specific (x, y) coordinates
- Drawing commands for images, shapes, paths
- Transform matrices
- Optional metadata
There is no “paragraph” object in the format. All structure must be inferred from:
- Geometric proximity
- Font size and style
- Alignment and grouping
This makes the transition from PDF → Markdown fundamentally different from simple “text extraction.”
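To make that concrete, here is roughly what the content stream behind a single line of text looks like (simplified): bare drawing operators, with no paragraph or heading object anywhere in sight.

```
BT                  % begin a text object
/F1 12 Tf           % select font F1 at 12 pt
72 720 Td           % move the text cursor to (x=72, y=720)
(Hello, world) Tj   % paint the glyphs
ET                  % end the text object
```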
Types of PDFs
Native PDFs (text objects)
Many PDFs contain real text objects that can be read natively:
- Extracted via libraries such as PyMuPDF or pdf.js
- Include per‑span positions (bounding boxes)
- Preserve font, glyph, and layout ordering
This is the best case for structural analysis.
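As a minimal sketch of what that metadata looks like, PyMuPDF's `get_text("dict")` exposes per-span bounding boxes, fonts, and sizes (the file name here is hypothetical):

```python
import fitz  # PyMuPDF

doc = fitz.open("report.pdf")  # hypothetical input file
for page_num, page in enumerate(doc):
    # "dict" mode returns blocks -> lines -> spans, with geometry attached
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                x0, y0, x1, y1 = span["bbox"]
                print(page_num, span["font"], span["size"], (x0, y0), repr(span["text"]))
```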
Scanned PDFs (raster images)
Some PDFs are nothing but a stack of raster images (e.g., scans):
- No text objects at all → everything must come from OCR
- No layout metadata remains
These lack block information, so document structure must be reconstructed from visual cues. Detecting which path to take is an essential first step; treating scanned and native documents identically leads to poor outputs.
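A minimal detection heuristic, assuming PyMuPDF and an arbitrary character threshold, might look like this:

```python
import fitz  # PyMuPDF

def is_scanned(path: str, min_chars: int = 50) -> bool:
    """Heuristic: if a document yields almost no native text, treat it as a scan."""
    doc = fitz.open(path)
    total = sum(len(page.get_text().strip()) for page in doc)
    return total < min_chars  # the threshold is a judgment call, not a standard
```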
Common Failure Modes
- Flat text dumps – Many PDF → Markdown tools simply dump text in reading order, resulting in:
  - Line breaks in the wrong places
  - Lost paragraph boundaries
  - Broken lists
  - Missing semantic grouping

  The output may be Markdown, but it is rarely easy to work with.
- Unnecessary OCR on native PDFs – Applying OCR to PDFs that already contain text:
  - Introduces noise
  - Loses formatting
  - Adds unnecessary preprocessing
- Orphaned images – Extracting images without knowing where they belong in the flow loses meaning, because image placement matters in Markdown.
Layout‑Aware Pipeline
The key insight is to treat a PDF as a set of layout blocks, each with:
- Bounding box
- Page number
- Content type (text / image / table / code)
The pipeline then:
- Sort all blocks by ascending (page, y, x).
- Merge spans into paragraphs, and paragraphs into higher-level structures (both steps are sketched in code after this list).
- Reconstruct lists and tables based on geometric heuristics.
- Insert images where they best fit relative to text blocks.
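A minimal sketch of the first two steps, with an assumed Block shape and an assumed gap tolerance (neither is a fixed API):

```python
from dataclasses import dataclass

@dataclass
class Block:
    page: int
    x0: float
    y0: float   # top edge
    y1: float   # bottom edge
    kind: str   # "text" | "image" | "table" | "code"
    text: str = ""

def merge_into_paragraphs(blocks: list[Block], max_gap: float = 4.0) -> list[Block]:
    """Sort by (page, y, x), then fuse vertically adjacent text blocks into paragraphs."""
    blocks = sorted(blocks, key=lambda b: (b.page, b.y0, b.x0))
    merged: list[Block] = []
    for b in blocks:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev.kind == b.kind == "text"
                and prev.page == b.page
                and b.y0 - prev.y1 <= max_gap):   # small vertical gap -> same paragraph
            prev.text = f"{prev.text} {b.text}"
            prev.y1 = b.y1                        # extend the paragraph's box downward
        else:
            merged.append(b)
    return merged
```

The gap tolerance is the fragile part: too small and paragraphs fragment, too large and unrelated blocks fuse.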
This approach doesn’t magically discover hidden semantics, but it creates Markdown that:
- Is readable
- Doesn’t require hours of cleanup
- Respects structural relationships better than flat extraction
Handling Scanned PDFs
When native text blocks are absent, all blocks must be derived from visual content:
- Layout info is lost → OCR provides the text.
- Blocks are built from visual region detection.
This is a fundamentally different process from native parsing and must be treated as such. The tool I built detects scanned PDFs automatically and routes them to OCR‑based extraction. While OCR results are inherently noisier than native text extraction, they still provide usable Markdown where naive parsing would fail.
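A sketch of that routing, using pytesseract and a 300 DPI render (both are assumptions on my part, not the only reasonable choices):

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_page_texts(path: str) -> list[str]:
    """Route native pages to direct extraction and image-only pages to OCR."""
    doc = fitz.open(path)
    pages = []
    for page in doc:
        text = page.get_text().strip()
        if not text:                                 # no text objects -> treat as a scan
            pix = page.get_pixmap(dpi=300)           # rasterize the page
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            text = pytesseract.image_to_string(img)  # OCR the rendered image
        pages.append(text)
    return pages
```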
Tables
PDFs don’t represent tables explicitly. You infer structure from:
- Column alignment
- Row proximity
- Grid lines (if present)
Standard Markdown tables cannot express rowspan/colspan. For complex layouts, an HTML table fallback is often preferable.
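As a rough sketch of alignment-based inference, here is the idea reduced to word boxes, assuming each word is already its own cell (a big simplification) and an arbitrary row tolerance:

```python
def words_to_markdown_table(words: list[tuple[float, float, str]],
                            row_tol: float = 3.0) -> str:
    """words = (x, y, text) boxes. Cluster by y into rows, order cells by x, emit Markdown."""
    rows: list[dict] = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1]["y"]) <= row_tol:  # same baseline -> same row
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    lines = []
    for i, row in enumerate(rows):
        cells = [t for _, t in sorted(row["cells"])]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0:                                      # header separator after first row
            lines.append("|" + " --- |" * len(cells))
    return "\n".join(lines)
```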
Lists
Bullets and indentation are visual cues only. Reconstructing nested lists requires:
- Bullet pattern detection
- Relative indentation comparison
- Grouping across lines
These heuristics work reasonably well when implemented carefully.
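A sketch of those heuristics, with an assumed bullet pattern and indent step:

```python
import re

BULLET = re.compile(r"^\s*([-*•]|\d+[.)])\s+")  # common bullet and number markers

def rebuild_list(lines: list[tuple[float, str]], indent_step: float = 18.0) -> list[str]:
    """lines = (x_indent, text) pairs. Map relative indentation to nesting depth."""
    out: list[str] = []
    base = min(x for x, _ in lines)                # leftmost indent = nesting level 0
    for x, text in lines:
        m = BULLET.match(text)
        if not m:
            out.append(text)                       # not a list line; pass through
            continue
        level = round((x - base) / indent_step)    # wider indent -> deeper nesting
        out.append("  " * level + "- " + text[m.end():])
    return out
```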
Code Blocks
Code is often recognizable by:
- Monospaced fonts
- Consistent vertical spacing
- Absence of list/table markers
Distinguishing code accurately improves readability of outputs for technical documentation.
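A sketch of the font-based check, over spans like those returned by the PyMuPDF extraction above (the font-name hints and the 80% threshold are assumptions):

```python
MONO_HINTS = ("mono", "courier", "consolas", "menlo")  # common monospace font names

def looks_like_code(spans: list[dict]) -> bool:
    """Treat a block as code if most of its spans use a monospaced font."""
    if not spans:
        return False
    mono = sum(1 for s in spans if any(h in s["font"].lower() for h in MONO_HINTS))
    return mono / len(spans) > 0.8
```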
Limitations
A perfect round‑trip from PDF to Markdown is impossible in the strict sense:
- PDF has no semantic document model.
- OCR has inherent error rates.
- Layout inference is heuristic.
Nevertheless, a “good enough” solution is one where:
- The Markdown is readable.
- Structural elements aren’t mangled.
- Images and tables aren’t orphaned.
- Minimal manual cleanup is needed.
For documentation, note‑taking, or LLM workflows, this is far more important than pixel‑perfect fidelity.
Conclusion
PDF was designed for printing and visual fidelity, not semantic reuse. Converting it to Markdown is inherently a translation problem—from geometry to structure. A structure‑aware pipeline makes this translation far more reliable than naive extraction, and handling both native and scanned PDFs robustly is essential for real‑world use.
If you’d like to see a practical implementation of these ideas in action, check out the tool I built. Feedback and edge‑case examples are always welcome.