Parse Scanned PDFs for RAG with EasyOCR: Free OCR Gives You Words, Not a Document
Source: Towards Data Science
Part of Enterprise Document Intelligence, the series that builds an enterprise RAG system from four bricks. Article 5 (document parsing) built the parser with PyMuPDF (fitz), which returns empty on a scanned page with no text layer. This companion swaps the engine for EasyOCR, a free OCR package that recovers that text. It is the one case in this family where the new engine gives you less, not more: it recovers the text and nothing around it, and that gap is the lesson.
where this companion sits: it extends Article 5 (document parsing), inside Part II (the four bricks), with a different parsing engine – Image by author
Scanned PDFs are not solved by “just throw OCR at it”. The OCR step recovers text; that’s necessary but not sufficient for an enterprise RAG pipeline. What the pipeline also needs is everything around the text: where the page boundaries are, which lines are section headings, what is a figure, what is a table row vs a free paragraph. “Traditional OCR” (the term of art for text-detection + text-recognition engines like EasyOCR, Tesseract, PaddleOCR) gives you the text. It gives you nothing else. The rest is the layout problem, and the layout problem is the harder half.
This article runs that distinction concretely. The traditional-OCR engine is EasyOCR, JaidedAI’s free text-detection + recognition library and the simplest, fastest option in this family (Apache 2.0, declared in the project’s LICENSE file). The layout-aware engine is Docling (Article 5ter; MIT license, declared in the project’s LICENSE file). Both can OCR a scanned page. They differ on what they do with the result. The whole article is a setup for the head-to-head on a real public-domain 1974 scan in section 5.
EasyOCR is the OCR floor: line_df only, no layout. The rest of the family adds structure – Image by author
1. What “traditional OCR” does (and doesn’t)
Traditional OCR reads pixels and returns text rectangles. Everything else, sections, tables, figures, reading order, is a separate layout problem the engine refuses to look at. The two models behind it are text detection (find rectangular regions of the image that contain text) and text recognition (read each region’s pixels and return characters with a confidence score). The output is a flat list of (bbox, text, confidence) per detected region.
That is everything EasyOCR (or Tesseract, or PaddleOCR) does. A two-column page comes back as a flat list of left-and-right text boxes intermixed by y-coordinate; the engine does not know there are two columns. A table comes back as a grid of disconnected cells the engine cannot tell apart from regular paragraphs. A figure caption is just another text box. Page headers, page footers, and marginalia all show up as boxes too.
Anything that needs “this text is a section heading” or “these four boxes are one table row” needs a second model on top, a layout model. The layout model reads the OCR output plus the page image and classifies each region (heading, paragraph, table cell, figure, caption, footer…) and groups them into a reading order. That is what Article 5bis (Azure DI), Article 5ter (Docling), and Article 5quater (vision LLM) all add over the OCR step. Without one, you have “OCR output”, not “a parsed document”.
2. EasyOCR: the canonical traditional OCR
EasyOCR is the cleanest demonstration of “traditional OCR” as a class. The library is small (~150 MB of model weights cached on first call), free, local, and runs on CPU when no GPU is available (its Reader tries the GPU unless you pass gpu=False, as the snippet below does). The whole library API is two calls: build a Reader for the languages you need, then hand readtext an image. Each detection comes back as a triple: the polygon around the text, the recognised string, and the recogniser’s own confidence.
import easyocr
import fitz
import numpy as np
reader = easyocr.Reader(["en"], gpu=False) # first call downloads ~150 MB
# render page 1 of a scanned PDF to a numpy array EasyOCR can read
page = fitz.open("data/contracts/scanned_amendment.pdf")[0]
pix = page.get_pixmap(matrix=fitz.Matrix(2.0, 2.0)) # 2x zoom = ~144 DPI
img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
    pix.height, pix.width, pix.n,
)
# the recogniser: one image in, one triple per detected text region out
detections = reader.readtext(img)
for quad, text, conf in detections:
    # quad = [[x0,y0], [x1,y0], [x1,y1], [x0,y1]] in pixel coords
    print(round(conf, 2), text)
parse_pdf_easyocr wraps that loop. It walks every page of the PDF, renders each to a numpy array, calls readtext, converts the pixel-space polygons back to PDF coordinates, and packs the detections into the same dict-of-tables contract as the other parsers, same line_df, same parsing_summary, same downstream consumers, except that only those two keys carry data. Every other slot (page_df, image_df, toc_df, span_df, object_registry, cross_ref_df) comes back as an empty DataFrame. That isn’t a missing-feature bug; it’s exactly what “traditional OCR” means.
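The coordinate step in that description can be sketched as follows. This is a minimal illustration, not the series' actual implementation: quad_to_pdf_bbox is a hypothetical helper name, and it assumes the page was rendered with a uniform fitz.Matrix(render_scale, render_scale), with no page rotation and no crop-box offset, so that dividing pixel coordinates by the scale is enough to get back to PDF points.

```python
def quad_to_pdf_bbox(quad, render_scale):
    """Collapse a 4-point pixel polygon into an (x0, y0, x1, y1) box in PDF points.

    Assumption: the page image came from a uniform zoom of `render_scale`
    with no rotation, so pixel / scale == PDF point.
    """
    xs = [float(x) for x, _ in quad]
    ys = [float(y) for _, y in quad]
    return (
        min(xs) / render_scale,
        min(ys) / render_scale,
        max(xs) / render_scale,
        max(ys) / render_scale,
    )


# A made-up detection polygon in pixel space, rendered at 2x zoom.
quad = [[100, 40], [300, 40], [300, 80], [100, 80]]
bbox = quad_to_pdf_bbox(quad, render_scale=2.0)
print(bbox)  # (50.0, 20.0, 150.0, 40.0)
```

Taking the min/max of the four corners also handles slightly skewed polygons, at the cost of returning a box a little larger than the text.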
parsed = parse_pdf_easyocr(
    "data/contracts/scanned_amendment.pdf",
    languages=("en",),         # add "fr", "de", ... for multilingual scans
    render_scale=2.0,          # 2.0 = ~144 DPI ; raise for small fonts
    gpu=False,                 # CPU-only by default ; set True if CUDA available
    confidence_threshold=0.0,  # filter low-confidence detections if needed
)
parsed["line_df"] # text + bbox + confidence per detection
parsed["parsing_summary"] # method, page count, line count, render scale
# Every other key (page_df, image_df, toc_df, span_df, object_registry,
# cross_ref_df) is an empty DataFrame ; EasyOCR has nothing to put there.
The signature kwargs are the only knobs:
- languages: tuple of EasyOCR language codes (en, fr, de, ch_sim, …). Most are ISO 639-1 codes, but EasyOCR names Chinese ch_sim or ch_tra rather than zh. A multilingual corpus loads one Reader per language set; the @lru_cache in get_easyocr_reader keeps a handful of these in memory across calls.
- render_scale: how many pixels per PDF unit when rasterising each page. 1.0 is native (~72 DPI, often too small). 2.0 is the sweet spot for body text. Raise to 3.0 for tiny fonts; lower if you’re memory-bound.
- gpu: CPU is the default so the module works on any machine. CUDA gives a 3-5x speedup on text-heavy pages.
- confidence_threshold: drop low-confidence detections. 0.0 keeps everything (the column is preserved so downstream code can filter), 0.3 cuts most noise on degraded scans.
3. What line_df looks like
Sample rows from the NIST FIPS 199 cover (US Government work, public domain in the US, see NIST copyright statement), one per detected text region: the page coordinate, the OCR’d text, and the recogniser’s own confidence score. That is the whole output.
Same column shape as fitz’s line_df, plus a confidence column EasyOCR adds for free – Image by author
The shape is deliberately small:
- text + bbox: the recogniser’s payload, one row per detected text region.
- confidence: float between 0 and 1, EasyOCR’s self-score. Useful both as a filter (drop below 0.3 on noisy scans) and as a feedback signal (Article 8’s generation can flag low-confidence passages to the user).
- character_count: kept for symmetry with the other parsers; on EasyOCR it’s just len(text).
- No column / reading-order column. A two-column page comes back as a flat list, left-and-right boxes intermixed by y-coordinate.
Every other key in the returned dict (page_df, image_df, toc_df, span_df, object_registry, cross_ref_df) is an empty DataFrame. A consumer that calls parsed["image_df"] does not crash; it iterates an empty frame.
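Using the confidence column as a filter can be sketched in a few lines. To stay dependency-free, the example models line_df rows as plain dicts rather than a DataFrame; filter_by_confidence is a hypothetical helper, and the three rows (including their confidence values) are invented for illustration, not taken from the scan.

```python
def filter_by_confidence(rows, threshold=0.3):
    """Keep rows at or above `threshold`; also report how many were dropped."""
    kept = [row for row in rows if row["confidence"] >= threshold]
    return kept, len(rows) - len(kept)


# Invented sample rows: a clean line, a misread word, and scanner noise.
rows = [
    {"text": "SECURITY EVALUATION", "confidence": 0.94},
    {"text": "Laboralory", "confidence": 0.41},
    {"text": "~ .,", "confidence": 0.08},
]

kept, dropped = filter_by_confidence(rows, threshold=0.3)
print([row["text"] for row in kept], dropped)
```

Note that the misread "Laboralory" survives the filter: a threshold removes noise boxes, not recognition errors the model is fairly sure about. On a real DataFrame the same filter is a boolean mask on the confidence column.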
4. What traditional OCR misses, the layout gap, item by item
The RAG pipeline needs five structural artefacts that traditional OCR cannot produce, regardless of how big the recognition model is. Each missing artefact breaks a downstream operation the rest of the series relies on.
- TOC / sections. Cross-reference resolution (Article 11) and section-scoped corpus retrieval (Article 17) both rely on toc_df. EasyOCR returns zero rows. The dispatcher cannot route “answer in Section 3.2” questions because Section 3.2 has no boundary.
- Individual figures inside the page. A scanned 30-page contract may contain six chart screenshots embedded in the body text. EasyOCR treats the whole page as one image and returns text from around the figures; the figures themselves never become rows. A downstream pipeline that needs to retrieve “the chart on page 14” has no handle.
- Reading order on multi-column / multi-zone pages. A two-column scientific paper page comes back top-to-bottom across both columns intermixed: left-line-1, right-line-1, left-line-2, right-line-2… Generation reads garbage. Sidebars, footnotes, marginalia all leak into the main flow.
- Table cells. A scanned schedule of charges or premium table comes back as a flat list of disconnected text boxes (the row labels in one column, the values in another, the unit headers somewhere else). The relationship “this value belongs to this label” is lost. Article 5 (document parsing) opens on exactly this failure mode (“the parser walked the table cell by cell and joined them into a flat string”). Layout-aware engines run a separate TableFormer-style model to reconstruct rows × columns × headers.
- Font / weight / size signals. OCR recovers character shape, not its typographic encoding. “This line is in bold 18pt” is information the layout engine reads from the page rendering; EasyOCR throws it away. Headings, emphasis, footnotes lose the cue that would have classified them.
Take the third one, reading order, because it is the one that quietly corrupts an answer. EasyOCR returns text boxes sorted by their y-coordinate. On a two-column page the two columns sit at the same heights, so the boxes come back interleaved: first line of the left column, first line of the right column, second line of the left, and so on. The prose reads as a zigzag, and generation quotes the zigzag.
with no layout model, boxes come back sorted by y, so a two-column page interleaves into a zigzag – Image by author
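The zigzag is easy to reproduce on synthetic boxes. The sketch below uses six invented boxes on a hypothetical 600-unit-wide page; order_by_y mimics the y-sorted output described above, and order_two_columns is a deliberately naive fix that splits boxes at the page midline. That second function is not what a layout model does: it only works here because we already know the page has exactly two columns.

```python
# Each box: (x0, y0, x1, y1, text), with y growing downward. Invented data.
boxes = [
    (50, 100, 280, 112, "L1"),
    (320, 100, 550, 112, "R1"),
    (50, 116, 280, 128, "L2"),
    (320, 116, 550, 128, "R2"),
    (50, 132, 280, 144, "L3"),
    (320, 132, 550, 144, "R3"),
]


def order_by_y(boxes):
    """Top-to-bottom, then left-to-right: the order a flat OCR list gives you."""
    return [box[4] for box in sorted(boxes, key=lambda box: (box[1], box[0]))]


def order_two_columns(boxes, page_width):
    """Naive heuristic: assign each box to a column by its horizontal centre."""
    midline = page_width / 2
    left = [box for box in boxes if (box[0] + box[2]) / 2 < midline]
    right = [box for box in boxes if (box[0] + box[2]) / 2 >= midline]
    ordered = sorted(left, key=lambda box: box[1]) + sorted(right, key=lambda box: box[1])
    return [box[4] for box in ordered]


zigzag = order_by_y(boxes)
columns = order_two_columns(boxes, page_width=600)
print(zigzag)   # ['L1', 'R1', 'L2', 'R2', 'L3', 'R3']
print(columns)  # ['L1', 'L2', 'L3', 'R1', 'R2', 'R3']
```

A full-width title, a three-column page, or a sidebar breaks the midline rule immediately, which is why reading order is delegated to a layout model rather than patched with geometry.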
The single sentence: the OCR step recovers text, the layout step recovers what makes the text usable. Article 5ter (Docling) and Article 5bis (Azure DI) add the layout step on top of the same OCR. Article 5quater (vision LLM) folds the two into one call. EasyOCR stops at the OCR step.
5. EasyOCR vs Docling on a real scanned PDF
On the same 1974 scan, Docling extracts more characters (5,423 vs 4,952), the page boundaries, eleven TOC entries, and four figure regions. EasyOCR extracts text rectangles and stops. The two engines agree at the character level, both OCR with the same recogniser-class accuracy, but Docling’s layout pass turns the OCR output into a document.
The interesting comparison is not against fitz (fitz returns zero on a scan) but against the next engine up: Docling, the local layout-aware parser from Article 5ter. The comparison is cleaner than it looks: Docling’s default OCR backend is EasyOCR itself. Same recogniser reading the same pixels; the difference is everything Docling builds around it.
The test case is a real public-domain scan: pages 1–5 of karg74.pdf, the 1974 USAF MULTICS Security Evaluation (Karger & Schell, ESD-TR-74-193 Vol. II). NIST hosts it in their Early Computer Security Papers archive; the work is in the public domain as the output of US Air Force officers. The PDF has Adobe’s “Paper Capture” OCR layer baked in, but we ignore it: both engines re-OCR from page images, which is the realistic scenario when an embedded OCR layer is absent or unreliable.
The real comparison. Both re-OCR the page images; Docling adds layout – Image by author
The two columns tell different stories.
EasyOCR (left). Faster (59.7 s vs 134.4 s, no layout model to load and run), ships the recogniser’s confidence as a column (mean 0.81 on this scan), produces more row-level detections (346 boxes) because every text region in the page becomes one row. Zero structure: no page_df, no toc_df, no image_df. The output is text in bbox form, nothing else.
Docling (right). Slower (about 2.3× the run time), joins detections into 105 lines/paragraphs rather than 346 boxes, no confidence column. The structural gain is real: 5 page_df rows, 11 toc_df entries (Docling’s layout model classifies headings as sections), 4 image_df rows (figures detected inside the page as separate objects). On a PDF with tables, the gap widens further: Docling’s TableFormer recognises rows × columns × headers, which EasyOCR cannot do at all. Article 5ter develops the table case in full.
Both engines OCR with similar character-level error rates on this 1974 scan (Karger → “Karger” by EasyOCR, “Karger” by Docling on the cleanest page; degraded regions yield similar noise on both, “Laboralory”, “und” instead of “and”). The OCR engine inside Docling (EasyOCR or OnnxTR depending on install) is not magically more accurate than calling EasyOCR directly. What Docling adds is how it organises the OCR output, not how it OCRs.
For enterprise RAG, the right call is almost always Docling. The 2.3× compute is paid once at ingestion (the parse cache from Article 5 (document parsing) reuses results forever); the structural gain (TOC, figures, table cells, reading order) is paid back on every downstream query. The one thing Docling does not ship is EasyOCR’s row-level confidence signal, and that signal is rarely worth giving up sections, figures, and tables for.
6. When traditional OCR still earns its keep
EasyOCR is the emergency package of the family: less visibility into the document, simpler dependencies, faster to deploy when the constraint is operational rather than pedagogical. Four narrow cases keep the door open.
- Receipt-class documents. A 1-page invoice or receipt with no headings, no sections, no figures, no tables-with-merged-cells. Layout is trivial; the recogniser is the whole job. Adding Docling’s 2.3× compute buys structure the document doesn’t have.
- Per-region confidence as a generation feedback signal. Generation (Article 8) can read the row’s confidence from the cited passage and warn the user when the answer rests on a 0.3-confidence bit of OCR. Docling does not ship that column. For pipelines where this signal is load-bearing, EasyOCR (or running EasyOCR alongside Docling) is the answer.
- Non-Latin scripts at scale. EasyOCR ships pretrained models for 80+ languages including Chinese, Japanese, Korean, Arabic, Hindi, Cyrillic. Docling’s OCR stack is more limited in non-Latin coverage at the time of writing.
- Operational constraints that block Docling. Corp SSL inspection breaks the HuggingFace model download. Windows blocks symlinks without Developer Mode. The production image has a strict dependency budget. Air-gapped deployment with no way to ship 3 GB of layout weights. In every one of these, EasyOCR’s 150 MB cached model + CPU inference goes through where Docling does not. You see less of the document, but you see something.
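The feedback signal in the second case can be sketched as a small check on the rows behind a cited passage. This is an assumption about how such a warning might be wired, not Article 8's actual generation code: confidence_warning is a hypothetical helper, rows are modelled as plain dicts, and the 0.3 floor simply reuses the threshold mentioned above.

```python
def confidence_warning(cited_rows, floor=0.3):
    """Return a user-facing warning if any cited OCR region is below `floor`."""
    weak = [row for row in cited_rows if row["confidence"] < floor]
    if not weak:
        return None
    lowest = min(row["confidence"] for row in weak)
    return (
        f"{len(weak)} of {len(cited_rows)} cited OCR regions fall below "
        f"confidence {floor:.2f} (lowest: {lowest:.2f}); verify against the scan."
    )


# Invented rows standing in for the line_df rows a cited passage maps to.
cited = [
    {"text": "the premium is due on", "confidence": 0.90},
    {"text": "3l March", "confidence": 0.25},
]
message = confidence_warning(cited)
print(message)
```

Returning None when every region clears the floor lets the caller append the warning only when there is something to say.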
Outside these cases: default to Docling on scans, Azure DI on regulated-cloud-OK shops, vision LLM when the document has handwriting / signatures / a non-textual semantic layer. The adaptive-parsing dispatcher (Article 10) routes automatically.
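The decision rules of this section can be written down as a small function. This is only a sketch of the reasoning above, not the adaptive-parsing dispatcher from Article 10: ScanProfile, its fields, the returned labels, and above all the order in which the rules are checked are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScanProfile:
    """Hypothetical description of a scanned document and its deployment."""
    has_structure: bool = True          # headings, figures, tables, columns
    needs_confidence: bool = False      # per-region confidence is load-bearing
    docling_installable: bool = True    # no SSL / symlink / air-gap blockers
    docling_covers_script: bool = True  # script supported by Docling's OCR stack
    cloud_allowed: bool = False         # regulated-cloud-OK shop
    has_handwriting: bool = False       # handwriting, signatures, etc.


def choose_engine(profile):
    """Pick a parser label; the rule order is an assumption, not the series' logic."""
    if profile.has_handwriting:
        return "vision_llm"
    if not profile.docling_installable or not profile.docling_covers_script:
        return "easyocr"
    if not profile.has_structure:
        return "easyocr"          # receipt-class: the recogniser is the whole job
    if profile.needs_confidence:
        return "docling+easyocr"  # run both, keep the confidence column
    if profile.cloud_allowed:
        return "azure_di"
    return "docling"


print(choose_engine(ScanProfile()))                     # docling
print(choose_engine(ScanProfile(has_structure=False)))  # easyocr
```

A real dispatcher would decide per page and from measured signals rather than hand-set flags; the sketch only makes the fallback order explicit.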
7. Conclusion
OCR recovers characters. Layout recovers what makes the characters useful: sections, figures, table cells, reading order. The default engine for scans is the one that does both. The full Article-5 family lines up along the same axis:
EasyOCR sits at the OCR floor (line_df only); every other engine in the family adds a layout step on top – Image by author
EasyOCR is the OCR floor, what you get when you stop at recognising characters and never ask “and where are they on the page?” The question matters. “Where on the page” makes the difference between a list of text boxes and a parsed document. The dispatcher of Article 10 (adaptive parsing) picks the right engine per page; this article exists so the dispatcher knows what it gives up when it picks the cheap one.
8. Sources and further reading
EasyOCR is the most reachable traditional OCR engine in 2026; PaddleOCR (Baidu) and Tesseract (Google, decades-old) sit beside it in the same family. The layout step on top is what separates “OCR” from “document parsing”; Docling (Article 5ter) and Azure DI (Article 5bis) both add it, on local hardware and in the cloud respectively. The right cross-reading is the layout literature (Smock et al. 2022 for table structure, Auer et al. 2024 for the full layout cascade) and the alternative OCR engines for non-Latin scripts.
Same direction as the article:
- JaidedAI, EasyOCR. The library this article documents, including the 80+ language model packs.
- PaddleOCR (Baidu). Same-class traditional OCR engine; better Chinese coverage, similar layout blindness.
- Tesseract OCR. The decades-old reference, still widely deployed; same architectural shape as EasyOCR (detection + recognition, no layout).
Different angle, different context:
- Auer et al., Docling Technical Report, IBM Research 2024 (arXiv:2408.09869). The layout-aware cascade that turns OCR output into a parsed document; the comparison point in section 5 of this article.
- Smock, Pesala, Abraham, PubTables-1M / Table Transformer (TATR), CVPR 2022 (arXiv:2110.00061). The research lineage behind cell-level table extraction, the single biggest capability EasyOCR lacks.
Earlier in the series:
- Document Intelligence: series intro. What the series builds, brick by brick, and in what order.
- Baseline Enterprise RAG, from PDF to highlighted answer. The four-brick pipeline end to end: PDF in, highlighted answer out.
- Embeddings Aren’t Magic: The Predictable Failure Modes of RAG Retrieval. Where embedding similarity wins (synonyms, typos, paraphrase), where it predictably breaks (unknown terms, negation, term-vs-answer relevance), and how to use it anyway.
- Rerankers Aren’t Magic Either: When the Cross-Encoder Layer Is Worth the Cost. What a cross-encoder adds over bi-encoder embeddings, measured, and when it is worth the latency.
- RAG is not machine learning, and the ML toolkit solves the wrong problem. Why chunk-size sweeps and finetuning optimize the wrong thing; route by question type instead.
- From regex to vision models: which RAG technique fits which problem. Two axes, document complexity and question control, that pick the technique for each case.
- 10 common RAG mistakes we keep seeing in production. Ten production mistakes, organized brick by brick, with the fix for each.
- Beyond extract_text: the two layers of a PDF that drive RAG quality. The first half of the parsing brick: the document’s nature, signals, and summary.
- Stop returning flat text from a PDF: the relational shape RAG needs. The second half of the parsing brick: the relational tables every downstream brick reads.