How to Extract Text from PDF in Python (2026)

Published: March 16, 2026 at 05:21 AM EDT
Source: Dev.to

Extracting Text from PDFs

Extracting text from PDFs is still one of the most common tasks in data engineering, AI pipelines, and automation workflows. Whether you’re building a search system, a retrieval‑augmented generation (RAG) pipeline, or simply processing reports, the first step is turning PDFs into clean, usable text.

At first glance this sounds simple, but PDFs were never designed to be machine‑readable in the way modern formats are. A PDF is essentially a set of instructions describing how a page should look, not a structured representation of paragraphs, headings, or tables. That means text may be stored in fragments, positioned arbitrarily, or embedded as images.
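
To make the "instructions, not structure" point concrete, here is a small sketch. The content stream below is hand-written and simplified for illustration (it is not taken from a real file), and the regex is deliberately naive: it only collects the string operands of the Tj and TJ text-showing operators and ignores compression, font encodings, and escaped parentheses that real PDFs use.

```python
import re

# Hand-written, simplified page content stream (illustrative only).
# BT/ET delimit a text object, Tf selects a font, Td moves the text position,
# Tj shows one string, TJ shows an array of strings with spacing adjustments.
CONTENT_STREAM = """
BT
/F1 12 Tf
72 720 Td
(Quarterly re) Tj
[(po) -15 (rt)] TJ
0 -500 Td
(Page 1) Tj
0 480 Td
(Revenue grew.) Tj
ET
"""


def show_text_fragments(stream: str) -> list[str]:
    """Return the literal strings drawn by Tj/TJ, in stream order."""
    fragments = []
    # Match "(...) Tj" or "[...] TJ"; nested or escaped parentheses are ignored.
    pattern = r"\(([^)]*)\)\s*Tj|\[([^\]]*)\]\s*TJ"
    for single, array in re.findall(pattern, stream):
        if array:
            # A TJ array mixes strings with numeric spacing adjustments.
            fragments.extend(re.findall(r"\(([^)]*)\)", array))
        else:
            fragments.append(single)
    return fragments


fragments = show_text_fragments(CONTENT_STREAM)
print(fragments)
```

In this toy stream the word "report" is split across three fragments, and the footer "Page 1" is drawn before the next sentence of body text, so reading the stream in order already yields broken words and a misplaced footer.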

Because of this, naive extraction often produces broken sentences, incorrect reading order, or missing content. Modern tools try to reconstruct structure rather than just reading raw text streams, which is why the choice of extraction method matters.

How PDF Text Extraction Works

Most PDF extraction pipelines follow the same high‑level process:

  1. Parse the document page by page.
  2. Detect text blocks and assemble them into a readable order.
  3. If the document contains scanned pages, apply OCR.
  4. Normalize the output so it can be indexed, searched, or passed to downstream systems.
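
Step 4 is the easiest one to illustrate with the standard library alone. The sketch below is one possible set of heuristics, not a standard: the function name normalize_text and the specific rules (Unicode NFKC folding, rejoining words hyphenated across line breaks, collapsing whitespace) are assumptions chosen for the example.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Clean extracted text for indexing (heuristic sketch)."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature U+FB01 -> "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks; collapse all other whitespace runs.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


sample = "Text extrac-\ntion  from  scanned\n\ufb01les.\n\n\nSecond   paragraph."
normalized = normalize_text(sample)
print(normalized)
```

The de-hyphenation rule will also join genuine compounds that happen to break at a line end, so a production pipeline would likely check candidates against a dictionary before merging them.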

Even though this workflow sounds straightforward, each step contains a surprising amount of complexity. Reading‑order detection becomes difficult in multi‑column layouts or technical documents. Tables introduce another layer of difficulty, because the visual structure does not always map cleanly to text.
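
The multi-column problem can be shown with a toy example. The Block type, the coordinates, and the "split the page at its midpoint" rule below are all assumptions made for illustration; real layout analysis has to discover the number of columns, and PDF coordinates usually have their origin at the bottom left rather than the top.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    x0: float  # left edge of the block
    y0: float  # top edge, measured downward from the top of the page


def reading_order(blocks: list[Block], page_width: float) -> list[str]:
    """Order blocks for a two-column page: whole left column, then right."""
    midpoint = page_width / 2

    def key(block: Block) -> tuple[int, float, float]:
        column = 0 if block.x0 < midpoint else 1
        return (column, block.y0, block.x0)

    return [b.text for b in sorted(blocks, key=key)]


blocks = [
    Block("right column, first paragraph", 320, 100),
    Block("left column, second paragraph", 50, 300),
    Block("left column, first paragraph", 50, 100),
    Block("right column, second paragraph", 320, 300),
]

# Naive top-to-bottom sorting interleaves the two columns.
naive = [b.text for b in sorted(blocks, key=lambda b: (b.y0, b.x0))]
ordered = reading_order(blocks, page_width=612)

print(naive)
print(ordered)
```

The naive ordering jumps between columns on every line, while the column-aware key reads the left column to the bottom before starting the right one.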

This is why many teams eventually move beyond simple PDF libraries to more complete document‑processing frameworks.

Extracting Text from a PDF in Python

In Python, the basic workflow for extracting text looks the same regardless of the library being used:

  1. Load the document.
  2. Parse it.
  3. Convert it into text that can be printed, stored, or processed further.

Different libraries expose different APIs, but the general pattern remains consistent. The real differences appear in how well they handle layout, performance, and OCR.

Using Kreuzberg for PDF Extraction

Modern document pipelines often require more than just reading text streams. They need consistent metadata, reliable handling of different formats, and good performance when processing large batches of files.

Kreuzberg is designed for this type of workload. It uses a Rust‑based extraction engine with Python bindings, which allows efficient document processing while integrating smoothly into Python pipelines. As of March 2026, it also supports 11 other programming languages.

Installation

pip install kreuzberg

Synchronous Extraction

from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)                         # extracted text
print(f"Pages: {result.metadata['page_count']}")  # page count

Asynchronous Extraction

import asyncio
from kreuzberg import extract_file

async def main():
    result = await extract_file("document.pdf")
    print(result.content)                         # extracted text
    print(f"Tables found: {len(result.tables)}")  # number of tables

asyncio.run(main())

Both functions return an ExtractionResult object with:

  • result.content – the extracted text.
  • result.tables – a list of detected tables.
  • result.metadata – document properties (e.g., page count, format).

Batch Extraction (Sync)

from pathlib import Path
from kreuzberg import batch_extract_files_sync

paths = list(Path("documents").glob("*.pdf"))
results = batch_extract_files_sync(paths)

for path, result in zip(paths, results):
    print(f"{path.name}: {len(result.content)} characters")

The batch helpers handle concurrency automatically, making it easy to process many PDFs at once.

OCR for Scanned PDFs

Enable OCR by passing an ExtractionConfig with an OcrConfig.

Tesseract (English)

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng")
)

result = extract_file_sync("scanned.pdf", config=config)
print(result.content)

PaddleOCR (Chinese)

from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig

config = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="zh")
)

result = extract_file_sync("scanned.pdf", config=config)
print(f"Extracted content (preview): {result.content[:100]}")
print(f"Total characters: {len(result.content)}")

Common Runtime Issue

If you encounter a libonnxruntime.so loading error, first install or upgrade onnxruntime:

python -m pip install --upgrade onnxruntime

If the error persists on Linux, add the onnxruntime/capi directory to LD_LIBRARY_PATH (replace <venv_path> with your actual virtual‑environment path and X.Y with your Python version):

export LD_LIBRARY_PATH="/<venv_path>/lib/pythonX.Y/site-packages/onnxruntime/capi:$LD_LIBRARY_PATH"

Kreuzberg supports Tesseract, EasyOCR, and PaddleOCR as back‑ends, which is useful for multilingual documents where backend quality varies by language.
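
Because OCR is much slower than reading an embedded text layer, a common pattern is to try plain extraction first and only fall back to OCR when almost no text comes back. The sketch below shows that decision in isolation. The function name, the min_chars threshold, and the stand-in extractors are hypothetical; whether a given library already performs this fallback automatically is something to check in its documentation.

```python
from typing import Callable


def extract_with_ocr_fallback(
    path: str,
    extract_native: Callable[[str], str],
    extract_ocr: Callable[[str], str],
    min_chars: int = 50,  # assumed threshold; tune on your own documents
) -> tuple[str, str]:
    """Return (text, method), using OCR only when the text layer looks empty."""
    text = extract_native(path)
    if len(text.strip()) >= min_chars:
        return text, "native"
    return extract_ocr(path), "ocr"


# Stand-in extractors so the sketch runs without any PDF library installed.
def fake_native(path: str) -> str:
    return "" if "scanned" in path else "A" * 200


def fake_ocr(path: str) -> str:
    return "text recovered by OCR"


digital = extract_with_ocr_fallback("report.pdf", fake_native, fake_ocr)
scanned = extract_with_ocr_fallback("scanned.pdf", fake_native, fake_ocr)
print(digital[1], scanned[1])
```

In a real pipeline the two callables could wrap the calls shown earlier, for example extract_file_sync(path).content without a config and the same call with an OCR-enabled ExtractionConfig.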

Extracting Tables and Structured Content

Tables are another area where simple extraction approaches struggle. Even when the text is captured correctly, the relationships between rows and columns may be lost.

More advanced extraction pipelines attempt to detect table regions and preserve structure so that data remains usable. This is particularly important in financial reports, research papers, and operational documents where tables often contain the most important information.
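
The difference between "text captured" and "structure preserved" is easy to see on a tiny table. The sketch assumes a table is already available as a list of rows of cell strings with the header in the first row; that shape is an assumption for the example and is not meant to describe the table objects any particular library returns.

```python
def rows_to_records(rows: list[list[str]]) -> list[dict[str, str]]:
    """Turn a header row plus data rows into one dict per data row."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def rows_to_markdown(rows: list[list[str]]) -> str:
    """Render the same rows as a Markdown table for LLM or human consumption."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


rows = [
    ["Quarter", "Revenue", "Margin"],
    ["Q1", "1.2M", "31%"],
    ["Q2", "1.5M", "34%"],
]

# What plain text extraction tends to give you: cells with no row/column link.
flattened = " ".join(cell for row in rows for cell in row)
records = rows_to_records(rows)
markdown = rows_to_markdown(rows)

print(flattened)
print(records)
print(markdown)
```

In the flattened string nothing ties "1.5M" to "Q2" or to "Revenue", whereas the records and the Markdown rendering keep that relationship explicit.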

Performance and Scaling Considerations

Performance starts to matter as soon as you process more than a handful of files. Batch ingestion, RAG pipelines, and search‑indexing workflows may involve thousands or millions of documents, and inefficiencies at the parsing stage quickly become expensive.

Several factors influence performance, including:

  • Implementation – how the parsing engine is written (compiled vs. interpreted).
  • Memory management – efficient use of RAM and streaming of large files.
  • Concurrency support – ability to run multiple parses in parallel.

Tools that rely heavily on interpreted execution or external subprocesses often encounter bottlenecks at scale, while native parsing engines tend to perform better under sustained workloads. This is one reason many modern document‑processing tools use compiled cores with language bindings on top.
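
For libraries that do not ship a batch helper, the concurrency factor can be sketched with the standard library. Here fake_extract is a stand-in that just sleeps, and extract_all and max_workers=4 are illustrative choices. Threads only help when the extraction call waits on I/O or runs native code that releases the GIL; for CPU-bound pure-Python parsers, a process pool is usually the better fit.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fake_extract(path: str) -> str:
    """Stand-in for a real extraction call; sleeps to simulate work."""
    time.sleep(0.05)
    return f"text of {path}"


def extract_all(
    paths: list[str],
    extract: Callable[[str], str],
    max_workers: int = 4,
) -> list[str]:
    """Run extract over all paths in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, paths))


paths = [f"doc_{i}.pdf" for i in range(8)]
start = time.perf_counter()
texts = extract_all(paths, fake_extract)
elapsed = time.perf_counter() - start
print(f"{len(texts)} documents in {elapsed:.2f}s")
```

Because pool.map returns results in input order, the output can be zipped back with the paths exactly as in the batch example above.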

Where PDF Extraction Fits in a Modern Pipeline

In most real systems, text extraction is only the first step. Once text is available, it is typically:

  1. Split into chunks.
  2. Converted into embeddings.
  3. Stored in a vector database for retrieval.

This architecture has become standard for document search and Retrieval‑Augmented Generation (RAG) systems because it allows large collections of documents to be queried efficiently. Reliable extraction is the foundation that makes everything else possible.
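
The first of those steps, chunking, can be sketched as fixed-size character windows with overlap. The function name and the size and overlap defaults are assumptions for the example; production pipelines often split on sentence or token boundaries instead of raw characters.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters sharing `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


text = "".join(str(i % 10) for i in range(500))
chunks = chunk_text(text)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.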

Common Pitfalls

Developers new to PDF extraction often assume that all PDFs behave the same way. In reality, documents vary widely in structure and quality, and a pipeline that works well for one dataset may fail on another.

Tips to avoid pitfalls

  • Test with diverse documents – include scanned files, multi‑column layouts, and large reports. Problems usually appear quickly under realistic conditions.
  • Don’t ignore metadata – page numbers, titles, and document hierarchy often become critical later, especially when building retrieval systems that need to cite sources.
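
One way to act on the metadata tip is to attach the source file and page number to every piece of text at ingestion time, so a retrieval system can cite where an answer came from. The PageRecord type and the assumption that text is available as one string per page are hypothetical; how you obtain per-page text depends on the extraction library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    source: str  # file name or document identifier
    page: int    # 1-based page number
    text: str


def build_records(source: str, pages: list[str]) -> list[PageRecord]:
    """Attach source and page number to each non-empty page of text."""
    return [
        PageRecord(source, number, page_text)
        for number, page_text in enumerate(pages, start=1)
        if page_text.strip()
    ]


def cite(record: PageRecord) -> str:
    """Format a human-readable citation for a retrieved record."""
    return f"{record.source}, p. {record.page}"


pages = ["Executive summary ...", "   ", "Revenue grew in Q2 ..."]
records = build_records("annual_report.pdf", pages)
print([cite(r) for r in records])
```

Blank pages are skipped, but the numbering still follows the original document, so the citation for the third page stays "p. 3".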

Final Thoughts

Extracting text from PDFs in Python is easier than it was a few years ago, but the fundamental challenges of document structure and layout remain. Choosing tools that handle these complexities well can significantly improve the quality of downstream systems, from search to RAG to analytics. Once the ingestion layer is reliable, the rest of the pipeline becomes far easier to design and maintain.
