How to Extract Text from PDF in Python (2026)
Source: Dev.to
Extracting Text from PDFs
Extracting text from PDFs is still one of the most common tasks in data engineering, AI pipelines, and automation workflows. Whether you’re building a search system, a retrieval‑augmented generation (RAG) pipeline, or simply processing reports, the first step is turning PDFs into clean, usable text.
At first glance this sounds simple, but PDFs were never designed to be machine‑readable in the way modern formats are. A PDF is essentially a set of instructions describing how a page should look, not a structured representation of paragraphs, headings, or tables. That means text may be stored in fragments, positioned arbitrarily, or embedded as images.
Because of this, native extraction often produces broken sentences, incorrect reading order, or missing content. Modern tools try to reconstruct structure rather than just reading raw text streams, which is why the choice of extraction method matters.
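To make the problem concrete, here is a minimal sketch of why reading a text stream in storage order can scramble a sentence. The `Fragment` type, the coordinates, and the `line_tolerance` value are all hypothetical and not taken from any particular library. Real extractors work with much richer layout data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    text: str
    x: float  # left edge, in points
    y: float  # distance from the top of the page (an assumption; raw PDF
              # coordinates start at the bottom-left corner)


def stream_order_text(fragments):
    """Join fragments in the order they happen to be stored."""
    return " ".join(f.text for f in fragments)


def layout_order_text(fragments, line_tolerance=2.0):
    """Group fragments into lines by vertical position, then read left to right."""
    lines = []  # each entry is [y of the line, fragments on that line]
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return "\n".join(
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    )


# Fragments stored out of visual order, as can happen in a real PDF.
fragments = [
    Fragment("revenue grew", 150.0, 100.0),
    Fragment("In 2025,", 72.0, 100.5),
    Fragment("year over year.", 140.0, 114.0),
    Fragment("by 12%", 72.0, 114.0),
]

# stream_order_text(fragments) -> "revenue grew In 2025, year over year. by 12%"
# layout_order_text(fragments) -> "In 2025, revenue grew\nby 12% year over year."
```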
How PDF Text Extraction Works
Most PDF extraction pipelines follow the same high‑level process:
- Parse the document page by page.
- Detect text blocks and assemble them into a readable order.
- If the document contains scanned pages, apply OCR.
- Normalize the output so it can be indexed, searched, or passed to downstream systems.
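The normalization step is the easiest one to sketch with the standard library alone. The function below is a hypothetical helper, not part of any extraction library. It folds ligatures, rejoins words hyphenated across line breaks, and collapses whitespace. Note that the dehyphenation rule is deliberately naive: it will also merge genuine compounds that happen to break at a line end.

```python
import re
import unicodedata


def normalize_text(raw: str) -> str:
    """Clean extracted text so it is easier to index or chunk."""
    # NFKC folds compatibility characters, e.g. the "fi" ligature (U+FB01),
    # into their plain equivalents.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words split across lines: "extrac-\ntion" becomes "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines mark paragraph boundaries; single newlines are soft wraps.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)
```

Whether NFKC is appropriate depends on the corpus. It also rewrites superscripts and some symbols, which may not be what you want for scientific text.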
Even though this workflow sounds straightforward, each step contains a surprising amount of complexity. Reading‑order detection becomes difficult in multi‑column layouts or technical documents. Tables introduce another layer of difficulty, because the visual structure does not always map cleanly to text.
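As an illustration of the multi‑column problem, the sketch below orders text blocks column by column instead of strictly top to bottom. The `Block` type and the `column_gap` threshold are assumptions made for this example. Production tools use far more robust layout analysis.

```python
from typing import NamedTuple


class Block(NamedTuple):
    text: str
    x: float  # left edge of the block
    y: float  # distance from the top of the page


def reading_order(blocks, column_gap=100.0):
    """Order blocks column by column, top to bottom within each column."""
    columns = []  # each column is a list of blocks with similar left edges
    for block in sorted(blocks, key=lambda b: b.x):
        # A jump in x larger than column_gap starts a new column.
        if columns and block.x - columns[-1][-1].x < column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y))
    return ordered
```

On a two‑column page, sorting only by vertical position would interleave the columns line by line. Grouping by left edge first keeps each column intact.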
This is why many teams eventually move beyond simple PDF libraries to more complete document‑processing frameworks.
Extracting Text from a PDF in Python
In Python, the basic workflow for extracting text looks the same regardless of the library being used:
- Load the document.
- Parse it.
- Convert it into text that can be printed, stored, or processed further.
Different libraries expose different APIs, but the general pattern remains consistent. The real differences appear in how well they handle layout, performance, and OCR.
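Because the pattern is the same across libraries, one option is to hide the library behind a small interface so it can be swapped later. The sketch below is one possible shape for that. `TextExtractor`, `PlainTextExtractor`, and `ingest` are hypothetical names invented for this example, and the stand‑in backend reads plain text files so the code runs without a PDF library installed.

```python
from pathlib import Path
from typing import Protocol


class TextExtractor(Protocol):
    """Hypothetical interface: anything that turns a file path into text."""

    def extract(self, path: Path) -> str: ...


class PlainTextExtractor:
    """Stand-in backend that reads .txt files instead of parsing PDFs."""

    def extract(self, path: Path) -> str:
        return path.read_text(encoding="utf-8")


def ingest(path: Path, extractor: TextExtractor) -> dict:
    """Load, parse, and convert one document into a storable record."""
    text = extractor.extract(path)
    return {"source": path.name, "characters": len(text), "text": text}
```

A real backend would implement the same `extract` method on top of whichever PDF library you choose, and the rest of the pipeline would not need to change.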
Using Kreuzberg for PDF Extraction
Modern document pipelines often require more than just reading text streams. They need consistent metadata, reliable handling of different formats, and good performance when processing large batches of files.
Kreuzberg is designed for this type of workload. It is built around a Rust‑based extraction engine with bindings for Python and, as of March 2026, 11 other programming languages, so it can process documents efficiently while integrating smoothly into Python pipelines.
Installation
pip install kreuzberg
Synchronous Extraction
from kreuzberg import extract_file_sync
result = extract_file_sync("document.pdf")
print(result.content) # extracted text
print(f"Pages: {result.metadata['page_count']}") # page count
Asynchronous Extraction
import asyncio
from kreuzberg import extract_file
async def main():
    result = await extract_file("document.pdf")
    print(result.content) # extracted text
    print(f"Tables found: {len(result.tables)}") # number of tables
asyncio.run(main())
Both functions return an ExtractionResult object with:
- result.content – the extracted text.
- result.tables – a list of detected tables.
- result.metadata – document properties (e.g., page count, format).
Batch Extraction (Sync)
from pathlib import Path
from kreuzberg import batch_extract_files_sync
paths = list(Path("documents").glob("*.pdf"))
results = batch_extract_files_sync(paths)
for path, result in zip(paths, results):
    print(f"{path.name}: {len(result.content)} characters")
The batch helpers handle concurrency automatically, making it easy to process many PDFs at once.
OCR for Scanned PDFs
Enable OCR by passing an ExtractionConfig with an OcrConfig.
Tesseract (English)
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config = ExtractionConfig(
    ocr=OcrConfig(backend="tesseract", language="eng")
)
result = extract_file_sync("scanned.pdf", config=config)
print(result.content)
PaddleOCR (Chinese)
from kreuzberg import extract_file_sync, ExtractionConfig, OcrConfig
config = ExtractionConfig(
    ocr=OcrConfig(backend="paddleocr", language="zh")
)
result = extract_file_sync("scanned.pdf", config=config)
print(f"Extracted content (preview): {result.content[:100]}")
print(f"Total characters: {len(result.content)}")
Common Runtime Issue
If you encounter a libonnxruntime.so loading error, first install or upgrade onnxruntime:
python -m pip install --upgrade onnxruntime
If the error persists on Linux, add the onnxruntime/capi directory to LD_LIBRARY_PATH (replace <venv_path> with your actual virtual‑environment path and X.Y with your Python version):
export LD_LIBRARY_PATH="/<venv_path>/lib/pythonX.Y/site-packages/onnxruntime/capi:$LD_LIBRARY_PATH"
Kreuzberg supports Tesseract, EasyOCR, and PaddleOCR as backends, which is useful for multilingual documents where backend quality varies by language.
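OCR is slow compared with reading an embedded text layer, so pipelines often decide per page whether it is needed. The heuristic below is a rough sketch of that decision, not something any library is known to do in this form. `needs_ocr` and both thresholds are hypothetical and would need tuning on your own documents.

```python
def needs_ocr(page_text: str, min_chars: int = 50, min_alnum_ratio: float = 0.5) -> bool:
    """Guess whether a page's embedded text is too thin or too noisy to trust."""
    stripped = page_text.strip()
    # Scanned pages usually yield little or no embedded text.
    if len(stripped) < min_chars:
        return True
    # A page dominated by symbols often signals a broken text layer.
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio
```

Pages that fail the check can then be sent through an OCR‑enabled configuration, while the rest keep the faster path.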
Extracting Tables and Structured Content
Tables are another area where simple extraction approaches struggle. Even when the text is captured correctly, the relationships between rows and columns may be lost.
More advanced extraction pipelines attempt to detect table regions and preserve structure so that data remains usable. This is particularly important in financial reports, research papers, and operational documents where tables often contain the most important information.
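Once a table has been detected, one simple way to keep row and column relationships is to serialize it as a Markdown table before it goes into downstream text. The sketch below assumes each table is already available as a list of rows. That input shape and the `rows_to_markdown` helper are assumptions for illustration, not the format any specific library returns.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so cell content cannot break the table structure.
        cells = [str(cell).replace("|", "\\|").strip() for cell in row]
        # Pad short rows so every line has the same number of columns.
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)
```

Keeping the header attached to every table this way means a later chunk still carries the column names that give the numbers meaning.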
Performance and Scaling Considerations
Performance starts to matter as soon as you process more than a handful of files. Batch ingestion, RAG pipelines, and search‑indexing workflows may involve thousands or millions of documents, and inefficiencies at the parsing stage quickly become expensive.
Several factors influence performance, including:
- Implementation – how the parsing engine is written (compiled vs. interpreted).
- Memory management – efficient use of RAM and streaming of large files.
- Concurrency support – ability to run multiple parses in parallel.
Tools that rely heavily on interpreted execution or external subprocesses often encounter bottlenecks at scale, while native parsing engines tend to perform better under sustained workloads. This is one reason many modern document‑processing tools use compiled cores with language bindings on top.
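If your extraction library does not provide its own batch helper, a thread pool is one way to run several parses in parallel. The sketch below is generic: `extract_many` is a hypothetical helper, and `extract_one` stands for whatever single‑file function you use. Threads only help when that function spends its time in native code or I/O that releases the GIL. For pure‑Python parsers a process pool may be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def extract_many(paths, extract_one, max_workers=4):
    """Run extract_one over paths in parallel, keeping failures separate."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                errors[path] = exc
    return results, errors
```

Collecting errors per file instead of raising on the first failure matters at scale, where a small share of corrupt or unusual PDFs is almost guaranteed.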
Where PDF Extraction Fits in a Modern Pipeline
In most real systems, text extraction is only the first step. Once text is available, it is typically:
- Split into chunks.
- Converted into embeddings.
- Stored in a vector database for retrieval.
This architecture has become standard for document search and Retrieval‑Augmented Generation (RAG) systems because it allows large collections of documents to be queried efficiently. Reliable extraction is the foundation that makes everything else possible.
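The first of those steps, chunking, can be sketched in a few lines. The version below splits on character counts with a fixed overlap, which is the simplest possible strategy. `chunk_text` and its default sizes are hypothetical, and real pipelines often split on sentences, headings, or token counts instead.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks that overlap slightly."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# chunk_text("abcdefghij", chunk_size=4, overlap=1) -> ["abcd", "defg", "ghij"]
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, as long as it is shorter than the overlap.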
Common Pitfalls
Developers new to PDF extraction often assume that all PDFs behave the same way. In reality, documents vary widely in structure and quality, and a pipeline that works well for one dataset may fail on another.
Tips to avoid pitfalls
- Test with diverse documents – include scanned files, multi‑column layouts, and large reports. Problems usually appear quickly under realistic conditions.
- Don’t ignore metadata – page numbers, titles, and document hierarchy often become critical later, especially when building retrieval systems that need to cite sources.
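One way to act on the metadata tip is to attach the source and page number to every piece of text at ingestion time, so a retrieval system can cite it later. The sketch below assumes you already have the text of each page as a separate string. `PageChunk` and `chunks_from_pages` are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    source: str  # file name or document identifier
    page: int    # 1-based page number
    text: str

    def citation(self) -> str:
        return f"{self.source}, p. {self.page}"


def chunks_from_pages(source, pages):
    """Build one chunk per non-empty page, keeping the original page numbers."""
    return [
        PageChunk(source=source, page=number, text=text.strip())
        for number, text in enumerate(pages, start=1)
        if text.strip()
    ]
```

Empty pages are skipped but still counted, so the page numbers in citations match the original document.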
Final Thoughts
Extracting text from PDFs in Python is easier than it was a few years ago, but the fundamental challenges of document structure and layout remain. Choosing tools that handle these complexities well can significantly improve the quality of downstream systems—from search to RAG to analytics. Once the ingestion layer is reliable, the rest of the pipeline becomes far easier to design and maintain.