A Simple Python Tool for Controlled PDF Text Extraction (PyPDF)

Published: January 19, 2026, 16:48 (GMT+8)
Source: Dev.to

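Overview

This compact command-line Python program extracts text from PDF files in a controlled, predictable way. It is built on the pypdf library and favors reliability over visual layout, which makes it suitable for preprocessing documents before analysis or conversion.

The program reads the PDF page by page and collects text fragments directly from the content stream. Font-based filtering can be enabled to extract only text rendered with specific font names and sizes; by default the filter is off and all text is captured.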
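
The font matching can be sketched in isolation. The following is a minimal illustration rather than the script itself: the helper name `matches_font` and the sample font entries are hypothetical, and it only shows how an exact name comparison combined with `math.isclose` and an absolute tolerance behaves.

```python
import math

# Hypothetical target list for illustration: (font name, font size) pairs.
TARGETS = {("Hoge", 12.5), ("Fuga", 13.0)}
TOLERANCE = 1e-6  # absolute tolerance applied to the font size


def matches_font(name, size, targets=TARGETS, tol=TOLERANCE):
    """Return True when the name matches exactly and the size is within tol."""
    if name is None or size is None:
        return False
    return any(
        name == target_name
        and math.isclose(size, target_size, rel_tol=0.0, abs_tol=tol)
        for target_name, target_size in targets
    )


print(matches_font("Hoge", 12.5 + 1e-9))  # size within tolerance
print(matches_font("Hoge", 12.6))         # size too far from the target
print(matches_font("hoge", 12.5))         # name comparison is case-sensitive
```

Font sizes in the script's configuration are floats such as 12.555059999999997, so comparing with a tolerance instead of `==` avoids missing a match because of rounding noise.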

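Features

  • Optional filtering by exact font name and by font size with a tolerance
  • Automatic line break after every period
  • Smart merging of lines that end with a hyphen
  • Output streamed to standard output for easy piping
  • Minimal configuration, concentrated at the top of the script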
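
The two text clean-up features in the list above can be tried without a PDF. This is a minimal sketch under stated assumptions: `postprocess` is a hypothetical helper that takes already-extracted page strings, and the sample input is made up for illustration. It mirrors the idea of breaking after periods and carrying a hyphen-terminated line over to the next one, including across a page boundary.

```python
from typing import Iterable, Iterator


def postprocess(pages: Iterable[str]) -> Iterator[str]:
    """Break after periods and merge hyphen-terminated lines across pages."""
    carry = ""  # text held back from a line that ended with a hyphen
    for page_text in pages:
        for line in page_text.replace(".", ".\n").splitlines():
            line = carry + line
            carry = ""
            if line.endswith("-"):
                carry = line[:-1]  # drop the hyphen and wait for the next line
                continue
            yield line
    if carry:
        yield carry  # flush whatever is left after the last page


# Made-up sample: one hyphen inside a page, one at the page boundary.
sample_pages = ["Text extrac-\ntion works. A trail-", "ing hyphen spans pages."]
for line in postprocess(sample_pages):
    print(repr(line))
```

Note that whitespace following a period is kept, so the second output line starts with a space; strip it downstream if that matters.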

Usage

python extract_pdf_text.py path/to/document.pdf

The script writes the extracted text to stdout, so you can pipe the output to other tools or redirect it to a file.

Code

#!/usr/bin/env python3
from __future__ import annotations

import math
import sys
from typing import Iterator, Optional

from pypdf import PdfReader

# =========================
# Extraction conditions (adjust only here if needed)
# =========================
TARGET_FONTS = {
    ("Hoge", 12.555059999999997),
    ("Fuga", 12.945840000000032),
}
SIZE_TOL = 1e-6  # Tolerance for math.isclose

# The font filter is disabled by default, so all text is extracted
ENABLE_FONT_FILTER = False


def _normalize_font_name(raw) -> Optional[str]:
    """
    Convert and normalize font information passed from pypdf into a string.
    Example: NameObject('/Hoge') -> 'Hoge'
    """
    if raw is None:
        return None
    s = str(raw)
    if s.startswith("/"):
        s = s[1:]
    return s or None


def is_target_text(font_name: Optional[str], font_size: Optional[float]) -> bool:
    """Determine whether a text fragment is a target for extraction (by font name and size)."""
    if not ENABLE_FONT_FILTER:
        return True

    if font_name is None or font_size is None:
        return False

    for f, sz in TARGET_FONTS:
        if font_name == f and math.isclose(font_size, sz, rel_tol=0.0, abs_tol=SIZE_TOL):
            return True
    return False


def extract_text_stream(fp) -> Iterator[str]:
    """
    - Extract only target text (optionally filtered by font name and size)
    - Replace '.' with '.\\n'
    - If a line ends with '-', merge it with the next line (remove the trailing '-')
    """
    reader = PdfReader(fp)

    carry = ""  # Buffer for joining lines when a line ends with a hyphen

    for page in reader.pages:
        chunks: list[str] = []

        def visitor_text(
            text: str,
            cm,  # current transformation matrix
            tm,  # text matrix
            font_dict,
            font_size: float,
        ):
            # Guard because text may be empty
            if not text:
                return

            # font_dict is often a dict-like object (some PDFs may not provide it)
            base_font = None
            try:
                if font_dict:
                    base_font = font_dict.get("/BaseFont")
            except Exception:
                base_font = None

            font_name = _normalize_font_name(base_font)
            size = float(font_size) if font_size is not None else None

            if is_target_text(font_name, size):
                chunks.append(text)

        # Using visitor_text allows collecting text fragments
        # in the order of the content stream
        page.extract_text(visitor_text=visitor_text)

        s = "".join(chunks)
        if not s:
            continue

        s = s.replace(".", ".\n")

        for line in s.splitlines(keepends=False):
            if carry:
                line = carry + line
                carry = ""

            if line.endswith("-"):
                carry = line[:-1]
                continue

            yield line

    if carry:
        yield carry


def main(pdf_path: str) -> None:
    with open(pdf_path, "rb") as f:
        for chunk in extract_text_stream(f):
            sys.stdout.buffer.write(chunk.encode() + b"\n")


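if __name__ == "__main__":
    # Exit with a usage message instead of an IndexError when the path is missing
    if len(sys.argv) != 2:
        sys.exit(f"Usage: {sys.argv[0]} path/to/document.pdf")
    main(sys.argv[1])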
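
One consequence of replacing every '.' is that decimals and abbreviations are split as well: '3.14' ends up on two lines. If that matters for your documents, a narrower rule is possible. The sketch below is an alternative offered as an assumption, not part of the original script: the helper `break_after_sentences` is hypothetical and uses a regular expression to break only when a period is followed by whitespace.

```python
import re

# A period followed by one or more whitespace characters.
_SENTENCE_END = re.compile(r"\.\s+")


def break_after_sentences(text: str) -> str:
    """Insert a newline after a period only when whitespace follows it."""
    return _SENTENCE_END.sub(".\n", text)


sample = "Pi is about 3.14. It is irrational."
print(break_after_sentences(sample))
print(sample.replace(".", ".\n"))  # the script's simpler rule, for comparison
```

The trade-off is that this variant will not break after a period that is immediately followed by the next sentence with no space, which can happen in extracted PDF text.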