A Simple Python Tool for Controlled PDF Text Extraction (PyPDF)
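Overview
This compact command-line Python program extracts text from PDF files in a controlled and predictable way. It is built on the pypdf library and favors reliability over visual layout fidelity, which makes it suitable for preprocessing documents before analysis or conversion.
The program reads the PDF page by page and collects text fragments directly from the content stream. Font-based filtering can be enabled so that only text rendered with specific font names and sizes is extracted; by default the filter is off and all text is captured.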
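Font sizes reported for a PDF are floating-point values and often carry rounding noise, so comparing them with `==` is fragile. The following minimal sketch shows the tolerance-based comparison that this kind of filter relies on. The helper name `size_matches` and the sample values are illustrative assumptions, not part of the script.

```python
import math


def size_matches(actual: float, expected: float, tol: float = 1e-6) -> bool:
    """Return True when two font sizes differ by at most `tol` (absolute)."""
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=tol)


computed = 0.1 + 0.2  # stands in for a size that picked up rounding noise

print(computed == 0.3)               # False: exact comparison fails
print(size_matches(computed, 0.3))   # True: within the tolerance
print(size_matches(12.5, 12.555))    # False: a genuinely different size
```

Setting `rel_tol=0.0` makes the check purely absolute, so the same tolerance applies regardless of how large the font size is.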
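Features
- Optional filtering by exact font name and by font size within a tolerance
- Automatic line break after every period character (`.`)
- Lines ending with a hyphen are merged with the following line, and the trailing hyphen is removed
- Output streamed to standard output for easy piping
- Minimal configuration, centralized at the top of the script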
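The two text-shaping features (the line break after each period and the hyphen merge) can be tried on plain strings, without a PDF. The sketch below assumes each list element stands for the text collected from one page; the function name `postprocess` and the sample strings are hypothetical.

```python
from typing import Iterable, Iterator


def postprocess(pages: Iterable[str]) -> Iterator[str]:
    """Break after every '.', then join lines that end with '-'."""
    carry = ""  # text held back from a line that ended with a hyphen
    for page_text in pages:
        text = page_text.replace(".", ".\n")
        for line in text.splitlines():
            if carry:
                line = carry + line
                carry = ""
            if line.endswith("-"):
                carry = line[:-1]  # drop the hyphen and wait for the next line
                continue
            yield line
    if carry:
        yield carry  # flush whatever is left after the last page


pages = ["An exam-\nple sentence. Split across pa-", "ges here."]
for line in postprocess(pages):
    print(repr(line))
# 'An example sentence.'
# ' Split across pages here.'
```

Two consequences are worth noting. The hyphen merge works across page boundaries because the carry buffer outlives the page loop, and whitespace after a period is preserved, so the second line starts with a space. The period rule is purely character-based, so a decimal such as `3.14` is also split into `3.` and `14`.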
Usage
python extract_pdf_text.py path/to/document.pdf
The script writes the extracted text to stdout, so you can pipe the output to other tools or redirect it to a file.
Code
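As an illustration of piping, here is a sketch of a hypothetical downstream consumer that counts lines and words in a byte stream. It decodes the input as UTF-8 on the assumption that the extractor emits UTF-8 bytes; the function name `count_lines_and_words` is made up for this example.

```python
import io
from typing import BinaryIO, Tuple


def count_lines_and_words(stream: BinaryIO) -> Tuple[int, int]:
    """Count lines and whitespace-separated words in a UTF-8 byte stream."""
    lines = words = 0
    for raw in stream:
        lines += 1
        words += len(raw.decode("utf-8").split())
    return lines, words


# In a real pipeline the stream would be sys.stdin.buffer;
# an in-memory buffer stands in for it here.
sample = io.BytesIO("First sentence.\nSecond one here.\n".encode("utf-8"))
print(count_lines_and_words(sample))  # (2, 5)
```

Saved as a separate script that passes `sys.stdin.buffer` to the function, it could sit on the right-hand side of a pipe after `extract_pdf_text.py`.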
#!/usr/bin/env python3
from __future__ import annotations

import math
import sys
from typing import Iterator, Optional

from pypdf import PdfReader

# =========================
# Extraction conditions (adjust only here if needed)
# =========================
TARGET_FONTS = {
    ("Hoge", 12.555059999999997),
    ("Fuga", 12.945840000000032),
}
SIZE_TOL = 1e-6  # Tolerance for math.isclose

# As in the original code, extraction of all text (font filter disabled) is the default
ENABLE_FONT_FILTER = False


def _normalize_font_name(raw) -> Optional[str]:
    """
    Convert and normalize font information passed from pypdf into a string.
    Example: NameObject('/Hoge') -> 'Hoge'
    """
    if raw is None:
        return None
    s = str(raw)
    if s.startswith("/"):
        s = s[1:]
    return s or None


def is_target_text(font_name: Optional[str], font_size: Optional[float]) -> bool:
    """Determine whether a text fragment is a target for extraction (by font name and size)."""
    if not ENABLE_FONT_FILTER:
        return True
    if font_name is None or font_size is None:
        return False
    for f, sz in TARGET_FONTS:
        if font_name == f and math.isclose(font_size, sz, rel_tol=0.0, abs_tol=SIZE_TOL):
            return True
    return False


def extract_text_stream(fp) -> Iterator[str]:
    """
    - Extract only target text (optionally filtered by font name and size)
    - Replace '.' with '.\\n'
    - If a line ends with '-', merge it with the next line (remove the trailing '-')
    """
    reader = PdfReader(fp)
    carry = ""  # Buffer for joining lines when a line ends with a hyphen
    for page in reader.pages:
        chunks: list[str] = []

        def visitor_text(
            text: str,
            cm,  # current transformation matrix
            tm,  # text matrix
            font_dict,
            font_size: float,
        ):
            # Guard because text may be empty
            if not text:
                return
            # font_dict is often a dict-like object (some PDFs may not provide it)
            base_font = None
            try:
                if font_dict:
                    base_font = font_dict.get("/BaseFont")
            except Exception:
                base_font = None
            font_name = _normalize_font_name(base_font)
            size = float(font_size) if font_size is not None else None
            if is_target_text(font_name, size):
                chunks.append(text)

        # Using visitor_text allows collecting text fragments
        # in the order of the content stream
        page.extract_text(visitor_text=visitor_text)
        s = "".join(chunks)
        if not s:
            continue
        s = s.replace(".", ".\n")
        for line in s.splitlines(keepends=False):
            if carry:
                line = carry + line
                carry = ""
            if line.endswith("-"):
                carry = line[:-1]
                continue
            yield line
    if carry:
        yield carry
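

def main(pdf_path: str) -> None:
    with open(pdf_path, "rb") as f:
        for chunk in extract_text_stream(f):
            # Write UTF-8 bytes directly so the output does not depend on the console encoding
            sys.stdout.buffer.write(chunk.encode("utf-8") + b"\n")


if __name__ == "__main__":
    # Exit with a usage message instead of an IndexError when no path is given
    if len(sys.argv) != 2:
        sys.exit(f"usage: {sys.argv[0]} path/to/document.pdf")
    main(sys.argv[1])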
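
Before enabling the font filter, it helps to know which font names and sizes a given PDF actually uses. The sketch below is one possible way to survey them with the same `visitor_text` hook, counting characters per (font name, size) pair. The helper names `tally_fonts` and `iter_fragments` are assumptions made for this example and are not part of the script above.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple

Fragment = Tuple[Optional[str], Optional[float], str]


def tally_fonts(fragments: Iterable[Fragment]) -> Counter:
    """Count non-blank text characters per (font name, size) pair."""
    counts: Counter = Counter()
    for name, size, text in fragments:
        if text.strip():
            counts[(name, size)] += len(text)
    return counts


def iter_fragments(pdf_path: str) -> Iterator[Fragment]:
    """Yield (font name, size, text) for each fragment pypdf reports."""
    # Imported lazily so tally_fonts can be used without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for page in reader.pages:
        found = []

        def visitor(text, cm, tm, font_dict, font_size):
            base = font_dict.get("/BaseFont") if font_dict else None
            name = str(base).lstrip("/") if base is not None else None
            found.append((name, font_size, text))

        page.extract_text(visitor_text=visitor)
        yield from found


# Demonstration on hand-made fragments (no PDF needed)
demo = [("Hoge", 12.0, "abc"), ("Hoge", 12.0, "de"), ("Fuga", 10.0, "  ")]
print(tally_fonts(demo).most_common())  # [(('Hoge', 12.0), 5)]
```

With a real file, `tally_fonts(iter_fragments("path/to/document.pdf")).most_common(10)` would list the dominant pairs. Printing each size with `repr()` keeps full floating-point precision, which matters when copying values into `TARGET_FONTS` with a tolerance as tight as `1e-6`.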