A Simple Python Tool for Controlled PDF Text Extraction (PyPDF)

Published: 2 weeks ago (January 19, 2026, 5:48 PM GMT+9)

4 min read

Source: Dev.to

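Overview

This small command-line Python program extracts text from PDF files in a controllable, predictable way. It is built on the pypdf library and favors reliability over visual layout, which makes it suitable for preprocessing documents before analysis or conversion.

The program reads the PDF page by page and collects text fragments directly from the content stream. Font-based filtering can be enabled so that only text rendered with specific font names and sizes is extracted. By default the filter is disabled and all text is captured.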
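
The font match described above can be sketched in isolation. This is a minimal illustration, assuming an exact name comparison and an absolute size tolerance; the names matches, TARGETS and TOL are hypothetical, and "Hoge" is a placeholder rather than a real font.

```python
import math

# Hypothetical target list: placeholder font name with a size as a PDF might report it.
TARGETS = {("Hoge", 12.555059999999997)}
TOL = 1e-6  # absolute tolerance for the size comparison


def matches(font_name, font_size, targets=TARGETS, tol=TOL):
    """Return True when the name matches exactly and the size is within tol."""
    return any(
        font_name == name
        and math.isclose(font_size, size, rel_tol=0.0, abs_tol=tol)
        for name, size in targets
    )


print(matches("Hoge", 12.55506))  # size differs only by rounding noise: accepted
print(matches("Hoge", 12.56))     # size is off by about 0.005: rejected
print(matches("hoge", 12.55506))  # name comparison is case-sensitive: rejected
```

Comparing sizes with a tolerance rather than with == matters because the sizes reported for a page are computed floats and rarely round to a tidy value.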

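Features

Optional filtering by exact font name and by font size within a tolerance
Automatic line break after every period
Lines ending in a hyphen are merged with the next line, and the trailing hyphen is removed
Output is streamed to standard output for easy use in pipelines
Minimal configuration, centralized at the top of the script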
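
The period and hyphen handling listed above can be tried on plain strings without a PDF. The following is a rough sketch under that assumption; postprocess and the sample text are hypothetical and only mirror the two text rules.

```python
from typing import Iterable, Iterator


def postprocess(pages: Iterable[str]) -> Iterator[str]:
    """Break after periods and join lines that end with a hyphen."""
    carry = ""  # text held back from a line that ended with a hyphen
    for text in pages:
        for line in text.replace(".", ".\n").splitlines():
            line = carry + line
            carry = ""
            if line.endswith("-"):
                carry = line[:-1]  # drop the hyphen and wait for the next line
                continue
            yield line
    if carry:
        yield carry  # flush whatever is left after the last page


# Each list item stands in for the text of one page.
sample = ["Text extrac-\ntion works. Next sen-", "tence ends here."]
for line in postprocess(sample):
    print(repr(line))
```

Because the carry buffer outlives the page loop, a word split across a page boundary is joined as well. Two side effects are worth knowing: whitespace after a period is kept at the start of the next line, and every period is treated as a sentence end, so a decimal such as 3.14 is split too.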

Usage

python extract_pdf_text.py path/to/document.pdf

The script writes the extracted text to stdout, so you can pipe the output into other tools or redirect it to a file.

Code

#!/usr/bin/env python3
from __future__ import annotations

import math
import sys
from typing import Iterator, Optional

from pypdf import PdfReader

# =========================
# Extraction conditions (adjust only here if needed)
# =========================
TARGET_FONTS = {
    ("Hoge", 12.555059999999997),
    ("Fuga", 12.945840000000032),
}
SIZE_TOL = 1e-6  # Tolerance for math.isclose

# As in the original code, extraction of all text (font filter disabled) is the default
ENABLE_FONT_FILTER = False


def _normalize_font_name(raw) -> Optional[str]:
    """
    Convert and normalize font information passed from pypdf into a string.
    Example: NameObject('/Hoge') -> 'Hoge'
    """
    if raw is None:
        return None
    s = str(raw)
    if s.startswith("/"):
        s = s[1:]
    return s or None


def is_target_text(font_name: Optional[str], font_size: Optional[float]) -> bool:
    """Determine whether a text fragment is a target for extraction (by font name and size)."""
    if not ENABLE_FONT_FILTER:
        return True

    if font_name is None or font_size is None:
        return False

    for f, sz in TARGET_FONTS:
        if font_name == f and math.isclose(font_size, sz, rel_tol=0.0, abs_tol=SIZE_TOL):
            return True
    return False


def extract_text_stream(fp) -> Iterator[str]:
    """
    - Extract only target text (optionally filtered by font name and size)
    - Replace '.' with '.\\n'
    - If a line ends with '-', merge it with the next line (remove the trailing '-')
    """
    reader = PdfReader(fp)

    carry = ""  # Buffer for joining lines when a line ends with a hyphen

    for page in reader.pages:
        chunks: list[str] = []

        def visitor_text(
            text: str,
            cm,  # current transformation matrix
            tm,  # text matrix
            font_dict,
            font_size: float,
        ):
            # Guard because text may be empty
            if not text:
                return

            # font_dict is often a dict-like object (some PDFs may not provide it)
            base_font = None
            try:
                if font_dict:
                    base_font = font_dict.get("/BaseFont")
            except Exception:
                base_font = None

            font_name = _normalize_font_name(base_font)
            size = float(font_size) if font_size is not None else None

            if is_target_text(font_name, size):
                chunks.append(text)

        # Using visitor_text allows collecting text fragments
        # in the order of the content stream
        page.extract_text(visitor_text=visitor_text)

        s = "".join(chunks)
        if not s:
            continue

        s = s.replace(".", ".\n")

        for line in s.splitlines(keepends=False):
            if carry:
                line = carry + line
                carry = ""

            if line.endswith("-"):
                carry = line[:-1]
                continue

            yield line

    if carry:
        yield carry


def main(pdf_path: str) -> None:
    with open(pdf_path, "rb") as f:
        for chunk in extract_text_stream(f):
            sys.stdout.buffer.write(chunk.encode() + b"\n")


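if __name__ == "__main__":
    # Exit with a usage message instead of an IndexError when no path is given
    if len(sys.argv) != 2:
        sys.exit(f"usage: {sys.argv[0]} path/to/document.pdf")
    main(sys.argv[1])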
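
Before enabling the font filter, the exact names and sizes to put into TARGET_FONTS have to come from somewhere. One possible approach, sketched here as an assumption rather than as part of the original tool, is to tally the characters seen per font and size using the same visitor_text callback. The helpers tally_fonts, collect_fragments and report are hypothetical names.

```python
from collections import Counter
from typing import Iterable, Iterator, Optional, Tuple

Fragment = Tuple[Optional[str], Optional[float], str]


def tally_fonts(fragments: Iterable[Fragment]) -> Counter:
    """Count characters per (font name, font size) pair."""
    counts: Counter = Counter()
    for name, size, text in fragments:
        counts[(name, size)] += len(text)
    return counts


def collect_fragments(pdf_path: str) -> Iterator[Fragment]:
    """Yield (font name, font size, text) for each fragment pypdf reports."""
    from pypdf import PdfReader  # lazy import: tally_fonts works without pypdf

    for page in PdfReader(pdf_path).pages:
        found = []

        def visitor(text, cm, tm, font_dict, font_size):
            if not text.strip():
                return  # skip empty and whitespace-only fragments
            base = font_dict.get("/BaseFont") if font_dict else None
            name = str(base).lstrip("/") if base is not None else None
            size = float(font_size) if font_size is not None else None
            found.append((name, size, text))

        page.extract_text(visitor_text=visitor)
        yield from found


def report(pdf_path: str) -> None:
    """Print one tab-separated line per font/size pair, most frequent first."""
    for (name, size), chars in tally_fonts(collect_fragments(pdf_path)).most_common():
        print(f"{name!r}\t{size!r}\t{chars}")
```

Calling report("path/to/document.pdf") prints each size with repr, so the full float can be copied into TARGET_FONTS unchanged. Embedded subset fonts usually report a BaseFont with a six-letter prefix such as ABCDEF+Name, and since the filter compares names exactly, the whole string as printed is what needs to be copied.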
