Automatically Extracting Images from DOCX Files with Python

Published: 1 month ago (December 23, 2025, 10:54 GMT+8)

10 min read

Source: Dev.to

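Extracting Images from Word Documents with Python

Word documents often hold more than text. Images are an essential part of many reports, presentations, and creative works. Extracting those embedded images by hand, especially across a large number of documents, quickly becomes tedious, time-consuming, and error-prone. Imagine opening dozens (or even hundreds) of Word files, right-clicking each image, and saving them one at a time. That approach does not scale, and it slows down data processing and content management.

Fortunately, Python offers an elegant alternative. With a dedicated library, we can automate the whole image-extraction workflow and turn a tedious manual chore into a fast, script-driven operation. This tutorial walks you through building a Python script that extracts images from Word documents in bulk, which saves time and reduces manual mistakes.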

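Why You Need a Library

Word documents, especially in the modern .docx format, are not simple text files. A .docx file is a ZIP archive containing multiple XML files, media files, and other resources that define the document's structure, content, and styling. Plain text parsing cannot pull out embedded objects such as images. To work with them programmatically, you need a library that understands and can traverse this structure.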
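To make the ZIP structure concrete, here is a minimal sketch that uses only Python's standard `zipfile` module to list the media entries inside a .docx package. It assumes the layout Word typically produces, with embedded media stored under `word/media/`. The function names `list_docx_media` and `build_demo_archive` are illustrative and not part of any library, and the demo archive only mimics the folder layout; it is not a valid Word document.

```python
import io
import zipfile


def list_docx_media(docx_source):
    """Return the names of entries stored under word/media/ in a .docx archive.

    `docx_source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )


def build_demo_archive():
    """Build a tiny in-memory ZIP that mimics the folder layout of a .docx file.

    Assumption for illustration only: this is NOT a valid Word document.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/media/image1.png", b"\x89PNG\r\n\x1a\n")
        archive.writestr("word/media/image2.jpeg", b"\xff\xd8\xff")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for entry in list_docx_media(build_demo_archive()):
        print(entry)
```

A listing like this shows which media files the package contains, but not where they appear in the document, and it does not apply to the legacy binary .doc format. That positional and format-aware access is what a document library adds.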

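Introducing Spire.Doc for Python

We will use Spire.Doc for Python, a library for creating, reading, editing, and converting Word documents in Python applications. Its API gives developers access to the elements of a document, including paragraphs, tables, shapes, and, most importantly here, embedded images, which makes image extraction straightforward.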

1. Install the Library

pip install spire.doc

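2. Set Up the Project Structure

Create two folders in the same directory as your script:

InputDocuments – put the Word files you want to process here.
ExtractedImages – extracted images are saved to this folder (the script creates it if it is missing).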

3. Import Modules and Define Directories

import os
import queue  # For traversing document elements
from spire.doc import *
from spire.doc.common import *

# Define input and output directories
INPUT_DIR = "InputDocuments"      # Folder containing your Word documents
OUTPUT_DIR = "ExtractedImages"    # Folder to save extracted images

# Create output directory if it doesn't exist
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

print(f"Input directory: {os.path.abspath(INPUT_DIR)}")
print(f"Output directory: {os.path.abspath(OUTPUT_DIR)}")

4. Core Extraction Logic

def extract_images_from_doc(doc_path, output_folder):
    """
    Extract the images from a single Word document and save them to the given folder.
    """
    document = Document()
    try:
        document.LoadFromFile(doc_path)
    except Exception as e:
        print(f"Error loading document {doc_path}: {e}")
        return

    extracted_images_count = 0
    nodes = queue.Queue()
    nodes.put(document)

    while not nodes.empty():
        node = nodes.get()
        # Iterate through all child objects of the current node
        for i in range(node.ChildObjects.Count):
            child = node.ChildObjects.get_Item(i)

            # -------------------------------------------------
            # Image extraction
            # -------------------------------------------------
            if child.DocumentObjectType == DocumentObjectType.Picture:
                picture = child if isinstance(child, DocPicture) else None
                if picture is not None:
                    image_bytes = picture.ImageBytes

                    # Build a sequential filename. NOTE: the .png extension is assumed,
                    # not detected, so a JPEG or GIF source still gets a .png name.
                    image_filename = f"image_{extracted_images_count + 1}.png"
                    image_filepath = os.path.join(output_folder, image_filename)

                    try:
                        with open(image_filepath, "wb") as img_file:
                            img_file.write(image_bytes)
                        extracted_images_count += 1
                        print(f"  Extracted: {image_filepath}")
                    except Exception as e:
                        print(f"    Error saving image to {image_filepath}: {e}")

            # -------------------------------------------------
            # Keep traversing composite objects
            # -------------------------------------------------
            elif isinstance(child, ICompositeObject):
                nodes.put(child)

    document.Close()
    print(f"Finished processing '{os.path.basename(doc_path)}' – "
          f"{extracted_images_count} image(s) extracted.")

5. Batch-Process Multiple Documents

def batch_extract_images(input_dir, output_dir):
    """
    Processes all .docx and .doc files in the input directory,
    extracting images from each into its own subfolder of the output directory.
    """
    for filename in os.listdir(input_dir):
        if filename.lower().endswith(('.docx', '.doc')):
            doc_path = os.path.join(input_dir, filename)
            # Create a subfolder for each document's images (optional)
            doc_output_folder = os.path.join(output_dir, os.path.splitext(filename)[0])
            if not os.path.exists(doc_output_folder):
                os.makedirs(doc_output_folder)

            print(f"\nProcessing: {doc_path}")
            extract_images_from_doc(doc_path, doc_output_folder)

if __name__ == "__main__":
    batch_extract_images(INPUT_DIR, OUTPUT_DIR)

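Running the script will:

Scan the InputDocuments directory for .docx/.doc files.
Create a dedicated subfolder in ExtractedImages for each document.
Extract every embedded image and save it with a sequential name (image_1.png, image_2.png, …).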
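One practical wrinkle with the directory scan: while a document is open in Word, Word leaves a hidden owner file next to it whose name starts with `~$` and which carries the same extension. The extension check in `batch_extract_images` would pick that file up, and the load would then end in the error branch. The sketch below shows one way to collect candidate files while skipping those lock files. The helper name `find_word_files` and the constant `WORD_EXTENSIONS` are illustrative assumptions, not part of the original script or of Spire.Doc.

```python
import tempfile
from pathlib import Path

# Extensions the batch script accepts (same pair as in batch_extract_images).
WORD_EXTENSIONS = {".docx", ".doc"}


def find_word_files(input_dir):
    """Return sorted paths of Word files in `input_dir`, skipping Word lock files."""
    candidates = []
    for path in Path(input_dir).iterdir():
        if not path.is_file():
            continue  # ignore subdirectories
        if path.name.startswith("~$"):
            continue  # owner/lock file left behind by an open Word session
        if path.suffix.lower() in WORD_EXTENSIONS:
            candidates.append(path)
    return sorted(candidates)


if __name__ == "__main__":
    # Demo with empty placeholder files; no real documents are needed.
    with tempfile.TemporaryDirectory() as folder:
        for name in ("report.docx", "~$report.docx", "legacy.DOC", "notes.txt"):
            (Path(folder) / name).touch()
        for path in find_word_files(folder):
            print(path.name)
```

The returned paths could then be fed to `extract_images_from_doc` in place of the `os.listdir` loop.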

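6. Summary

Problem: extracting images by hand from a large number of Word files is inefficient.
Solution: use Spire.Doc for Python to traverse the document structure programmatically and save the embedded images.
Result: a reusable script that extracts images in bulk, saving time and reducing errors.

Feel free to adjust the script to your needs (for example, change the naming rule, support other formats, or integrate it into another workflow). Happy coding!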

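Extracting Images from Word Documents with Spire.Doc for Python

Below is a complete, ready-to-run script that:

Scans a folder for all .docx files.
Creates a dedicated subfolder for each document under the ExtractedImages directory.
Extracts all embedded images (including those inside tables, text boxes, and similar containers) and saves them to the corresponding folder.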

1. Imports and Global Paths

import os
import sys
from spire.doc import Document, DocumentObjectType

# ── USER‑CONFIGURABLE PATHS ────────────────────────────────────────
INPUT_DIR  = "WordDocs"          # Folder that contains the .docx files
OUTPUT_DIR = "ExtractedImages"   # Where extracted images will be saved
# ───────────────────────────────────────────────────────────────────────

2. Helper Function: Extract Images from a Single Document

def extract_images_from_doc(doc_path: str, output_folder: str) -> int:
    """
    Extracts all images from a single Word document and saves them to
    `output_folder`. Returns the number of images extracted.
    """
    try:
        document = Document()
        document.LoadFromFile(doc_path)
    except Exception as e:
        print(f"[ERROR] Could not load '{doc_path}': {e}")
        return 0

    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    extracted_images_count = 0
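    # Seed the queue with the body objects of every section, not just the first.
    # Index-based access (Count / get_Item) mirrors the first script in this article.
    queue = []
    for s in range(document.Sections.Count):
        body_objects = document.Sections.get_Item(s).Body.ChildObjects
        queue.extend(body_objects.get_Item(i) for i in range(body_objects.Count))

    while queue:
        child = queue.pop(0)

        # Enqueue nested children (e.g., table cells, text boxes)
        nested = getattr(child, "ChildObjects", None)
        if nested is not None and nested.Count > 0:
            queue.extend(nested.get_Item(i) for i in range(nested.Count))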

        # Identify picture objects
        if child.DocumentObjectType == DocumentObjectType.Picture:
            picture = child
            img_bytes = picture.ImageBytes

            # Determine a file extension – default to .png if unknown
            ext = ".png"
            if hasattr(picture, "ImageType"):
                ext = f".{picture.ImageType.name.lower()}"

            img_name = f"image_{extracted_images_count + 1}{ext}"
            img_path = os.path.join(output_folder, img_name)

            try:
                with open(img_path, "wb") as f:
                    f.write(img_bytes)
                extracted_images_count += 1
                print(f"  → Saved: {img_name}")
            except Exception as e:
                print(f"[ERROR] Could not save image '{img_name}': {e}")

    document.Close()
    print(f"  Extracted {extracted_images_count} image(s) from '{os.path.basename(doc_path)}'.")
    return extracted_images_count

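How It Works

Step	Description
Load the document	`document.LoadFromFile(doc_path)` reads the Word file.
Queue-based traversal	Ensures nested structures (tables, text boxes, etc.) are visited.
Identify pictures	`child.DocumentObjectType == DocumentObjectType.Picture` marks a picture object.
Get the bytes	`picture.ImageBytes` returns the raw image data.
Save to disk	Writes the bytes to a file; the extension defaults to `.png` unless `picture.ImageType` offers a better hint.
Error handling	try/except blocks catch and report load or save failures.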
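If the extension hint is missing or unreliable, one library-independent fallback is to look at the first few bytes of the image data, since common raster formats start with a fixed signature. The sketch below is a hedged illustration of that idea: `guess_image_extension` is a hypothetical helper (not a Spire.Doc function), it covers only a handful of formats, and anything it does not recognize, such as the EMF/WMF vector images Word can embed, falls back to the default.

```python
# Leading byte signatures of a few common raster formats.
IMAGE_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"BM", ".bmp"),
    (b"II*\x00", ".tif"),  # TIFF, little-endian
    (b"MM\x00*", ".tif"),  # TIFF, big-endian
]


def guess_image_extension(image_bytes, default=".png"):
    """Guess a file extension from the leading bytes of `image_bytes`.

    Accepts bytes, bytearray, or a sequence of ints; returns `default`
    when no known signature matches.
    """
    header = bytes(image_bytes[:16])
    for magic, extension in IMAGE_SIGNATURES:
        if header.startswith(magic):
            return extension
    return default


if __name__ == "__main__":
    print(guess_image_extension(b"\xff\xd8\xff\xe0" + b"\x00" * 8))
    print(guess_image_extension(b"not an image"))
```

In the helper function, a call such as `ext = guess_image_extension(img_bytes)` could stand in for the `ImageType` lookup, assuming `picture.ImageBytes` returns the image data unchanged.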

3. Batch Processing: All `.docx` Files in a Folder

def bulk_extract_images(input_dir: str, output_dir: str) -> None:
    """
    Walks through `input_dir`, extracts images from every .docx file,
    and stores them in sub‑folders under `output_dir`.
    """
    total_images_extracted = 0

    # Gather all .docx files
    doc_files = [f for f in os.listdir(input_dir) if f.lower().endswith(".docx")]

    if not doc_files:
        print(f"No .docx files found in '{input_dir}'.")
        return

    print(f"\nFound {len(doc_files)} Word document(s) to process.\n")

    for doc_file in doc_files:
        full_doc_path = os.path.join(input_dir, doc_file)

        # Create a folder named after the document (without its extension)
        doc_name = os.path.splitext(doc_file)[0]
        doc_output_folder = os.path.join(output_dir, doc_name)
        os.makedirs(doc_output_folder, exist_ok=True)

        print(f"Processing '{doc_file}' …")
        images_count = extract_images_from_doc(full_doc_path, doc_output_folder)
        total_images_extracted += images_count

    print(f"\n✅ Bulk extraction complete – total images extracted: {total_images_extracted}")

4. Main Execution Block

if __name__ == "__main__":
    # -----------------------------------------------------------------
    # Create the input folder if it does not exist (for demo purposes)
    # -----------------------------------------------------------------
    if not os.path.isdir(INPUT_DIR):
        os.makedirs(INPUT_DIR, exist_ok=True)
        print(f"Created '{INPUT_DIR}'. Please add .docx files here before re‑running.")
        sys.exit(0)

    # Run the bulk extractor
    bulk_extract_images(INPUT_DIR, OUTPUT_DIR)
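The main block above reads its folders from the module-level constants. If you would rather choose them per run, one possible variation is a small command-line interface built on the standard `argparse` module. This is only a sketch: the flag names `--input` and `--output` and the helper `parse_args` are assumptions made for illustration, with defaults copied from `INPUT_DIR` and `OUTPUT_DIR`.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output folders; `argv=None` reads sys.argv."""
    parser = argparse.ArgumentParser(
        description="Extract images from Word documents in bulk."
    )
    parser.add_argument(
        "--input", default="WordDocs",
        help="folder that contains the .docx files",
    )
    parser.add_argument(
        "--output", default="ExtractedImages",
        help="folder where extracted images are saved",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Demo with an explicit argument list so the sketch runs anywhere;
    # a real script would call parse_args() with no arguments.
    args = parse_args(["--input", "Reports"])
    print(f"Would read from '{args.input}' and write to '{args.output}'.")
```

The parsed values could then be passed on as `bulk_extract_images(args.input, args.output)`.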

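5. Quick Reference: Spire.Doc Python APIs Used

Method / Property	Description
`Document()`	Creates a new Word document object.
`document.LoadFromFile(path)`	Loads a Word file from disk.
`document.Sections[0].Body.ChildObjects`	Top-level child objects (paragraphs, tables, …) in the body of the first section.
`DocumentObjectType.Picture`	Enum value that identifies a picture object.
`picture.ImageBytes`	Raw bytes of the embedded image.
`document.Close()`	Releases the resources held by the document.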

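Next Steps and Enhancements

Fine-grained error handling – catch specific exceptions (FileNotFoundError, OSError, etc.) and log them to a file.
Custom naming scheme – include the source document name, a timestamp, or a hash in the image filename for easier tracing.
Format conversion – use Pillow (pip install pillow) to convert all extracted images to a single format (e.g., JPEG).
Parallel processing – for thousands of documents, wrap extract_images_from_doc in a multiprocessing.Pool to speed things up.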
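As an illustration of the custom naming idea, the sketch below builds a filename from the document name, a sequence number, and a short content hash, so that each saved image can be traced back to its source and identical images are easy to spot. `build_image_name` is a hypothetical helper, and the exact pattern (three-digit index, first eight hex digits of SHA-256) is just one arbitrary choice.

```python
import hashlib


def build_image_name(doc_name, index, image_bytes, extension=".png"):
    """Build a traceable filename: <document>_<index>_<short hash><extension>."""
    # A short SHA-256 prefix identifies the image content.
    digest = hashlib.sha256(image_bytes).hexdigest()[:8]
    # Replace characters that are awkward in filenames with underscores.
    safe_doc = "".join(c if c.isalnum() or c in "-_" else "_" for c in doc_name)
    return f"{safe_doc}_{index:03d}_{digest}{extension}"


if __name__ == "__main__":
    print(build_image_name("Report Q1", 1, b"abc"))
    # → Report_Q1_001_ba7816bf.png
```

Because the hash depends only on the bytes, the same image embedded in two documents gets the same hash suffix, which makes duplicates easy to find later.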

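TL;DR

The script above automates the tedious job of extracting every image from Word files. Drop your .docx files into the WordDocs folder, run the script, and you get a tidy directory structure like this: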

ExtractedImages/
├─ Report_Q1/
│  ├─ image_1.png
│  └─ image_2.jpg
├─ Invoice_2024/
│  ├─ image_1.png
│  └─ image_2.png
└─ …

Feel free to adapt the code to your workflow, add logging, or extend it with other Spire.Doc features. Happy automating!
