Automating Image Extraction from DOCX Files with Python

Published: 1 month ago (December 22, 2025 at 09:54 PM EST)

7 min read

Source: Dev.to

Extracting Images from Word Documents with Python

Working with Word documents often involves more than just text. Images are integral to many reports, presentations, and creative works. Manually extracting these embedded images—especially from numerous documents—can quickly become tedious, time‑consuming, and error‑prone. Imagine opening dozens (or even hundreds) of Word files, right‑clicking each image, and saving it individually. This is inefficient and a significant roadblock to effective data processing and content management.

Fortunately, Python offers an elegant solution. By leveraging a specialized library, we can automate the entire image‑extraction workflow, turning a manual chore into a swift, script‑driven operation. This tutorial walks through a Python script that bulk‑extracts images from Word documents, so the same steps run identically on every file and no image is skipped by hand.

Why a Library Is Needed

Word documents are not simple text files. A modern .docx file is a ZIP archive containing multiple XML files, media files, and other resources that define the document’s structure, content, and styling; the legacy .doc format is a binary container. In both cases, direct text parsing is insufficient for extracting embedded objects like images. A programmatic approach requires a library that can understand and navigate this structure.
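
To make the ZIP structure concrete, here is a minimal sketch that lists the media entries of a .docx using only the standard library. It assumes the usual packaging layout in which embedded images sit under `word/media/`; the helper name `list_embedded_media` is made up for this illustration. It does not handle .doc files, and it says nothing about where in the document each image appears, which is what the library traversal provides.

```python
import zipfile


def list_embedded_media(docx_path):
    """Return the archive paths of files stored under word/media/ in a .docx."""
    with zipfile.ZipFile(docx_path) as archive:
        return [
            name
            for name in archive.namelist()
            # Skip directory entries; keep only real files in the media folder
            if name.startswith("word/media/") and not name.endswith("/")
        ]
```

Calling `list_embedded_media("InputDocuments/report.docx")` on a real file (the path here is only an example) returns names such as `word/media/image1.png`.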

Introducing Spire.Doc for Python

We will use Spire.Doc for Python, a library designed for creating, reading, editing, and converting Word documents within Python applications. Its comprehensive API lets developers access and manipulate various document elements—including paragraphs, tables, shapes, and, crucially, embedded images—making image extraction straightforward.

1. Install the Library

pip install spire.doc

2. Set Up the Project Structure

Create two folders in the same directory as your script:

InputDocuments – place the Word files you want to process here.
ExtractedImages – extracted images will be saved to this folder.

3. Import Modules & Define Directories

import os
import queue  # For traversing document elements
from spire.doc import *
from spire.doc.common import *

# Define input and output directories
INPUT_DIR = "InputDocuments"      # Folder containing your Word documents
OUTPUT_DIR = "ExtractedImages"    # Folder to save extracted images

# Create the output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

print(f"Input directory: {os.path.abspath(INPUT_DIR)}")
print(f"Output directory: {os.path.abspath(OUTPUT_DIR)}")

4. Core Extraction Logic

def extract_images_from_doc(doc_path, output_folder):
    """
    Extracts images from a single Word document and saves them to a specified folder.
    """
    document = Document()
    try:
        document.LoadFromFile(doc_path)
    except Exception as e:
        print(f"Error loading document {doc_path}: {e}")
        return

    extracted_images_count = 0
    nodes = queue.Queue()
    nodes.put(document)

    while not nodes.empty():
        node = nodes.get()
        # Iterate through all child objects of the current node
        for i in range(node.ChildObjects.Count):
            child = node.ChildObjects.get_Item(i)

            # -------------------------------------------------
            # Image extraction
            # -------------------------------------------------
            if child.DocumentObjectType == DocumentObjectType.Picture:
                picture = child if isinstance(child, DocPicture) else None
                if picture is not None:
                    image_bytes = picture.ImageBytes

                    # Construct a unique filename. The .png extension is only a default:
                    # the bytes are written unchanged, whatever their original format.
                    image_filename = f"image_{extracted_images_count + 1}.png"
                    image_filepath = os.path.join(output_folder, image_filename)

                    try:
                        with open(image_filepath, "wb") as img_file:
                            img_file.write(image_bytes)
                        extracted_images_count += 1
                        print(f"  Extracted: {image_filepath}")
                    except Exception as e:
                        print(f"    Error saving image to {image_filepath}: {e}")

            # -------------------------------------------------
            # Continue traversal for composite objects
            # -------------------------------------------------
            elif isinstance(child, ICompositeObject):
                nodes.put(child)

    document.Close()
    print(f"Finished processing '{os.path.basename(doc_path)}' – "
          f"{extracted_images_count} image(s) extracted.")

5. Batch Processing Multiple Documents

def batch_extract_images(input_dir, output_dir):
    """
    Processes all .docx and .doc files in the input directory,
    extracting the images from each into its own subfolder of the output directory.
    """
    for filename in os.listdir(input_dir):
        if filename.lower().endswith(('.docx', '.doc')):
            doc_path = os.path.join(input_dir, filename)
            # Create a subfolder named after the document (without extension)
            doc_output_folder = os.path.join(output_dir, os.path.splitext(filename)[0])
            os.makedirs(doc_output_folder, exist_ok=True)

            print(f"\nProcessing: {doc_path}")
            extract_images_from_doc(doc_path, doc_output_folder)

if __name__ == "__main__":
    batch_extract_images(INPUT_DIR, OUTPUT_DIR)
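
One practical wrinkle with a plain directory scan: while a document is open, Word leaves a hidden lock file next to it whose name starts with `~$` and keeps the original extension, so the `endswith` filter picks it up and the load will most likely fail. A minimal sketch of a stricter discovery step, using only the standard library; the helper name `find_word_files` is hypothetical and not part of the script above:

```python
from pathlib import Path

WORD_EXTENSIONS = {".docx", ".doc"}


def find_word_files(input_dir):
    """Return sorted paths of Word files in input_dir, skipping ~$ lock files."""
    return sorted(
        path
        for path in Path(input_dir).iterdir()
        if path.is_file()
        and path.suffix.lower() in WORD_EXTENSIONS
        and not path.name.startswith("~$")
    )
```

The returned paths could replace the `os.listdir` loop; sorting them also makes the processing order predictable from run to run.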

Running the script will:

Scan InputDocuments for .docx/.doc files.
Create a dedicated subfolder for each document inside ExtractedImages.
Extract every embedded image and save it with a sequential name (image_1.png, image_2.png, …).
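
Because every file is named with a `.png` extension, a JPEG or GIF embedded in the document ends up with a misleading name. Below is a small sketch of one way to choose the extension from the leading bytes of the data instead. `guess_image_extension` is a hypothetical helper, the signature table covers only a few common raster formats, and anything unrecognized falls back to `.png` as the script does.

```python
# Leading byte signatures for a few common raster formats.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"BM", ".bmp"),
    (b"II*\x00", ".tif"),   # little-endian TIFF
    (b"MM\x00*", ".tif"),   # big-endian TIFF
]


def guess_image_extension(image_bytes, default=".png"):
    """Guess a file extension from the first bytes of the image data."""
    header = bytes(image_bytes[:16])
    for signature, extension in _SIGNATURES:
        if header.startswith(signature):
            return extension
    return default
```

In the extraction function, the filename line could then become `f"image_{extracted_images_count + 1}{guess_image_extension(image_bytes)}"`.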

6. Summary

Problem: Manual image extraction from many Word files is inefficient.
Solution: Use Spire.Doc for Python to programmatically traverse document structures and save embedded images.
Result: A reusable script that bulk‑extracts images, saving time and reducing errors.

Feel free to adapt the script (e.g., change the naming convention, support additional formats, or integrate with other workflows). Happy coding!

Extracting Images from Word Documents with Spire.Doc for Python

Below is a complete, ready‑to‑run script that:

Scans a folder for all .docx files.
Creates a dedicated sub‑folder for each document inside an ExtractedImages directory.
Extracts every embedded picture (including those inside tables, text boxes, etc.) and saves it to the appropriate folder.

1. Imports & Global Paths

import os
import sys
from spire.doc import Document, DocumentObjectType

# ── USER‑CONFIGURABLE PATHS ────────────────────────────────────────
INPUT_DIR  = "WordDocs"          # Folder that contains the .docx files
OUTPUT_DIR = "ExtractedImages"   # Where extracted images will be saved
# ───────────────────────────────────────────────────────────────────────

2. Helper: Extract Images from a Single Document

def extract_images_from_doc(doc_path: str, output_folder: str) -> int:
    """
    Extracts all images from a single Word document and saves them to
    `output_folder`. Returns the number of images extracted.
    """
    try:
        document = Document()
        document.LoadFromFile(doc_path)
    except Exception as e:
        print(f"[ERROR] Could not load '{doc_path}': {e}")
        return 0

    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

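    extracted_images_count = 0

    # Seed the traversal with the body of every section, not only the first.
    # Collections are read through Count/get_Item, the same accessors the
    # earlier listing uses.
    queue = []
    for s in range(document.Sections.Count):
        body_children = document.Sections[s].Body.ChildObjects
        queue.extend(body_children.get_Item(i) for i in range(body_children.Count))

    while queue:
        child = queue.pop(0)

        # Enqueue nested children (e.g., table cells, text boxes)
        if hasattr(child, "ChildObjects"):
            nested = child.ChildObjects
            queue.extend(nested.get_Item(i) for i in range(nested.Count))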

        # Identify picture objects
        if child.DocumentObjectType == DocumentObjectType.Picture:
            picture = child
            img_bytes = picture.ImageBytes

            # Determine a file extension – default to .png if unknown
            ext = ".png"
            if hasattr(picture, "ImageType"):
                ext = f".{picture.ImageType.name.lower()}"

            img_name = f"image_{extracted_images_count + 1}{ext}"
            img_path = os.path.join(output_folder, img_name)

            try:
                with open(img_path, "wb") as f:
                    f.write(img_bytes)
                extracted_images_count += 1
                print(f"  → Saved: {img_name}")
            except Exception as e:
                print(f"[ERROR] Could not save image '{img_name}': {e}")

    document.Close()
    print(f"  Extracted {extracted_images_count} image(s).")
    return extracted_images_count

How it works

Step	Explanation
Load document	`document.LoadFromFile(doc_path)` reads the Word file.
Queue‑based traversal	Guarantees that nested structures (tables, text boxes, etc.) are visited.
Identify images	`child.DocumentObjectType == DocumentObjectType.Picture` flags picture objects.
Retrieve bytes	`picture.ImageBytes` returns the raw image data.
Save to disk	Bytes are written to a file; default extension is `.png` unless `picture.ImageType` provides a better hint.
Error handling	Try/except blocks catch loading or saving failures and report them.
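
To see why the queue reaches pictures nested inside tables or text boxes, here is a toy sketch of the same breadth-first pattern on plain Python objects. The `Node` class is a stand-in invented for illustration, not a Spire.Doc type. `collections.deque` is used because `popleft()` runs in constant time, whereas `list.pop(0)` shifts every remaining element.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a document object with nested children."""
    kind: str
    children: list = field(default_factory=list)


def find_pictures(root):
    """Breadth-first walk that collects every node whose kind is 'picture'."""
    found = []
    pending = deque([root])
    while pending:
        node = pending.popleft()
        if node.kind == "picture":
            found.append(node)
        pending.extend(node.children)  # nested containers are visited later
    return found


# A body with one inline picture and one picture buried inside a table cell.
body = Node("body", [
    Node("paragraph", [Node("picture")]),
    Node("table", [Node("row", [Node("cell", [Node("paragraph", [Node("picture")])])])]),
])
print(len(find_pictures(body)))  # prints 2
```

Both pictures are found even though the second sits four levels below the table, because every container that is dequeued adds its own children to the end of the queue.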

3. Bulk Processing: All `.docx` Files in a Folder

def bulk_extract_images(input_dir: str, output_dir: str) -> None:
    """
    Walks through `input_dir`, extracts images from every .docx file,
    and stores them in sub‑folders under `output_dir`.
    """
    total_images_extracted = 0

    # Gather all .docx files
    doc_files = [f for f in os.listdir(input_dir) if f.lower().endswith(".docx")]

    if not doc_files:
        print(f"No .docx files found in '{input_dir}'.")
        return

    print(f"\nFound {len(doc_files)} Word document(s) to process.\n")

    for doc_file in doc_files:
        full_doc_path = os.path.join(input_dir, doc_file)

        # Create a folder named after the document (without extension)
        doc_name = os.path.splitext(doc_file)[0]
        doc_output_folder = os.path.join(output_dir, doc_name)
        os.makedirs(doc_output_folder, exist_ok=True)

        print(f"Processing '{doc_file}' …")
        images_count = extract_images_from_doc(full_doc_path, doc_output_folder)
        total_images_extracted += images_count

    print(f"\n✅ Bulk extraction complete – total images extracted: {total_images_extracted}")

4. Main Execution Block

if __name__ == "__main__":
    # -----------------------------------------------------------------
    # Create the input folder if it does not exist (for demo purposes)
    # -----------------------------------------------------------------
    if not os.path.isdir(INPUT_DIR):
        os.makedirs(INPUT_DIR, exist_ok=True)
        print(f"Created '{INPUT_DIR}'. Please add .docx files here before re‑running.")
        sys.exit(0)

    # Run the bulk extractor
    bulk_extract_images(INPUT_DIR, OUTPUT_DIR)

5. Quick Reference: Spire.Doc for Python API Used

Method / Property	Description
`Document()`	Creates a new Word document object.
`document.LoadFromFile(path)`	Loads a Word file from disk.
`document.Sections[i].Body.ChildObjects`	Top‑level child objects (paragraphs, tables, …) of the body of section `i`; `Sections[0]` is the first section.
`DocumentObjectType.Picture`	Enum value that identifies picture objects.
`picture.ImageBytes`	Raw byte array of the embedded image.
`document.Close()`	Releases resources held by the document.

Next Steps & Enhancements

Fine‑grained error handling – Catch specific exceptions (FileNotFoundError, PermissionError, or the broader OSError; in Python 3, IOError is only an alias of OSError) and log them to a file.
Custom naming scheme – Include the original document name, a timestamp, or a hash in the image filename for easier traceability.
Format conversion – Use Pillow (pip install pillow) to convert all extracted images to a uniform format (e.g., JPEG).
Parallel processing – For thousands of documents, wrap extract_images_from_doc in a multiprocessing.Pool to speed up execution.
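
As one illustration of the custom naming idea, the sketch below builds a filename from the document name, a running index, and a short content hash. The function name `build_image_name` and the exact name layout are assumptions made for this example; adjust them to your own convention.

```python
import hashlib
import os


def build_image_name(doc_path, index, image_bytes, extension=".png"):
    """Return a name of the form '<document>_<index>_<hash><extension>'."""
    doc_stem = os.path.splitext(os.path.basename(doc_path))[0]
    # The first 8 hex digits of SHA-256 are enough to tell images apart at a glance
    digest = hashlib.sha256(bytes(image_bytes)).hexdigest()[:8]
    return f"{doc_stem}_{index:03d}_{digest}{extension}"
```

Because the hash depends only on the image bytes, the same logo embedded in several documents gets the same suffix, which makes duplicates easy to spot after a bulk run.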

TL;DR

The script above automates the tedious task of pulling every picture out of Word files. By simply dropping .docx files into the WordDocs folder and running the script, you’ll end up with a clean directory tree like:

ExtractedImages/
├─ Report_Q1/
│  ├─ image_1.png
│  └─ image_2.jpg
├─ Invoice_2024/
│  ├─ image_1.png
│  └─ image_2.png
└─ …

Feel free to adapt the code to your workflow, add logging, or extend it with additional Spire.Doc features. Happy automating!
