Automating Image Extraction from DOCX Files with Python
Source: Dev.to
Extracting Images from Word Documents with Python
Working with Word documents often involves more than just text. Images are integral to many reports, presentations, and creative works. Manually extracting these embedded images—especially from numerous documents—can quickly become tedious, time‑consuming, and error‑prone. Imagine opening dozens (or even hundreds) of Word files, right‑clicking each image, and saving it individually. This is inefficient and a significant roadblock to effective data processing and content management.
Fortunately, Python offers an elegant solution. By leveraging a specialized library, we can automate the entire image‑extraction workflow, turning a manual chore into a swift, script‑driven operation. This tutorial guides you through building a robust Python script to bulk‑extract images from Word documents, boosting productivity and ensuring accuracy.
Why a Library Is Needed
Word documents (especially the modern .docx format) are not simple text files. They are ZIP archives containing multiple XML files, media files, and other resources that define the document’s structure, content, and styling. Direct text parsing is insufficient for extracting embedded objects like images. A programmatic approach requires a library that can understand and navigate this intricate structure.
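To make the ZIP structure concrete, here is a minimal sketch that uses only the standard library to list the media files packed inside a .docx. It is not part of the original tutorial: the helper name `list_docx_media` is hypothetical, and the sketch assumes the usual Office Open XML layout in which embedded images are stored under `word/media/`. It does not apply to legacy .doc files, which are not ZIP archives.

```python
import zipfile


def list_docx_media(docx_source):
    """Return the archive paths of the files stored under word/media/.

    `docx_source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Skip directory entries, keep only real media files
            if name.startswith("word/media/") and not name.endswith("/")
        )
```

Listing the archive this way shows which files are embedded, but it says nothing about where each image sits in the document, which is one reason a document-aware library is still useful.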
Introducing Spire.Doc for Python
We will use Spire.Doc for Python, a library designed for creating, reading, editing, and converting Word documents within Python applications. Its comprehensive API lets developers access and manipulate various document elements—including paragraphs, tables, shapes, and, crucially, embedded images—making image extraction straightforward.
1. Install the Library
pip install spire.doc
2. Set Up the Project Structure
Create two folders in the same directory as your script:
- InputDocuments – place the Word files you want to process here.
- ExtractedImages – extracted images will be saved to this folder.
3. Import Modules & Define Directories
import os
import queue # For traversing document elements
from spire.doc import *
from spire.doc.common import *
# Define input and output directories
INPUT_DIR = "InputDocuments" # Folder containing your Word documents
OUTPUT_DIR = "ExtractedImages" # Folder to save extracted images
# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)
print(f"Input directory: {os.path.abspath(INPUT_DIR)}")
print(f"Output directory: {os.path.abspath(OUTPUT_DIR)}")
4. Core Extraction Logic
def extract_images_from_doc(doc_path, output_folder):
    """
    Extracts images from a single Word document and saves them to a specified folder.
    """
    document = Document()
    try:
        document.LoadFromFile(doc_path)
    except Exception as e:
        print(f"Error loading document {doc_path}: {e}")
        return
    extracted_images_count = 0
    nodes = queue.Queue()
    nodes.put(document)
    while not nodes.empty():
        node = nodes.get()
        # Iterate through all child objects of the current node
        for i in range(node.ChildObjects.Count):
            child = node.ChildObjects.get_Item(i)
            # -------------------------------------------------
            # Image extraction
            # -------------------------------------------------
            if child.DocumentObjectType == DocumentObjectType.Picture:
                picture = child if isinstance(child, DocPicture) else None
                if picture is not None:
                    image_bytes = picture.ImageBytes
                    # Construct a unique filename (default to PNG)
                    image_filename = f"image_{extracted_images_count + 1}.png"
                    image_filepath = os.path.join(output_folder, image_filename)
                    try:
                        with open(image_filepath, "wb") as img_file:
                            img_file.write(image_bytes)
                        extracted_images_count += 1
                        print(f" Extracted: {image_filepath}")
                    except Exception as e:
                        print(f" Error saving image to {image_filepath}: {e}")
            # -------------------------------------------------
            # Continue traversal for composite objects
            # -------------------------------------------------
            elif isinstance(child, ICompositeObject):
                nodes.put(child)
    document.Close()
    print(f"Finished processing '{os.path.basename(doc_path)}' – "
          f"{extracted_images_count} image(s) extracted.")
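The loop above is a breadth-first walk over the document tree. The sketch below shows the same pattern on a plain Python stand-in, so it can be run without Spire.Doc installed. The `Node` class and its `kind` and `children` fields are hypothetical illustrations, not part of the Spire.Doc API.

```python
import queue


class Node:
    """Hypothetical stand-in for a document object (not a Spire.Doc type)."""

    def __init__(self, kind, children=None):
        self.kind = kind
        self.children = children or []


def collect_pictures(root):
    """Breadth-first walk that returns every node whose kind is 'picture'."""
    found = []
    nodes = queue.Queue()
    nodes.put(root)
    while not nodes.empty():
        node = nodes.get()
        for child in node.children:
            if child.kind == "picture":
                found.append(child)
            else:
                # Any other node may hold nested objects, so visit it later
                nodes.put(child)
    return found
```

Because containers are queued rather than skipped, a picture nested inside a paragraph inside a table cell is still reached, which is the property the extraction function relies on.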
5. Batch Processing Multiple Documents
def batch_extract_images(input_dir, output_dir):
    """
    Processes all .docx and .doc files in the input directory,
    extracting images from each into the output directory.
    """
    for filename in os.listdir(input_dir):
        if filename.lower().endswith(('.docx', '.doc')):
            doc_path = os.path.join(input_dir, filename)
            # Create a subfolder for each document's images (optional)
            doc_output_folder = os.path.join(output_dir, os.path.splitext(filename)[0])
            if not os.path.exists(doc_output_folder):
                os.makedirs(doc_output_folder)
            print(f"\nProcessing: {doc_path}")
            extract_images_from_doc(doc_path, doc_output_folder)

if __name__ == "__main__":
    batch_extract_images(INPUT_DIR, OUTPUT_DIR)
Running the script will:
- Scan InputDocuments for .docx/.doc files.
- Create a dedicated subfolder for each document inside ExtractedImages.
- Extract every embedded image and save it with a sequential name (image_1.png, image_2.png, …).
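Since the script writes every image with a .png extension regardless of its real format, one possible refinement is to guess the extension from the leading bytes of the image data. The sketch below is an assumption-laden illustration rather than part of the original script: the function name `guess_image_extension` is hypothetical, and it recognises only a few common signatures (PNG, JPEG, GIF, BMP). Anything else, including vector formats such as EMF or WMF, falls back to the default.

```python
def guess_image_extension(data: bytes, default: str = ".png") -> str:
    """Guess a file extension from the first bytes of an image."""
    signatures = [
        (b"\x89PNG\r\n\x1a\n", ".png"),
        (b"\xff\xd8\xff", ".jpg"),
        (b"GIF87a", ".gif"),
        (b"GIF89a", ".gif"),
        (b"BM", ".bmp"),
    ]
    for magic, ext in signatures:
        if data.startswith(magic):
            return ext
    # Unknown signature: keep the caller's default
    return default
```

In the extraction function, the result could replace the hard-coded suffix, for example `f"image_{n}{guess_image_extension(image_bytes)}"`.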
6. Summary
- Problem: Manual image extraction from many Word files is inefficient.
- Solution: Use Spire.Doc for Python to programmatically traverse document structures and save embedded images.
- Result: A reusable script that bulk‑extracts images, saving time and reducing errors.
Feel free to adapt the script (e.g., change the naming convention, support additional formats, or integrate with other workflows). Happy coding!
Extracting Images from Word Documents with Spire.Doc for Python
Below is a complete, ready‑to‑run script that:
- Scans a folder for all .docx files.
- Creates a dedicated sub‑folder for each document inside an ExtractedImages directory.
- Extracts every embedded picture (including those inside tables, text boxes, etc.) and saves it to the appropriate folder.
1. Imports & Global Paths
import os
import sys
from spire.doc import Document, DocumentObjectType
# ── USER‑CONFIGURABLE PATHS ────────────────────────────────────────
INPUT_DIR = "WordDocs" # Folder that contains the .docx files
OUTPUT_DIR = "ExtractedImages" # Where extracted images will be saved
# ───────────────────────────────────────────────────────────────────────
2. Helper: Extract Images from a Single Document
def extract_images_from_doc(doc_path: str, output_folder: str) -> int:
    """
    Extracts all images from a single Word document and saves them to
    `output_folder`. Returns the number of images extracted.
    """
    try:
        document = Document()
        document.LoadFromFile(doc_path)
    except Exception as e:
        print(f"[ERROR] Could not load '{doc_path}': {e}")
        return 0
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)
    extracted_images_count = 0
    # Start traversal at the body of the first section only;
    # content in any later sections is not visited.
    queue = list(document.Sections[0].Body.ChildObjects)
    while queue:
        child = queue.pop(0)
        # Enqueue nested children (e.g., table cells, text boxes)
        if hasattr(child, "ChildObjects") and child.ChildObjects:
            queue.extend(child.ChildObjects)
        # Identify picture objects
        if child.DocumentObjectType == DocumentObjectType.Picture:
            picture = child
            img_bytes = picture.ImageBytes
            # Determine a file extension – default to .png if unknown
            ext = ".png"
            if hasattr(picture, "ImageType"):
                ext = f".{picture.ImageType.name.lower()}"
            img_name = f"image_{extracted_images_count + 1}{ext}"
            img_path = os.path.join(output_folder, img_name)
            try:
                with open(img_path, "wb") as f:
                    f.write(img_bytes)
                extracted_images_count += 1
                print(f" → Saved: {img_name}")
            except Exception as e:
                print(f"[ERROR] Could not save image '{img_name}': {e}")
    document.Close()
    print(f" Extracted {extracted_images_count} image(s).")
    return extracted_images_count
How it works
| Step | Explanation |
|---|---|
| Load document | document.LoadFromFile(doc_path) reads the Word file. |
| Queue‑based traversal | Guarantees that nested structures (tables, text boxes, etc.) are visited. |
| Identify images | child.DocumentObjectType == DocumentObjectType.Picture flags picture objects. |
| Retrieve bytes | picture.ImageBytes returns the raw image data. |
| Save to disk | Bytes are written to a file; default extension is .png unless picture.ImageType provides a better hint. |
| Error handling | Try/except blocks catch loading or saving failures and report them. |
3. Bulk Processing: All .docx Files in a Folder
def bulk_extract_images(input_dir: str, output_dir: str) -> None:
    """
    Walks through `input_dir`, extracts images from every .docx file,
    and stores them in sub‑folders under `output_dir`.
    """
    total_images_extracted = 0
    # Gather all .docx files
    doc_files = [f for f in os.listdir(input_dir) if f.lower().endswith(".docx")]
    if not doc_files:
        print(f"No .docx files found in '{input_dir}'.")
        return
    print(f"\nFound {len(doc_files)} Word document(s) to process.\n")
    for doc_file in doc_files:
        full_doc_path = os.path.join(input_dir, doc_file)
        # Create a folder named after the document (without extension)
        doc_name = os.path.splitext(doc_file)[0]
        doc_output_folder = os.path.join(output_dir, doc_name)
        os.makedirs(doc_output_folder, exist_ok=True)
        print(f"Processing '{doc_file}' …")
        images_count = extract_images_from_doc(full_doc_path, doc_output_folder)
        total_images_extracted += images_count
    print(f"\n✅ Bulk extraction complete – total images extracted: {total_images_extracted}")
4. Main Execution Block
if __name__ == "__main__":
    # -----------------------------------------------------------------
    # Create the input folder if it does not exist (for demo purposes)
    # -----------------------------------------------------------------
    if not os.path.isdir(INPUT_DIR):
        os.makedirs(INPUT_DIR, exist_ok=True)
        print(f"Created '{INPUT_DIR}'. Please add .docx files here before re‑running.")
        sys.exit(0)
    # Run the bulk extractor
    bulk_extract_images(INPUT_DIR, OUTPUT_DIR)
5. Quick Reference: Spire.Doc for Python API Used
| Method / Property | Description |
|---|---|
| Document() | Creates a new Word document object. |
| document.LoadFromFile(path) | Loads a Word file from disk. |
| document.Sections[0].Body.ChildObjects | Returns the top‑level child objects of the first section's body (paragraphs, tables, …). |
| DocumentObjectType.Picture | Enum value that identifies picture objects. |
| picture.ImageBytes | Raw byte array of the embedded image. |
| document.Close() | Releases resources held by the document. |
Next Steps & Enhancements
- Fine‑grained error handling – Catch specific exceptions (FileNotFoundError, IOError, etc.) and log them to a file.
- Custom naming scheme – Include the original document name, a timestamp, or a hash in the image filename for easier traceability.
- Format conversion – Use Pillow (pip install pillow) to convert all extracted images to a uniform format (e.g., JPEG).
- Parallel processing – For thousands of documents, wrap extract_images_from_doc in a multiprocessing.Pool to speed up execution.
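As an illustration of the custom naming idea, the following sketch builds a filename from the document name, a zero-padded index, and a short content hash. The function name `hashed_image_name`, the 8-character digest length, and the exact name layout are assumptions chosen for the example, not something the original script defines.

```python
import hashlib


def hashed_image_name(doc_name: str, index: int, data: bytes, ext: str = ".png") -> str:
    """Build a name of the form <doc_name>_<index>_<hash><ext>."""
    # A short prefix of the SHA-256 digest is enough to tell images apart at a glance
    digest = hashlib.sha256(data).hexdigest()[:8]
    return f"{doc_name}_{index:03d}_{digest}{ext}"
```

Identical image bytes always produce the same hash part, so duplicates across documents become easy to spot when the output is sorted or searched.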
TL;DR
The script above automates the tedious task of pulling every picture out of Word files. By simply dropping .docx files into the WordDocs folder and running the script, you’ll end up with a clean directory tree like:
ExtractedImages/
├─ Report_Q1/
│ ├─ image_1.png
│ └─ image_2.jpg
├─ Invoice_2024/
│ ├─ image_1.png
│ └─ image_2.png
└─ …
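To check a tree like this after a run, a small standard-library helper can count the files in each sub-folder. This is a hedged sketch, not part of the original script; the name `count_images_per_folder` is hypothetical, and it counts every regular file in a sub-folder without checking that it is actually an image.

```python
import os


def count_images_per_folder(output_dir: str) -> dict:
    """Map each sub-folder of `output_dir` to the number of files it contains."""
    counts = {}
    for entry in sorted(os.listdir(output_dir)):
        folder = os.path.join(output_dir, entry)
        if os.path.isdir(folder):
            counts[entry] = sum(
                1
                for name in os.listdir(folder)
                if os.path.isfile(os.path.join(folder, name))
            )
    return counts
```

Calling `count_images_per_folder("ExtractedImages")` after the script finishes gives a quick per-document tally to compare against the counts printed during extraction.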
Feel free to adapt the code to your workflow, add logging, or extend it with additional Spire.Doc features. Happy automating!