⚡️ Supercharge Your Document Workflows: Docling Now Unleashes the Power of NVIDIA RTX!
Source: Dev.to
What Is NVIDIA RTX?
NVIDIA RTX is a professional visual‑computing platform that revolutionized digital rendering by introducing dedicated hardware for real‑time ray tracing and artificial intelligence. Built on modern architectures such as Blackwell, Ada Lovelace, and Ampere, RTX GPUs feature:
- RT Cores – Simulate the physical behavior of light (ray bounce, reflection, shadows).
- Tensor Cores – Accelerate AI tasks (e.g., DLSS for frame‑rate boosting).
Beyond cinematic gaming, RTX provides a massive performance leap for creators and researchers, enabling neural rendering and high‑throughput data processing that can be up to six times faster than traditional CPU‑based workflows.
Why Use RTX with Docling?
By shifting the heavy lifting from your CPU to an NVIDIA RTX GPU, you can experience up to 6× speed‑up in processing times. This isn’t just a minor tweak—it’s a performance leap that transforms how you handle:
| Use‑Case | Benefit |
|---|---|
| Large Batches | Process thousands of pages in a fraction of the time. |
| High‑Throughput Workflows | Keep production pipelines moving at lightning speed. |
| Advanced Models | Experiment with complex document‑understanding models without lag. |
Docling is designed to be plug‑and‑play. Once you have the NVIDIA drivers, CUDA Toolkit, and cuDNN installed, Docling will automatically detect and use your RTX GPU.
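If you prefer to pin the device explicitly instead of relying on auto‑detection, Docling exposes the choice through its accelerator options; a minimal sketch:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.pipeline_options import PdfPipelineOptions

# AUTO picks the RTX GPU when CUDA is available and falls back to CPU otherwise;
# use AcceleratorDevice.CUDA to force the GPU.
pipeline_options = PdfPipelineOptions(
    accelerator_options=AcceleratorOptions(device=AcceleratorDevice.AUTO)
)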
Quick Setup
1. Verify Your Hardware
nvidia-smi
Make sure the driver version shown matches the CUDA version you plan to install.
2. Install PyTorch with CUDA Support
Replace the URL with the one that matches your CUDA toolkit version.
For CUDA 12.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
For CUDA 13.0
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
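After installation, a quick sanity check confirms that the wheel you pulled actually matches your driver:
import torch

print(torch.__version__)          # e.g. 2.x.x+cu128
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # True once driver and wheel agree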
3. Run Docling
from docling.document_converter import DocumentConverter
converter = DocumentConverter() # Automatically detects GPU!
result = converter.convert("document.pdf")
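From there you can export the result or batch‑convert many files at once; a short sketch (the file names are placeholders):
# Export the converted document to Markdown
print(result.document.export_to_markdown())

# Convert several files in one pass; results are yielded as they complete
for res in converter.convert_all(["report1.pdf", "report2.pdf"]):
    print(res.input.file, res.status)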
GPU‑Specific Batch‑Size Recommendations
| RTX Model | VRAM | Suggested OCR / Layout Batch Size |
|---|---|---|
| RTX 5090 | 32 GB | 64 – 128 |
| RTX 4090 | 24 GB | 32 – 64 |
| RTX 5070 | 12 GB | 16 – 32 |
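These recommendations plug straight into the threaded PDF pipeline options; for example, a starting point for a 24 GB card (a sketch, tune the numbers to your workload):
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions

# Starting point for ~24 GB VRAM (e.g. RTX 4090); see the table above
pipeline_options = ThreadedPdfPipelineOptions(
    ocr_batch_size=32,
    layout_batch_size=32,
    table_batch_size=4,  # table-structure models are memory-hungry, keep small
)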
OS‑Specific Guidance
| Feature | Windows 10/11 | Linux (Ubuntu/Debian, etc.) |
|---|---|---|
| Driver install | Manual download from the NVIDIA website. | Use apt/dnf or download from the NVIDIA site. |
| Verification | Run nvidia-smi in PowerShell or CMD. | Run nvidia-smi in a terminal. |
| VLM inference | llama-server (llama.cpp) – recommended. | vLLM – high‑performance recommendation. |
| Max performance | Possible via WSL2 (Windows Subsystem for Linux). | Native performance on Linux. |
Note: The PyTorch installation command is identical on both platforms; just make sure the CUDA toolkit version matches the driver you installed.
Vision‑Language Model (VLM) Inference
Linux (vLLM) – ~4× Faster Than llama-server
vllm serve ibm-granite/granite-docling-258M \
--host 127.0.0.1 \
--port 8000 \
--gpu-memory-utilization 0.9
Windows (llama‑server)
.\llama-server.exe --hf-repo ibm-granite/granite-docling-258M-GGUF -ngl -1 --port 8000
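Both servers speak the OpenAI‑compatible chat API, so you can sanity‑check either one before wiring it into Docling; a minimal sketch using requests:
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "ibm-granite/granite-docling-258M",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])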
💡 Quick Troubleshooting Tip
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
The script above checks whether CUDA is available and prints the name and total VRAM of the detected GPU. Use the VRAM figure to pick a batch size from the recommendations above for optimal throughput.
Automatic Optimisation Script
import torch

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

def get_optimal_settings():
    """Detect the GPU and choose appropriate batch sizes."""
    if not torch.cuda.is_available():
        print("CUDA not found. Falling back to CPU.")
        return None

    # Determine VRAM to pick the best batch size
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Detected GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.2f} GB VRAM)")

    # Tuning logic based on hardware tiers
    if vram_gb > 24:  # e.g., RTX 5090 (32 GB)
        b_size = 128
    elif vram_gb >= 20:  # e.g., RTX 4090 (24 GB)
        b_size = 64
    else:  # e.g., RTX 5070 (12 GB) or lower
        b_size = 16

    return ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
        ocr_batch_size=b_size,
        layout_batch_size=b_size,
        table_batch_size=4,  # Tables are memory-intensive
    )

# Initialise with optimised settings (pipeline options are passed per input format)
pipe_opts = get_optimal_settings()
if pipe_opts is not None:
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=pipe_opts,
            )
        }
    )
else:
    converter = DocumentConverter()  # Default CPU pipeline

# Example conversion
result = converter.convert("document.pdf")
That’s it! With the right NVIDIA RTX GPU and a few simple steps, Docling can process massive document collections at unprecedented speed. 🚀
Full Example (explicit configuration, no auto‑detection)
import logging
import time
from pathlib import Path

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

_log = logging.getLogger(__name__)

def main() -> None:
    # Reduce noise from the library logger, keep our own messages visible
    logging.basicConfig(level=logging.INFO)
    logging.getLogger("docling").setLevel(logging.WARNING)
    _log.setLevel(logging.INFO)

    data_folder = Path(__file__).parent / "../../tests/data"
    # input_doc_path = data_folder / "pdf" / "2305.03393v1.pdf"  # 14 pages
    input_doc_path = data_folder / "pdf" / "redp5110_sampled.pdf"  # 18 pages

    pipeline_options = ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
        ocr_batch_size=4,
        layout_batch_size=64,
        table_batch_size=4,
    )
    pipeline_options.do_ocr = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    # Initialise pipeline
    start_time = time.time()
    doc_converter.initialize_pipeline(InputFormat.PDF)
    init_runtime = time.time() - start_time
    _log.info(f"Pipeline initialized in {init_runtime:.2f} seconds.")

    # Convert document
    start_time = time.time()
    conv_result = doc_converter.convert(input_doc_path)
    pipeline_runtime = time.time() - start_time
    assert conv_result.status == ConversionStatus.SUCCESS

    num_pages = len(conv_result.pages)
    _log.info(f"Document converted in {pipeline_runtime:.2f} seconds.")
    _log.info(f"  {num_pages / pipeline_runtime:.2f} pages/second.")

if __name__ == "__main__":
    main()
Tips for Maximising GPU Utilisation
- Memory Monitoring – Run nvidia-smi -l 1 while the script is executing to watch VRAM usage.
- vLLM on Linux – The vLLM pipeline delivers roughly 4× better performance for Vision‑Language Models (VLMs) on Linux compared with Windows.
- Clear Cache – When processing many large files, call torch.cuda.empty_cache() between conversions to avoid “Out of Memory” errors, as sketched below.
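A minimal sketch of that cache‑clearing pattern (the file paths are placeholders):
import torch
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
for pdf_path in ["batch/doc1.pdf", "batch/doc2.pdf"]:
    result = converter.convert(pdf_path)
    # Release cached VRAM before the next file to avoid OOM on long runs
    if torch.cuda.is_available():
        torch.cuda.empty_cache()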
Why Use a Dedicated vLLM Server?
The RTX 5090’s 32 GB GDDR7 VRAM can be fully exploited only with a server‑side vLLM deployment. This setup can give you up to 4× speed‑up for models such as granite‑docling‑258M.
Launch the vLLM Server (Optimised for 32 GB VRAM)
vllm serve ibm-granite/granite-docling-258M \
--revision untied \
--host 127.0.0.1 \
--port 8000 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 1024 \
--max-num-batched-tokens 16384 \
--enable-chunked-prefill
Explanation of flags
| Flag | Reason |
|---|---|
| --revision untied | Required for compatibility between current vLLM versions and the granite‑docling architecture. |
| --gpu-memory-utilization 0.9 | Allocates 90 % of the 32 GB VRAM to the model weights and KV cache. |
| --max-num-seqs 1024 | Leverages the RTX 5090’s massive core count for highly parallel sequence processing. |
| --max-num-batched-tokens 16384 | Enables large‑batch inference without crashing. |
| --enable-chunked-prefill | Splits the prefill phase (reading document pages) into chunks that can be batched with decode steps, improving throughput. |
Tip: If you encounter OOM errors with very complex documents, lower --gpu-memory-utilization to 0.8.
Connect Docling to the vLLM Server
Docling’s VLM pipeline can call any OpenAI‑compatible endpoint; the snippet below uses Docling’s remote‑VLM options (field names follow the ApiVlmOptions API, which may vary slightly between Docling versions):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# 1. Configure the VLM to point at your local vLLM server
pipeline_options = VlmPipelineOptions(enable_remote_services=True)
pipeline_options.vlm_options = ApiVlmOptions(
    url="http://127.0.0.1:8000/v1/chat/completions",
    params={"model": "ibm-granite/granite-docling-258M"},
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
)

# 2. Initialise the converter with the server‑backed VLM pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# 3. Run high‑speed conversion
result = converter.convert("massive_report.pdf")
print(result.document.export_to_markdown())
Key Benefits
- Massive Batching – vLLM’s PagedAttention lets the RTX 5090 handle far larger page batches than standard inference.
- GDDR7 Speed – Higher memory bandwidth accelerates the prefilling stage (reading each page).
- Blackwell Architecture – Takes advantage of CUDA 12.8 optimisations specific to the 50‑series GPUs, avoiding legacy‑mode penalties.
Further Resources
- Original launch‑command guide – Link
- Docling Documentation – Link
- Docling Project Repository – Link
- GPU Support Overview – Link
- GPU Performance Examples – Link
Ready to level up? Check the Docling GPU Support Guide for more examples and troubleshooting tips.
Useful Links
- [NVIDIA Driver Downloads]()
- [NVIDIA CUDA Downloads]()
- [NVIDIA cuDNN Installation]()
- [Python Compatibility Matrix (PyTorch)]()
- [Llama.cpp Repository]()