⚡️ Supercharge Your Document Workflows: Docling Now Unleashes the Power of NVIDIA RTX!
Source: Dev.to
What Is NVIDIA RTX?
NVIDIA RTX is a professional visual‑computing platform that revolutionized digital rendering by introducing dedicated hardware for real‑time ray tracing and artificial intelligence. Built on modern architectures such as Blackwell, Ada Lovelace, and Ampere, RTX GPUs feature:
- RT Cores – Simulate the physical behavior of light (ray bounce, reflection, shadows).
- Tensor Cores – Accelerate AI tasks (e.g., DLSS for frame‑rate boosting).
Beyond cinematic gaming, RTX provides a massive performance leap for creators and researchers, enabling neural rendering and high‑throughput data processing that can be up to six times faster than traditional CPU‑based workflows.
Why Use RTX with Docling?
By shifting the heavy lifting from your CPU to an NVIDIA RTX GPU, you can experience up to 6× speed‑up in processing times. This isn’t just a minor tweak—it’s a performance leap that transforms how you handle:
| Use‑Case | Benefit |
|---|---|
| Large Batches | Process thousands of pages in a fraction of the time. |
| High‑Throughput Workflows | Keep production pipelines moving at lightning speed. |
| Advanced Models | Experiment with complex document‑understanding models without lag. |
Docling is designed to be plug‑and‑play. Once you have the NVIDIA drivers, CUDA Toolkit, and cuDNN installed, Docling will automatically detect and use your RTX GPU.
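If you prefer to pin the device explicitly instead of relying on auto‑detection, Docling exposes the choice through its accelerator options; a minimal sketch:
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.pipeline_options import PdfPipelineOptions

# AUTO picks the RTX GPU when CUDA is available and falls back to CPU otherwise;
# use AcceleratorDevice.CUDA to force the GPU.
pipeline_options = PdfPipelineOptions(
    accelerator_options=AcceleratorOptions(device=AcceleratorDevice.AUTO)
)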
Quick Setup
1. Verify Your Hardware
nvidia-smi
Make sure the driver version shown matches the CUDA version you plan to install.
2. Install PyTorch with CUDA Support
Replace the URL with the one that matches your CUDA toolkit version.
For CUDA 12.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
For CUDA 13.0
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
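After installation, a quick sanity check confirms that the wheel you pulled actually matches your driver:
import torch

print(torch.__version__)          # e.g. 2.x.x+cu128
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # True once driver and wheel agree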
3. Run Docling
from docling.document_converter import DocumentConverter
converter = DocumentConverter() # Automatically detects GPU!
result = converter.convert("document.pdf")
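From there you can export the result or batch‑convert many files at once; a short sketch (the file names are placeholders):
# Export the converted document to Markdown
print(result.document.export_to_markdown())

# Convert several files in one pass; results are yielded as they complete
for res in converter.convert_all(["report1.pdf", "report2.pdf"]):
    print(res.input.file, res.status)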
GPU‑Specific Batch‑Size Recommendations
| RTX Model | VRAM | Suggested OCR / Layout Batch Size |
|---|---|---|
| RTX 5090 | 32 GB | 64 – 128 |
| RTX 4090 | 24 GB | 32 – 64 |
| RTX 5070 | 12 GB | 16 – 32 |
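These recommendations plug straight into the threaded PDF pipeline options; for example, a starting point for a 24 GB card (a sketch, tune the numbers to your workload):
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions

# Starting point for ~24 GB VRAM (e.g. RTX 4090); see the table above
pipeline_options = ThreadedPdfPipelineOptions(
    ocr_batch_size=32,
    layout_batch_size=32,
    table_batch_size=4,  # table-structure models are memory-hungry, keep small
)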
OS‑Specific Guidance
| Feature | Windows 10/11 | Linux (Ubuntu/Debian, etc.) |
|---|---|---|
| Driver install | Manual download from the NVIDIA website. | Use apt/dnf or download from the NVIDIA site. |
| Verification | Run nvidia-smi in PowerShell or CMD. | Run nvidia-smi in a terminal. |
| VLM inference | llama-server (llama.cpp) – recommended. | vLLM – high‑performance recommendation. |
| Max performance | Possible via WSL2 (Windows Subsystem for Linux). | Native performance on Linux. |
Note: The PyTorch installation command is identical on both platforms; just make sure the CUDA toolkit version matches the driver you installed.
Vision‑Language Model (VLM) Inference
Linux (vLLM) – ~4× Faster Than llama-server
vllm serve ibm-granite/granite-docling-258M \
--host 127.0.0.1 \
--port 8000 \
--gpu-memory-utilization 0.9
Windows (llama‑server)
.\llama-server.exe --hf-repo ibm-granite/granite-docling-258M-GGUF -ngl -1 --port 8000
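Both servers speak the OpenAI‑compatible chat API, so you can sanity‑check either one before wiring it into Docling; a minimal sketch using requests:
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "ibm-granite/granite-docling-258M",
        "messages": [{"role": "user", "content": "ping"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])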
💡 Quick Troubleshooting Tip
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
The script above checks whether CUDA is available and prints the name and total VRAM of the detected GPU. Use the VRAM figure to pick a batch size from the recommendations above for optimal throughput.
Automatic Optimisation Script
import torch

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

def get_optimal_settings():
    """Detect the GPU and choose appropriate batch sizes."""
    if not torch.cuda.is_available():
        print("CUDA not found. Falling back to CPU.")
        return None

    # Determine VRAM to pick the best batch size
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Detected GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.2f} GB VRAM)")

    # Tuning logic based on hardware tiers
    if vram_gb > 24:  # e.g., RTX 5090 (32 GB)
        b_size = 128
    elif vram_gb >= 20:  # e.g., RTX 4090 (24 GB)
        b_size = 64
    else:  # e.g., RTX 5070 (12 GB) or lower
        b_size = 16

    return ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
        ocr_batch_size=b_size,
        layout_batch_size=b_size,
        table_batch_size=4,  # Tables are memory-intensive
    )

# Initialise with optimised settings (pipeline options are passed per input format)
pipe_opts = get_optimal_settings()
if pipe_opts is not None:
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=pipe_opts,
            )
        }
    )
else:
    converter = DocumentConverter()  # Default CPU pipeline

# Example conversion
result = converter.convert("document.pdf")
That’s it! With the right NVIDIA RTX GPU and a few simple steps, Docling can process massive document collections at unprecedented speed. 🚀
Full Example (explicit configuration, no auto‑detection)
import logging
import time
from pathlib import Path

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

_log = logging.getLogger(__name__)

def main() -> None:
    # Reduce noise from the library logger, keep our own messages visible
    logging.basicConfig(level=logging.INFO)
    logging.getLogger("docling").setLevel(logging.WARNING)
    _log.setLevel(logging.INFO)

    data_folder = Path(__file__).parent / "../../tests/data"
    # input_doc_path = data_folder / "pdf" / "2305.03393v1.pdf"  # 14 pages
    input_doc_path = data_folder / "pdf" / "redp5110_sampled.pdf"  # 18 pages

    pipeline_options = ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
        ocr_batch_size=4,
        layout_batch_size=64,
        table_batch_size=4,
    )
    pipeline_options.do_ocr = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    # Initialise pipeline
    start_time = time.time()
    doc_converter.initialize_pipeline(InputFormat.PDF)
    init_runtime = time.time() - start_time
    _log.info(f"Pipeline initialized in {init_runtime:.2f} seconds.")

    # Convert document
    start_time = time.time()
    conv_result = doc_converter.convert(input_doc_path)
    pipeline_runtime = time.time() - start_time
    assert conv_result.status == ConversionStatus.SUCCESS

    num_pages = len(conv_result.pages)
    _log.info(f"Document converted in {pipeline_runtime:.2f} seconds.")
    _log.info(f"  {num_pages / pipeline_runtime:.2f} pages/second.")

if __name__ == "__main__":
    main()
Tips for Maximising GPU Utilisation
- Memory Monitoring – Run nvidia-smi -l 1 while the script is executing to watch VRAM usage.
- vLLM on Linux – The vLLM pipeline delivers roughly 4× better performance for Vision‑Language Models (VLMs) on Linux compared with Windows.
- Clear Cache – When processing many large files, call torch.cuda.empty_cache() between conversions to avoid “Out of Memory” errors, as sketched below.
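A minimal sketch of that cache‑clearing pattern (the file paths are placeholders):
import torch
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
for pdf_path in ["batch/doc1.pdf", "batch/doc2.pdf"]:
    result = converter.convert(pdf_path)
    # Release cached VRAM before the next file to avoid OOM on long runs
    if torch.cuda.is_available():
        torch.cuda.empty_cache()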
Why Use a Dedicated vLLM Server?
The RTX 5090’s 32 GB GDDR7 VRAM can be fully exploited only with a server‑side vLLM deployment. This setup can give you up to 4× speed‑up for models such as granite‑docling‑258M.
Launch the vLLM Server (Optimised for 32 GB VRAM)
vllm serve ibm-granite/granite-docling-258M \
--revision untied \
--host 127.0.0.1 \
--port 8000 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 1024 \
--max-num-batched-tokens 16384 \
--enable-chunked-prefill
Explanation of flags
| Flag | Reason |
|---|---|
| --revision untied | Required for compatibility between current vLLM versions and the granite‑docling architecture. |
| --gpu-memory-utilization 0.9 | Allocates 90 % of the 32 GB VRAM to the model weights and KV cache. |
| --max-num-seqs 1024 | Leverages the RTX 5090’s massive core count for highly parallel sequence processing. |
| --max-num-batched-tokens 16384 | Enables large‑batch inference without crashing. |
| --enable-chunked-prefill | Splits the prefill phase (reading document pages) into chunks that can be batched with decode steps, improving throughput. |
Tip: If you encounter OOM errors with very complex documents, lower --gpu-memory-utilization to 0.8.
Connect Docling to the vLLM Server
Docling’s VLM pipeline can call any OpenAI‑compatible endpoint; the snippet below uses Docling’s remote‑VLM options (field names follow the ApiVlmOptions API, which may vary slightly between Docling versions):
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# 1. Configure the VLM to point at your local vLLM server
pipeline_options = VlmPipelineOptions(enable_remote_services=True)
pipeline_options.vlm_options = ApiVlmOptions(
    url="http://127.0.0.1:8000/v1/chat/completions",
    params={"model": "ibm-granite/granite-docling-258M"},
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
)

# 2. Initialise the converter with the server‑backed VLM pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# 3. Run high‑speed conversion
result = converter.convert("massive_report.pdf")
print(result.document.export_to_markdown())
Key Benefits
- Massive Batching – vLLM’s PagedAttention lets the RTX 5090 handle far larger page batches than standard inference.
- GDDR7 Speed – Higher memory bandwidth accelerates the prefilling stage (reading each page).
- Blackwell Architecture – Takes advantage of CUDA 12.8 optimisations specific to the 50‑series GPUs, avoiding legacy‑mode penalties.
Further Resources
- Original launch‑command guide – Link
- Docling Documentation – Link
- Docling Project Repository – Link
- GPU Support Overview – Link
- GPU Performance Examples – Link
Ready to level up? Check the Docling GPU Support Guide for more examples and troubleshooting tips.
Useful Links
- [NVIDIA Driver Downloads]()
- [NVIDIA CUDA Downloads]()
- [NVIDIA cuDNN Installation]()
- [Python Compatibility Matrix (PyTorch)]()
- [Llama.cpp Repository]()