⚡️ Supercharge Your Document Workflows: Docling Now Unleashes the Power of NVIDIA RTX!

Published: January 6, 2026 at 01:52 PM EST
6 min read
Source: Dev.to

What Is NVIDIA RTX?

NVIDIA RTX (Ray Tracing Texel eXtreme) is a professional visual‑computing platform that revolutionized digital rendering by introducing specialized hardware for real‑time ray tracing and artificial intelligence. Built on modern architectures such as Blackwell, Ada Lovelace, and Ampere, RTX GPUs feature:

  • RT Cores – Simulate the physical behavior of light (ray bounce, reflection, shadows).
  • Tensor Cores – Accelerate AI tasks (e.g., DLSS for frame‑rate boosting).

Beyond cinematic gaming, RTX provides a massive performance leap for creators and researchers, enabling neural rendering and high‑throughput data processing that can be up to six times faster than traditional CPU‑based workflows.

Why Use RTX with Docling?

By shifting the heavy lifting from your CPU to an NVIDIA RTX GPU, you can see up to a 6× speed‑up in processing times. This isn’t just a minor tweak; it’s a performance leap that transforms how you handle:

  • Large Batches – Process thousands of pages in a fraction of the time.
  • High‑Throughput Workflows – Keep production pipelines moving at lightning speed.
  • Advanced Models – Experiment with complex document‑understanding models without lag.

Docling is designed to be plug‑and‑play. Once you have the NVIDIA drivers, CUDA Toolkit, and cuDNN installed, Docling will automatically detect and use your RTX GPU.
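
If you prefer not to rely on auto‑detection, you can also pin the device explicitly. Below is a minimal sketch using the same AcceleratorOptions class that appears in the full examples later in this post:

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.pipeline_options import PdfPipelineOptions

# AcceleratorDevice.AUTO (the default) lets Docling pick the GPU on its own;
# requesting CUDA explicitly makes the choice deterministic.
pipeline_options = PdfPipelineOptions(
    accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA)
)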

Quick Setup

1. Verify Your Hardware

nvidia-smi

Make sure the driver version shown matches the CUDA version you plan to install.

2. Install PyTorch with CUDA Support

Replace the URL with the one that matches your CUDA toolkit version.

For CUDA 12.8

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

For CUDA 13.0

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
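
After installation, you can confirm that the wheel you received was actually built against CUDA; both attributes below are standard PyTorch:

import torch

print(torch.__version__)    # e.g. 2.x.x+cu128
print(torch.version.cuda)   # e.g. 12.8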

3. Run Docling

from docling.document_converter import DocumentConverter

converter = DocumentConverter()   # Automatically detects GPU!
result = converter.convert("document.pdf")
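
The returned result object holds the parsed document, which you can export straight away, for example:

print(result.document.export_to_markdown())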

GPU‑Specific Batch‑Size Recommendations

| RTX Model | VRAM | Suggested OCR / Layout Batch Size |
| --- | --- | --- |
| RTX 5090 | 32 GB | 64 – 128 |
| RTX 4090 | 24 GB | 32 – 64 |
| RTX 5070 | 12 GB | 16 – 32 |
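
These values map directly onto the batch‑size fields of ThreadedPdfPipelineOptions. Here is a minimal manual sketch for a 24 GB card such as the RTX 4090 (an automatic version follows later in this post):

from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions

# Manual settings for a 24 GB card; scale these down on smaller GPUs
pipeline_options = ThreadedPdfPipelineOptions(
    ocr_batch_size=64,
    layout_batch_size=64,
)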

OS‑Specific Guidance

| Feature | Windows 10/11 | Linux (Ubuntu/Debian, etc.) |
| --- | --- | --- |
| Driver install | Manual download from the NVIDIA website. | Use apt/dnf or download from the NVIDIA site. |
| Verification | Run nvidia-smi in PowerShell or CMD. | Run nvidia-smi in a terminal. |
| VLM inference | llama-server (llama.cpp) – recommended. | vLLM – high‑performance recommendation. |
| Max performance | Possible via WSL2 (Windows Subsystem for Linux). | Native performance on Linux. |

Note: The PyTorch installation command is identical on both platforms; just make sure the CUDA toolkit version matches the driver you installed.

Vision‑Language Model (VLM) Inference

Linux (vLLM) – ~4× Faster Than llama-server

vllm serve ibm-granite/granite-docling-258M \
    --host 127.0.0.1 \
    --port 8000 \
    --gpu-memory-utilization 0.9

Windows (llama‑server)

.\llama-server.exe --hf-repo ibm-granite/granite-docling-258M-GGUF -ngl -1 --port 8000
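
Both servers expose an OpenAI‑compatible API, so before wiring Docling to either one you can sanity‑check that the model is loaded with a plain HTTP request (a small sketch; assumes the requests package is installed):

import requests

# vLLM and llama-server both serve the OpenAI-style /v1/models endpoint
resp = requests.get("http://127.0.0.1:8000/v1/models", timeout=10)
print(resp.json())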

💡 Quick Troubleshooting Tip

import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"VRAM: {vram_gb:.1f} GB")

The script above checks whether CUDA is available and, if so, prints the name and total VRAM of the detected GPU. Use this information to pick batch sizes for optimal throughput.
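
If you want to see free (not just total) VRAM before choosing a batch size, PyTorch exposes a direct query as well:

import torch

free, total = torch.cuda.mem_get_info()  # both values are in bytes
print(f"Free VRAM: {free / 1024**3:.1f} of {total / 1024**3:.1f} GB")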

Automatic Optimisation Script

import torch

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline


def get_optimal_settings():
    """Detect the GPU and choose batch sizes based on available VRAM."""
    if not torch.cuda.is_available():
        print("CUDA not found. Falling back to CPU defaults.")
        return None

    # Determine VRAM to pick the best batch size
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Detected GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.2f} GB VRAM)")

    # Tuning logic based on hardware tiers
    if vram_gb > 24:          # e.g., RTX 5090 (32 GB)
        b_size = 128
    elif vram_gb >= 20:       # e.g., RTX 4090 (24 GB)
        b_size = 64
    else:                     # e.g., RTX 5070 (12 GB) or lower
        b_size = 16

    return ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
        ocr_batch_size=b_size,
        layout_batch_size=b_size,
        table_batch_size=4,   # Tables are memory‑intensive
    )


# Initialise with optimised settings
pipe_opts = get_optimal_settings()

if pipe_opts is not None:
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=pipe_opts,
            )
        }
    )
else:
    converter = DocumentConverter()  # CPU defaults

# Example conversion
result = converter.convert("document.pdf")

That’s it! With the right NVIDIA RTX GPU and a few simple steps, Docling can process massive document collections at unprecedented speed. 🚀

Document Conversion Example

# Reuse the converter configured by the optimisation script above
result = converter.convert("large_document.pdf")
print("Conversion complete!")

Full Example (manual settings, no auto‑detection)

import logging
import time
from pathlib import Path

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

_log = logging.getLogger(__name__)

def main() -> None:
    # Reduce noise from the library logger
    logging.getLogger("docling").setLevel(logging.WARNING)
    _log.setLevel(logging.INFO)

    data_folder = Path(__file__).parent / "../../tests/data"
    # input_doc_path = data_folder / "pdf" / "2305.03393v1.pdf"  # 14 pages
    input_doc_path = data_folder / "pdf" / "redp5110_sampled.pdf"  # 18 pages

    pipeline_options = ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
        ocr_batch_size=4,
        layout_batch_size=64,
        table_batch_size=4,
    )
    pipeline_options.do_ocr = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    # Initialise pipeline
    start_time = time.time()
    doc_converter.initialize_pipeline(InputFormat.PDF)
    init_runtime = time.time() - start_time
    _log.info(f"Pipeline initialized in {init_runtime:.2f} seconds.")

    # Convert document
    start_time = time.time()
    conv_result = doc_converter.convert(input_doc_path)
    pipeline_runtime = time.time() - start_time
    assert conv_result.status == ConversionStatus.SUCCESS

    num_pages = len(conv_result.pages)
    _log.info(f"Document converted in {pipeline_runtime:.2f} seconds.")
    _log.info(f"  {num_pages / pipeline_runtime:.2f} pages/second.")

if __name__ == "__main__":
    main()
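
If you want per‑stage timings rather than a single wall‑clock number, Docling also ships a profiling switch. A sketch building on the variables from the example above (the settings attribute reflects recent Docling versions and may differ in older releases):

from docling.datamodel.settings import settings

# Record detailed timings while converting
settings.debug.profile_pipeline_timings = True

conv_result = doc_converter.convert(input_doc_path)
# timings maps stage names to ProfilingItem objects holding recorded durations
print(conv_result.timings["pipeline_total"].times)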

Tips for Maximising GPU Utilisation

  • Memory Monitoring – Run nvidia-smi -l 1 while the script is executing to watch VRAM usage.
  • vLLM on Linux – The vLLM pipeline delivers roughly 4× better performance for Vision‑Language Models (VLMs) on Linux compared with llama-server on Windows.
  • Clear Cache – When processing many large files, call torch.cuda.empty_cache() between conversions to avoid “Out of Memory” errors (see the sketch below).
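
A minimal sketch of that cache‑clearing pattern (the file names are placeholders):

import torch
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

for path in ["report_1.pdf", "report_2.pdf", "report_3.pdf"]:
    result = converter.convert(path)
    # ... persist or post-process the result here ...
    torch.cuda.empty_cache()  # release cached VRAM blocks between large files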

Why Use a Dedicated vLLM Server?

The RTX 5090’s 32 GB of GDDR7 VRAM can be fully exploited only with a server‑side vLLM deployment. This setup can deliver a substantial further speed‑up for models such as granite‑docling‑258M.

Launch the vLLM Server (Optimised for 32 GB VRAM)

vllm serve ibm-granite/granite-docling-258M \
  --revision untied \
  --host 127.0.0.1 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 1024 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill

Explanation of flags

  • --revision untied – Required for compatibility with current vLLM versions and the granite‑docling architecture.
  • --gpu-memory-utilization 0.9 – Allocates 90 % of the 32 GB VRAM to the model weights and KV cache.
  • --max-num-seqs 1024 – Lets the RTX 5090’s massive core count process many sequences in parallel.
  • --max-num-batched-tokens 16384 – Enables large‑batch inference without crashing.
  • --enable-chunked-prefill – Splits long prompt prefills into chunks so prefill and decoding can be batched together, speeding up page ingestion.

Tip: If you encounter OOM errors with very complex documents, lower --gpu-memory-utilization to 0.8.

Connect Docling to the vLLM Server

# Import paths reflect recent Docling releases
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# 1. Configure the VLM to point to your local vLLM server
pipeline_options = VlmPipelineOptions(enable_remote_services=True)
pipeline_options.vlm_options = ApiVlmOptions(
    url="http://127.0.0.1:8000/v1/chat/completions",
    params=dict(model="ibm-granite/granite-docling-258M"),
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
)

# 2. Set the converter to use the server-based VLM pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# 3. Run high-speed conversion
result = converter.convert("massive_report.pdf")
print(result.document.export_to_markdown())

Key Benefits

  • Massive Batching – vLLM’s PagedAttention lets the RTX 5090 handle far larger page batches than standard inference.
  • GDDR7 Speed – Higher memory bandwidth accelerates the prefilling stage (reading each page).
  • Blackwell Architecture – Takes advantage of CUDA 12.8 optimisations specific to the 50‑series GPUs, avoiding legacy‑mode penalties.

Further Resources

  • Original launch‑command guide
  • Docling Documentation
  • Docling Project Repository
  • GPU Support Overview
  • GPU Performance Examples

Ready to level up? Check the Docling GPU Support Guide for more examples and troubleshooting tips.

  • NVIDIA Driver Downloads
  • NVIDIA CUDA Downloads
  • NVIDIA cuDNN Installation
  • Python Compatibility Matrix (PyTorch)
  • Llama.cpp Repository