⚡️ Supercharge Your Document Workflows: Docling Now Unleashes the Power of NVIDIA RTX!

Published: January 7, 2026, 02:52 GMT+8

9 min read

Source: Dev.to


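What is NVIDIA RTX?

NVIDIA RTX (Ray Tracing Texel eXtreme) is a professional visual computing platform that transformed digital rendering by introducing dedicated hardware for real-time ray tracing and AI. Built on modern architectures such as Blackwell, Ada Lovelace, and Ampere, RTX GPUs feature:

RT Cores – simulate the physical behavior of light (bounces, reflections, shadows).
Tensor Cores – accelerate AI tasks (for example, DLSS for higher frame rates).

Beyond cinematic gaming, RTX gives creators and researchers a large performance boost, making neural rendering and high-throughput data processing up to six times faster than traditional CPU-based workflows.

Why use RTX with Docling?

By moving heavy computation from the CPU to an NVIDIA RTX GPU, you can speed up processing by up to 6×. This is not a minor tweak; it is a performance leap that changes how you process documents:

Use case	Benefit
Large batches	Process thousands of pages in a fraction of the time.
High-throughput workflows	Keep production pipelines running at full speed.
Advanced models	Experiment with complex document-understanding models without lag.

Docling is designed to be plug-and-play. As long as the NVIDIA driver, CUDA Toolkit, and cuDNN are installed, Docling automatically detects and uses your RTX GPU.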

Quick setup

1. Verify your hardware

nvidia-smi

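Make sure the driver version shown supports the CUDA version you plan to install. The CUDA version printed in the header of the output is the highest version that driver supports.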
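The same check can be scripted. The sketch below is illustrative only: it shells out to `nvidia-smi` with its CSV query flags and returns an empty list when the tool is missing. The helper names `parse_gpu_line` and `list_gpus` are made up for this example and are not part of Docling.

```python
import shutil
import subprocess

# nvidia-smi's CSV query mode: one row per GPU, no header line
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_line(line):
    """Split one CSV row from nvidia-smi into name, driver version, and VRAM."""
    # rsplit keeps the name intact even if it ever contained a comma
    name, driver, memory = [field.strip() for field in line.rsplit(",", 2)]
    return {"name": name, "driver": driver, "memory": memory}


def list_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (subprocess.CalledProcessError, OSError):
        return []
    return [parse_gpu_line(row) for row in out.splitlines() if row.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    if not gpus:
        print("No NVIDIA GPU found (or nvidia-smi is not on PATH).")
    for gpu in gpus:
        print(f"{gpu['name']}: driver {gpu['driver']}, {gpu['memory']}")
```

An empty result usually means the driver is not installed or `nvidia-smi` is not on the PATH.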

2. Install PyTorch with CUDA support

Replace the URL with the one that matches your CUDA Toolkit version.

For CUDA 12.8

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

For CUDA 13.0

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
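The two commands differ only in the suffix of the index URL (`cu128`, `cu130`). As a small sketch, the suffix can be derived from the CUDA version string. The function names are hypothetical, and the naming pattern is inferred from the two URLs above, so it only helps for CUDA versions that PyTorch actually publishes wheels for.

```python
def pytorch_index_url(cuda_version):
    """Map a CUDA version such as "12.8" to the matching PyTorch wheel index URL."""
    major, minor = cuda_version.split(".")[:2]
    # Pattern inferred from the article's URLs: 12.8 -> cu128, 13.0 -> cu130
    return f"https://download.pytorch.org/whl/cu{int(major)}{int(minor)}"


def pip_command(cuda_version):
    """Build the full install command for the given CUDA version."""
    return (
        "pip install torch torchvision torchaudio --index-url "
        + pytorch_index_url(cuda_version)
    )


if __name__ == "__main__":
    print(pip_command("12.8"))
```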

3. Run Docling

from docling.document_converter import DocumentConverter

converter = DocumentConverter()   # Detects the GPU automatically
result = converter.convert("document.pdf")

GPU-specific batch size recommendations

RTX model	VRAM	Recommended OCR / layout batch size
RTX 5090	32 GB	64 – 128
RTX 4090	24 GB	32 – 64
RTX 5070	12 GB	16 – 32
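The table can also be read as a lookup from available VRAM to a batch-size range. The sketch below encodes exactly the three rows above. The 1 GB of slack is an assumption added here, because drivers tend to report slightly less than the nominal capacity, and `recommended_batch_range` is a hypothetical helper name.

```python
# (nominal VRAM in GB, (low, high) batch size), copied from the table above
BATCH_TIERS = [
    (32, (64, 128)),  # RTX 5090
    (24, (32, 64)),   # RTX 4090
    (12, (16, 32)),   # RTX 5070
]

# Assumption: tolerate up to 1 GB below the nominal size
SLACK_GB = 1


def recommended_batch_range(vram_gb):
    """Return the (low, high) range of the largest tier that fits, else None."""
    for nominal_gb, batch_range in BATCH_TIERS:
        if vram_gb >= nominal_gb - SLACK_GB:
            return batch_range
    return None  # below the smallest tier listed in the table


if __name__ == "__main__":
    print(recommended_batch_range(23.6))
```

Starting at the low end of the range and increasing while watching VRAM usage is the safer way to use these numbers.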

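OS-specific guidance

Feature	Windows 10/11	Linux (Ubuntu/Debian, etc.)
Driver installation	Download manually from the NVIDIA website.	Use `apt`/`dnf` or download from the NVIDIA website.
Verification	Run `nvidia-smi` in PowerShell or CMD.	Run `nvidia-smi` in a terminal.
VLM inference	`llama-server` (llama.cpp) – recommended.	`vLLM` – recommended for high performance.
Peak performance	Available through WSL2 (Windows Subsystem for Linux).	Native performance on Linux.

Note: The PyTorch install command is the same on both platforms; just make sure the CUDA Toolkit version matches your installed driver.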
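If one setup script has to serve both platforms, the VLM row of the table can be expressed as a simple platform check. This is only a sketch of that mapping. The function name is hypothetical, and systems the table does not cover return `None` rather than a guess.

```python
import platform


def recommended_vlm_backend(system=None):
    """Return the VLM server the table recommends for the given OS name."""
    system = system or platform.system()
    if system == "Linux":
        return "vllm"
    if system == "Windows":
        return "llama-server"
    return None  # the table above does not cover other systems


if __name__ == "__main__":
    print(f"Recommended VLM backend here: {recommended_vlm_backend()}")
```

Inside WSL2, `platform.system()` reports `Linux`, so this check would pick vLLM there.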

Vision-language model (VLM) inference

Linux (vLLM) – about 4× faster than `llama-server`

vllm serve ibm-granite/granite-docling-258M \
    --host 127.0.0.1 \
    --port 8000 \
    --gpu-memory-utilization 0.9

Windows (llama-server)

.\llama-server.exe --hf-repo ibm-granite/granite-docling-258M-GGUF -ngl -1 --port 8000
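Before pointing Docling at either server, it helps to confirm that the server is actually listening. Both vLLM and llama-server expose an OpenAI-compatible API, so a request to `/v1/models` works as a quick health check. The sketch below uses only the standard library and assumes the host and port from the commands above. The helper names are made up for this example.

```python
import json
import urllib.error
import urllib.request


def models_url(base_url):
    """Build the OpenAI-compatible model listing URL for a server base URL."""
    return base_url.rstrip("/") + "/v1/models"


def list_served_models(base_url="http://127.0.0.1:8000", timeout=2.0):
    """Return the model ids the server reports, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None  # server down, wrong port, or a non-JSON response
    if not isinstance(payload, dict):
        return None
    return [model.get("id") for model in payload.get("data", [])]


if __name__ == "__main__":
    models = list_served_models()
    print("Server not reachable" if models is None else f"Serving: {models}")
```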

💡 Quick troubleshooting tip

import torch

print(f"CUDA available: {torch.cuda.is_available()}")
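# get_device_name raises an error when no CUDA device is present, so guard the call
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")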

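The script above checks whether CUDA is available and prints the name of the detected GPU. Use this information to confirm that PyTorch can see the GPU and its VRAM, then adjust the batch size for the best throughput.

Auto-optimization script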

import torch
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline


def get_optimal_settings():
    """Detect GPU and choose appropriate batch sizes."""
    if not torch.cuda.is_available():
        print("CUDA not found. Falling back to CPU.")
        # Return usable CPU defaults instead of None so the converter still works
        acc_options = AcceleratorOptions(device=AcceleratorDevice.CPU)
        return acc_options, ThreadedPdfPipelineOptions(accelerator_options=acc_options)

    # Determine VRAM to pick the best batch size
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Detected GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.2f} GB VRAM)")

    # Tuning logic based on hardware tiers
    if vram_gb > 24:          # e.g., RTX 5090 (32 GB)
        b_size = 128
    elif vram_gb >= 20:       # e.g., RTX 4090 (24 GB)
        b_size = 64
    else:                     # e.g., RTX 5070 (12 GB) or lower
        b_size = 16

    acc_options = AcceleratorOptions(device=AcceleratorDevice.CUDA)

    # Accelerator options are passed as part of the pipeline options
    pipe_options = ThreadedPdfPipelineOptions(
        accelerator_options=acc_options,
        ocr_batch_size=b_size,
        layout_batch_size=b_size,
        table_batch_size=4,  # Tables are memory-intensive
    )

    return acc_options, pipe_options


# Initialise with optimized settings
acc_opts, pipe_opts = get_optimal_settings()

# DocumentConverter takes per-format options, not top-level accelerator/pipeline arguments
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline,
            pipeline_options=pipe_opts,
        )
    }
)

# Example conversion
result = converter.convert("document.pdf")

That's it! With a suitable NVIDIA RTX GPU and a few simple steps, Docling can process huge document collections faster than ever. 🚀

Document conversion example

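from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

# Reuses acc_opts and pipe_opts from the auto-optimization script
pipe_opts.accelerator_options = acc_opts
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline, pipeline_options=pipe_opts
        )
    }
)

# Convert your document
result = converter.convert("large_document.pdf")
print("Conversion complete!")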

A simpler example (no detection)

import datetime
import logging
import time
from pathlib import Path

import numpy as np
from pydantic import TypeAdapter

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline
from docling.utils.profiling import ProfilingItem

_log = logging.getLogger(__name__)

def main() -> None:
    # Reduce noise from the library logger
    logging.getLogger("docling").setLevel(logging.WARNING)
    _log.setLevel(logging.INFO)

    data_folder = Path(__file__).parent / "../../tests/data"
    # input_doc_path = data_folder / "pdf" / "2305.03393v1.pdf"  # 14 pages
    input_doc_path = data_folder / "pdf" / "redp5110_sampled.pdf"  # 18 pages

    pipeline_options = ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
        ocr_batch_size=4,
        layout_batch_size=64,
        table_batch_size=4,
    )
    pipeline_options.do_ocr = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    # Initialise pipeline
    start_time = time.time()
    doc_converter.initialize_pipeline(InputFormat.PDF)
    init_runtime = time.time() - start_time
    _log.info(f"Pipeline initialized in {init_runtime:.2f} seconds.")

    # Convert document
    start_time = time.time()
    conv_result = doc_converter.convert(input_doc_path)
    pipeline_runtime = time.time() - start_time
    assert conv_result.status == ConversionStatus.SUCCESS

    num_pages = len(conv_result.pages)
    _log.info(f"Document converted in {pipeline_runtime:.2f} seconds.")
    _log.info(f"  {num_pages / pipeline_runtime:.2f} pages/second.")

if __name__ == "__main__":
    main()

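Tips for maximizing GPU utilization

Memory monitoring – Run nvidia-smi -l 1 while your script is running to watch VRAM usage.
vLLM on Linux – Compared with Windows, the vLLM pipeline on Linux delivers roughly 4× the performance for vision-language models (VLMs).
Clear the cache – When processing many large files, call torch.cuda.empty_cache() between conversions to avoid out-of-memory errors.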
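The cache-clearing tip can be wrapped in a small loop helper. The sketch below is library-agnostic: `convert` and `cleanup` are whatever callables you pass in, and the interval of ten documents is an arbitrary choice for illustration. `convert_all` is a hypothetical name, not a Docling function.

```python
def convert_all(paths, convert, cleanup=None, every=10):
    """Convert each path, calling cleanup() after every `every` documents."""
    results = []
    for index, path in enumerate(paths, start=1):
        results.append(convert(path))
        # Release cached memory periodically rather than after each file
        if cleanup is not None and index % every == 0:
            cleanup()
    return results


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without a GPU or Docling installed
    done = convert_all(
        [f"doc_{i}.pdf" for i in range(25)],
        convert=lambda path: path.upper(),
        cleanup=lambda: print("cache cleared"),
    )
    print(len(done), "documents converted")
```

With Docling and PyTorch installed, the call would look like `convert_all(pdf_paths, converter.convert, cleanup=torch.cuda.empty_cache)`.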

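Why use a dedicated vLLM server?

The RTX 5090's 32 GB of GDDR7 VRAM is only fully utilized when vLLM is deployed as a server. This setup can deliver up to a 4× speedup for models such as granite-docling-258M.

Start the vLLM server (tuned for 32 GB VRAM)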

vllm serve ibm-granite/granite-docling-258M \
  --revision untied \
  --host 127.0.0.1 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 1024 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill

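Parameter reference

Parameter	Reason
`--revision untied`	Required for compatibility between current vLLM versions and the granite-docling architecture.
`--gpu-memory-utilization 0.9`	Allocates 90% of the 32 GB VRAM to model weights + KV cache.
`--max-num-seqs 1024`	Raises the cap on sequences processed concurrently, so the RTX 5090's large core count can work on many sequences in parallel.
`--max-num-batched-tokens 16384`	Sets the token budget per scheduling step, allowing large batches without crashing.
`--enable-chunked-prefill`	Splits long "prefill" requests (reading document pages) into chunks that can be batched together with decode steps.

Tip: If you hit OOM errors on very complex documents, lower --gpu-memory-utilization to 0.8.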
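To make the effect of that flag concrete, here is the arithmetic for a 32 GB card as a short sketch. The function name is hypothetical, and the figures are simple multiplication, not measurements.

```python
def vram_budget_gb(total_vram_gb, gpu_memory_utilization):
    """VRAM (in GB) that vLLM may claim for weights, activations, and KV cache."""
    return total_vram_gb * gpu_memory_utilization


if __name__ == "__main__":
    for util in (0.9, 0.8):
        budget = vram_budget_gb(32, util)
        print(f"utilization {util}: {budget:.1f} GB for vLLM, {32 - budget:.1f} GB left free")
    # utilization 0.9: 28.8 GB for vLLM, 3.2 GB left free
    # utilization 0.8: 25.6 GB for vLLM, 6.4 GB left free
```

Dropping from 0.9 to 0.8 therefore frees about 3.2 GB for anything else that shares the GPU.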

Connecting Docling to the vLLM server

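from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# NOTE: assumes the ApiVlmOptions / VlmPipeline API of recent Docling releases;
# check these names against your installed version.

# 1. Configure the VLM to point to your local vLLM server (OpenAI-compatible endpoint)
vlm_options = ApiVlmOptions(
    url="http://127.0.0.1:8000/v1/chat/completions",
    params={"model": "ibm-granite/granite-docling-258M", "max_tokens": 4096},
    prompt="Convert this page to docling.",
    timeout=90,
    response_format=ResponseFormat.DOCTAGS,
)

# 2. Set the pipeline to use the server-based VLM
pipeline_options = VlmPipelineOptions(enable_remote_services=True)
pipeline_options.vlm_options = vlm_options

# 3. Initialise the converter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# 4. Run high-speed conversion
result = converter.convert("massive_report.pdf")
print(result.document.export_to_markdown())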

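Key benefits

Large-scale batching – vLLM's PagedAttention lets the RTX 5090 handle far larger page batches than standard inference.
GDDR7 speed – Higher memory bandwidth speeds up the prefill phase (reading each page).
Blackwell architecture – Uses the CUDA 12.8 optimizations for 50-series GPUs, avoiding legacy-mode penalties.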

Useful links

NVIDIA Driver Downloads
NVIDIA CUDA Downloads
NVIDIA cuDNN Installation
Python Compatibility Matrix (PyTorch)
Llama.cpp Repository
