⚡️ Supercharge Your Document Workflows: Docling Now Unleashes the Power of NVIDIA RTX!

Published: 1 month ago (January 7, 2026, 03:52 AM GMT+9)

10 min read

Source: Dev.to

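What is NVIDIA RTX?

NVIDIA RTX (commonly expanded as Ray Tracing Texel eXtreme) is a professional visual-computing platform that transformed digital rendering by introducing dedicated hardware for real-time ray tracing and artificial intelligence. Built on recent architectures such as Blackwell, Ada Lovelace, and Ampere, RTX GPUs feature:

RT Cores – simulate the physical behavior of light (bounces, reflections, shadows).
Tensor Cores – accelerate AI workloads (e.g., DLSS for higher frame rates).

Beyond cinematic gaming, RTX gives creators and researchers large performance gains, enabling neural rendering and high-throughput data processing at up to 6x the speed of traditional CPU-based workflows.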

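Why use RTX with Docling?

Moving the heavy lifting from the CPU to an NVIDIA RTX GPU can cut processing time by up to 6x. That is not a minor tweak; it is a performance jump that changes how you handle work such as:

Use case	Benefit
Large batches	Process thousands of pages in a short time.
High-throughput workflows	Keep production pipelines running at full speed.
Advanced models	Experiment with complex document-understanding models without the lag.

Docling is designed to be plug-and-play. Once the NVIDIA driver, CUDA Toolkit, and cuDNN are installed, Docling detects and uses the RTX GPU automatically.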

Quick setup

1. Check your hardware

nvidia-smi

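Check the "CUDA Version" field in the output: it is the highest CUDA version the installed driver supports, and it should be at least as new as the CUDA build you plan to install.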
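The same check can be scripted. The sketch below assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` CSV output; `parse_gpu_line` and `query_gpus` are hypothetical helper names for illustration, not part of Docling or the NVIDIA tooling.

```python
import subprocess


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row of the form 'name, driver_version, memory.total'."""
    name, driver, memory = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "driver": driver,
        # With the 'nounits' format option, memory is a bare number of MiB
        "vram_gib": round(int(memory) / 1024, 1),
    }


def query_gpus() -> list:
    """Ask nvidia-smi about every GPU; raises if the tool is missing or fails."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,driver_version,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]
```

Calling `query_gpus()` on a machine with a working driver returns one dictionary per GPU, which is handy for logging the hardware a batch job actually ran on.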

2. Install PyTorch with CUDA support

Pick the index URL that matches your CUDA toolkit version.

For CUDA 12.8

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

For CUDA 13.0

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
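Both commands follow the same URL pattern, so the index URL can be derived from the CUDA version. This is a minimal sketch that only extrapolates from the two URLs shown above; `pytorch_index_url` and `pip_install_command` are hypothetical helpers, and not every CUDA version has a published wheel index, so treat the result as a candidate to verify.

```python
def pytorch_index_url(cuda_version: str) -> str:
    """Map a CUDA version such as '12.8' to a PyTorch wheel index URL."""
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


def pip_install_command(cuda_version: str) -> str:
    """Build the pip command used in this guide for the given CUDA version."""
    return (
        "pip install torch torchvision torchaudio --index-url "
        + pytorch_index_url(cuda_version)
    )


print(pip_install_command("12.8"))
```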

3. Run Docling

from docling.document_converter import DocumentConverter

converter = DocumentConverter()   # Automatically detects GPU!
result = converter.convert("document.pdf")

GPU-specific batch-size recommendations

RTX model	VRAM	Suggested OCR / layout batch size
RTX 5090	32 GB	64 – 128
RTX 4090	24 GB	32 – 64
RTX 5070	12 GB	16 – 32

OS-specific guide

Feature	Windows 10/11	Linux (Ubuntu/Debian, etc.)
Driver install	Download manually from the NVIDIA website.	Use `apt`/`dnf` or download from the NVIDIA site.
Verification	Run `nvidia-smi` in PowerShell or CMD.	Run `nvidia-smi` in a terminal.
VLM inference	`llama-server` (llama.cpp) – recommended.	`vLLM` – recommended for high performance.
Max performance	Available through WSL2 (Windows Subsystem for Linux).	Native performance on Linux.

Note: The PyTorch install command is the same on both platforms; make sure the CUDA build you choose is supported by the installed driver.

Vision-language model (VLM) inference

Linux (vLLM) – about 4x faster than `llama-server`

vllm serve ibm-granite/granite-docling-258M \
    --host 127.0.0.1 \
    --port 8000 \
    --gpu-memory-utilization 0.9

Windows (llama‑server)

.\llama-server.exe --hf-repo ibm-granite/granite-docling-258M-GGUF -ngl -1 --port 8000
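Before pointing Docling at either server, it helps to confirm that it is actually listening. The sketch below assumes the server exposes the OpenAI-compatible `/v1/models` route on the port used above; `list_served_models` is a hypothetical helper name.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional


def list_served_models(base_url: str, timeout: float = 2.0) -> Optional[List[str]]:
    """Return the model IDs a server reports, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response
        return None
    return [model["id"] for model in payload.get("data", [])]
```

For example, `list_served_models("http://127.0.0.1:8000")` returns `None` while the server is still loading the model and a list of model IDs once it is ready.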

💡 Quick troubleshooting tip

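import torch

print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    # get_device_name raises when no CUDA device is present, so guard the call
    print(f"GPU: {torch.cuda.get_device_name(0)}")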

The script above checks whether CUDA is available and prints the name of the detected GPU. Use that to look up how much VRAM the card has and tune the batch sizes for the best throughput.
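To see the VRAM figures directly instead of looking them up, PyTorch can report free and total memory per device. This is a minimal sketch; `format_gib` and `report_vram` are hypothetical helper names, and PyTorch is imported lazily so the formatting helper also works on machines without it.

```python
def format_gib(num_bytes: int) -> str:
    """Render a byte count as GiB with two decimals."""
    return f"{num_bytes / 1024**3:.2f} GiB"


def report_vram(device_index: int = 0) -> str:
    """Describe free and total VRAM on one CUDA device, or note its absence."""
    import torch  # imported here so format_gib works without PyTorch installed

    if not torch.cuda.is_available():
        return "CUDA not available"
    free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
    name = torch.cuda.get_device_name(device_index)
    return f"{name}: {format_gib(free_bytes)} free of {format_gib(total_bytes)}"
```

Printing `report_vram()` before and after a conversion gives a rough idea of how much headroom is left for larger batch sizes.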

Automatic optimization script

import torch
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline


def get_optimal_settings():
    """Detect the GPU and choose appropriate batch sizes."""
    if not torch.cuda.is_available():
        print("CUDA not found. Falling back to CPU.")
        # Keep Docling's default batch sizes and pin the pipeline to the CPU
        acc_options = AcceleratorOptions(device=AcceleratorDevice.CPU)
        return acc_options, ThreadedPdfPipelineOptions(accelerator_options=acc_options)

    # Determine VRAM to pick the best batch size
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Detected GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.2f} GB VRAM)")

    # Tuning logic based on hardware tiers
    if vram_gb > 24:          # e.g., RTX 5090 (32 GB)
        b_size = 128
    elif vram_gb >= 20:       # e.g., RTX 4090 (24 GB)
        b_size = 64
    else:                     # e.g., RTX 5070 (12 GB) or lower
        b_size = 16

    acc_options = AcceleratorOptions(device=AcceleratorDevice.CUDA)

    # Accelerator options are a field of the pipeline options
    pipe_options = ThreadedPdfPipelineOptions(
        accelerator_options=acc_options,
        ocr_batch_size=b_size,
        layout_batch_size=b_size,
        table_batch_size=4,  # Tables are memory-intensive
    )

    return acc_options, pipe_options


# Initialise with optimized settings
acc_opts, pipe_opts = get_optimal_settings()

# Pipeline options are passed per input format, not directly to DocumentConverter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline,
            pipeline_options=pipe_opts,
        )
    }
)

# Example conversion
result = converter.convert("document.pdf")

That's it! With a suitable NVIDIA RTX GPU and a few simple steps, Docling can work through large document collections far faster than on the CPU alone. 🚀

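Document conversion example

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline,
            pipeline_options=pipe_opts,
        )
    }
)

# Convert your document
result = converter.convert("large_document.pdf")
print("Conversion complete!")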

Simple example (no detection)

import logging
import time
from pathlib import Path

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

_log = logging.getLogger(__name__)

def main() -> None:
    # Reduce noise from the library logger
    logging.getLogger("docling").setLevel(logging.WARNING)
    _log.setLevel(logging.INFO)

    data_folder = Path(__file__).parent / "../../tests/data"
    # input_doc_path = data_folder / "pdf" / "2305.03393v1.pdf"  # 14 pages
    input_doc_path = data_folder / "pdf" / "redp5110_sampled.pdf"  # 18 pages

    pipeline_options = ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
        ocr_batch_size=4,
        layout_batch_size=64,
        table_batch_size=4,
    )
    pipeline_options.do_ocr = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    # Initialise pipeline
    start_time = time.time()
    doc_converter.initialize_pipeline(InputFormat.PDF)
    init_runtime = time.time() - start_time
    _log.info(f"Pipeline initialized in {init_runtime:.2f} seconds.")

    # Convert document
    start_time = time.time()
    conv_result = doc_converter.convert(input_doc_path)
    pipeline_runtime = time.time() - start_time
    assert conv_result.status == ConversionStatus.SUCCESS

    num_pages = len(conv_result.pages)
    _log.info(f"Document converted in {pipeline_runtime:.2f} seconds.")
    _log.info(f"  {num_pages / pipeline_runtime:.2f} pages/second.")

if __name__ == "__main__":
    main()

Tips for maximizing GPU utilization

Monitor memory – run nvidia-smi -l 1 while your script executes to watch VRAM usage.
vLLM on Linux – a vLLM pipeline on Linux runs vision-language models (VLMs) roughly 4x faster than llama-server on Windows.
Clear the cache – when processing many large files, call torch.cuda.empty_cache() between conversions to reduce the risk of "Out of Memory" errors.
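The cache-clearing tip can be wrapped in a small loop. The sketch below is deliberately generic: `convert_many` is a hypothetical helper that takes the conversion and cleanup steps as callables, so in practice you would pass `converter.convert` and `torch.cuda.empty_cache`.

```python
from typing import Callable, Dict, Iterable


def convert_many(
    paths: Iterable[str],
    convert: Callable[[str], object],
    cleanup: Callable[[], None],
) -> Dict[str, object]:
    """Convert each path in turn, running cleanup after every document."""
    results = {}
    for path in paths:
        try:
            results[path] = convert(path)
        finally:
            # Runs even when a conversion raises, so memory is still released
            cleanup()
    return results
```

A typical call would look like `convert_many(["a.pdf", "b.pdf"], converter.convert, torch.cuda.empty_cache)`. Note that a failed conversion still propagates its exception after the cleanup has run.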

Why use a dedicated vLLM server

The RTX 5090's 32 GB of GDDR7 VRAM is only fully exploited with a server-side vLLM deployment. With this setup you can expect up to a 4x speedup on models such as granite-docling-258M.

Running the vLLM server (tuned for 32 GB VRAM)

vllm serve ibm-granite/granite-docling-258M \
  --revision untied \
  --host 127.0.0.1 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 1024 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill

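Flag explanations

Flag	Why
`--revision untied`	Required for compatibility between current vLLM versions and the granite-docling architecture.
`--gpu-memory-utilization 0.9`	Allocates 90% of the 32 GB VRAM to the model plus KV cache.
`--max-num-seqs 1024`	Allows highly parallel sequence processing that makes use of the RTX 5090's large core count.
`--max-num-batched-tokens 16384`	Supports large-batch inference without crashing.
`--enable-chunked-prefill`	Splits long "prefill" work (reading document pages) into chunks that can be scheduled alongside decoding, which keeps the GPU busy.

Tip: If you hit OOM errors on very complex documents, lower --gpu-memory-utilization to 0.8.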
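If you launch the server from a script, the memory setting can be made a parameter so the OOM fallback is a one-line change. This is a minimal sketch using only the flags listed above; `vllm_serve_command` is a hypothetical helper name.

```python
from typing import List


def vllm_serve_command(gpu_memory_utilization: float = 0.9) -> List[str]:
    """Build the vllm serve command from this guide with a tunable memory share."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in the range (0, 1]")
    return [
        "vllm", "serve", "ibm-granite/granite-docling-258M",
        "--revision", "untied",
        "--host", "127.0.0.1",
        "--port", "8000",
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-num-seqs", "1024",
        "--max-num-batched-tokens", "16384",
        "--enable-chunked-prefill",
    ]


print(" ".join(vllm_serve_command(0.8)))
```

The returned list can be handed to `subprocess.run` or `subprocess.Popen` as is, which avoids shell-quoting issues.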

Connecting Docling to the vLLM server

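from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# 1. Point the VLM at the local vLLM server (OpenAI-compatible endpoint).
#    Assumes a Docling version that provides ApiVlmOptions.
vlm_options = ApiVlmOptions(
    url="http://127.0.0.1:8000/v1/chat/completions",
    params={"model": "ibm-granite/granite-docling-258M"},
    prompt="Convert this page to docling.",
    response_format=ResponseFormat.DOCTAGS,
)

# 2. Configure the pipeline to use the server-based VLM
pipeline_options = VlmPipelineOptions(enable_remote_services=True)
pipeline_options.vlm_options = vlm_options

# 3. Initialise the converter
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# 4. Run the conversion
result = converter.convert("massive_report.pdf")
print(result.document.export_to_markdown())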

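Key advantages

Large-batch processing – vLLM's PagedAttention lets the RTX 5090 handle much larger page batches than standard inference.
GDDR7 speed – the high memory bandwidth accelerates the prefill stage (reading each page).
Blackwell architecture – uses the CUDA 12.8 optimizations specific to 50-series GPUs, avoiding legacy-mode penalties.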

Further reading

Original launch-command guide
Docling documentation
Docling project repository
GPU support overview
GPU performance examples

Ready to level up? See the Docling GPU support guide for more examples and troubleshooting tips.

Useful links

NVIDIA driver downloads
NVIDIA CUDA downloads
NVIDIA cuDNN installation
Python compatibility matrix (PyTorch)
llama.cpp repository
