[Paper] Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing

Published: December 19, 2025 at 08:40 AM EST
3 min read

Source: arXiv - 2512.17574v1

Overview

This paper tackles the hidden performance bottlenecks that appear when deploying multimodal large language models (MLLMs) – LLMs that can understand images and video. By redesigning how video decoding and the vision‑encoder stage are scheduled on GPUs, the authors achieve up to 3× more requests and 4.4× higher throughput compared with existing pipelines, making latency‑sensitive MLLM services far more practical for real‑world applications.

Key Contributions

  • FlashCodec – a collaborative multi‑GPU video decoder that keeps decoding latency low while still delivering high throughput, eliminating the CPU‑bound bottleneck that dominates Time‑to‑First‑Token (TTFT).
  • UnifiedServe – a GPU‑internal scheduler that logically separates the vision‑encoder and LLM inference stages but physically shares GPU compute and memory, removing inter‑stage blocking and improving overall utilization.
  • End‑to‑end stack that combines both techniques, delivering up to 3.0× more concurrent requests or 1.5× tighter SLOs with 4.4× higher throughput versus the best prior systems.
  • Comprehensive evaluation on real video‑question answering workloads showing consistent gains across different model sizes and hardware configurations.

Methodology

  1. Profiling the MLLM pipeline – The authors first break down the three‑stage workflow (multimodal preprocessing → vision encoder → LLM inference) and measure where latency spikes occur (a minimal profiling sketch follows this list).
  2. FlashCodec design (see the decoding sketch after this list)
    • Splits video frames across multiple GPUs.
    • Uses a lightweight inter‑GPU communication layer to stitch decoded frames back together.
    • Keeps the decoder on‑GPU to avoid costly CPU‑GPU data transfers.
  3. UnifiedServe scheduler (see the scheduling sketch after this list)
    • Introduces a logical decoupling: the vision encoder and LLM inference are treated as independent tasks in a dependency graph.
    • Implements physical sharing: both tasks run on the same GPU, with fine‑grained time‑slicing and memory partitioning so that idle resources from one stage can be reclaimed by the other.
    • Employs a lightweight priority scheme to guarantee that the latency‑critical LLM decoding step is never starved.
  4. Integration & evaluation – The two components are combined into a single serving stack and benchmarked on a cluster of NVIDIA A100 GPUs using popular video‑QA datasets (e.g., MS‑VQA, ActivityNet‑QA).
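
As a rough illustration of step 1, the sketch below times each of the three stages for a single request. The callables `preprocess`, `encode_vision`, and `llm_generate` are hypothetical stand-ins for the real stage implementations, not the authors' code.

```python
import time
from contextlib import contextmanager

import torch


@contextmanager
def stage_timer(name, results):
    """Time one pipeline stage, synchronizing so queued GPU work is counted."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    yield
    torch.cuda.synchronize()
    results[name] = time.perf_counter() - start


def profile_request(video_bytes, preprocess, encode_vision, llm_generate):
    """Break one MLLM request into the three stages and time each of them."""
    results = {}
    with stage_timer("preprocess", results):
        frames = preprocess(video_bytes)        # video decode, frame sampling, resize
    with stage_timer("vision_encoder", results):
        vis_tokens = encode_vision(frames)      # e.g. ViT forward producing visual embeddings
    with stage_timer("llm_inference", results):
        answer = llm_generate(vis_tokens)       # LLM prefill + autoregressive decode
    return answer, results
```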
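
The FlashCodec idea in step 2, splitting one clip's frames across GPUs and stitching the decoded tensors back together, can be pictured roughly as below. Here `decode_chunk` is a placeholder for an on-GPU decoder (e.g. NVDEC-backed); this is a sketch of the concept, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

import torch


def decode_chunk(chunk_bytes, device, frames_per_chunk=32):
    """Placeholder for an on-GPU video decoder.

    A real implementation would decode `chunk_bytes` (e.g. via NVDEC) directly
    into GPU memory on `device`, avoiding any CPU-GPU frame transfer.
    """
    return torch.empty(frames_per_chunk, 3, 224, 224, device=device)


def collaborative_decode(video_chunks, devices, target_device="cuda:0"):
    """Decode chunks of one video in parallel across `devices`, then gather
    the decoded frames on `target_device` for the vision encoder.

    The gather step models the lightweight inter-GPU communication layer:
    device-to-device copies over NVLink/PCIe stitch the clip back together.
    """
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        parts = list(pool.map(decode_chunk, video_chunks, devices))
    return torch.cat([p.to(target_device, non_blocking=True) for p in parts], dim=0)
```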
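
One way to picture UnifiedServe's logical decoupling with physical sharing (step 3) is two CUDA streams on the same GPU: the latency-critical LLM decode runs on a higher-priority stream, while the vision encoder overlaps on a lower-priority one and soaks up leftover compute. This is an illustrative sketch under those assumptions, not the paper's scheduler; `llm_decode` and `vision_encode` are hypothetical stage callables.

```python
import torch

DEVICE = "cuda:0"

# Lower priority value = higher scheduling priority in CUDA. The LLM decode
# stream gets priority so the latency-critical path is never starved; the
# vision encoder fills in with whatever compute is left over.
llm_stream = torch.cuda.Stream(device=DEVICE, priority=-1)
vision_stream = torch.cuda.Stream(device=DEVICE, priority=0)


def scheduling_step(llm_decode, vision_encode, llm_batch, pending_frames):
    """Issue one LLM decode iteration and one vision-encoder forward pass on
    the same GPU, on separate streams, so the two stages overlap instead of
    blocking each other."""
    with torch.cuda.stream(llm_stream):
        next_tokens = llm_decode(llm_batch)            # latency-critical work
    with torch.cuda.stream(vision_stream):
        embeddings = vision_encode(pending_frames)     # throughput work, overlapped
    torch.cuda.synchronize(DEVICE)                     # join both stages for this step
    return next_tokens, embeddings
```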

Results & Findings

| Metric | Baseline (CPU decode + separate GPUs) | FlashCodec + UnifiedServe |
| --- | --- | --- |
| TTFT (first-token latency) | 1.8 s | 0.9 s (≈ 2× faster) |
| Throughput (queries/s) | 12 | 52 (≈ 4.4×) |
| Max concurrent requests under a 2 s SLO | 30 | 90 (≈ 3×) |
| GPU utilization (average) | 38 % | 78 % |

The gains come primarily from:

  • Eliminating CPU‑GPU transfer overhead during video decoding.
  • Overlapping vision‑encoder compute with LLM prefill/decoding via UnifiedServe’s shared‑GPU scheduling.
  • Better memory packing, allowing larger batches of visual embeddings to stay resident on‑GPU.

Practical Implications

  • Lower latency for interactive AI assistants that need to process video clips on‑the‑fly (e.g., real‑time video chat, AR/VR guidance).
  • Higher request density per GPU, meaning cloud providers can serve more customers with the same hardware budget, reducing cost per token.
  • Simplified deployment: developers no longer need separate CPU‑heavy decoding services; a single GPU node can handle the full MLLM stack.
  • Scalable to larger models – because UnifiedServe dynamically reallocates GPU memory, it can accommodate future vision encoders that are even more compute‑intensive without redesigning the serving infrastructure.

Limitations & Future Work

  • Hardware dependence: FlashCodec assumes multiple GPUs with high‑speed NVLink or PCIe interconnects; performance may degrade on single‑GPU or low‑bandwidth setups.
  • Video codec support: The current implementation focuses on H.264/H.265; extending to newer codecs (AV1, VVC) will require additional engineering.
  • Scheduler overhead: While lightweight, the fine‑grained time‑slicing adds a small constant overhead that could become noticeable for ultra‑low‑latency (< 100 ms) use cases.
  • Future directions suggested by the authors include: integrating on‑GPU video compression to further reduce memory traffic, exploring adaptive batch sizing based on runtime load, and generalizing UnifiedServe to other heterogeneous pipelines (e.g., audio‑to‑text models).

Authors

  • Lingxiao Zhao
  • Haoran Zhou
  • Yuezhi Che
  • Dazhao Cheng

Paper Information

  • arXiv ID: 2512.17574v1
  • Categories: cs.DC, cs.LG
  • Published: December 19, 2025
