[Paper] Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
Source: arXiv - 2512.17574v1
Overview
This paper tackles the hidden performance bottlenecks that appear when deploying multimodal large language models (MLLMs) – LLMs that can also understand images and video. By redesigning how video decoding and the vision‑encoder stage are scheduled on GPUs, the authors serve up to 3× more concurrent requests under the same latency SLO and achieve up to 4.4× higher throughput than existing serving pipelines, making latency‑sensitive MLLM services far more practical for real‑world deployment.
Key Contributions
- FlashCodec – a collaborative multi‑GPU video decoder that keeps decoding latency low while still delivering high throughput, eliminating the CPU‑bound bottleneck that dominates Time‑to‑First‑Token (TTFT).
- UnifiedServe – a GPU‑internal scheduler that logically separates the vision‑encoder and LLM inference stages but physically shares GPU compute and memory, removing inter‑stage blocking and improving overall utilization.
- End‑to‑end stack that combines both techniques, delivering up to 3.0× more concurrent requests or 1.5× tighter SLOs with 4.4× higher throughput versus the best prior systems.
- Comprehensive evaluation on real video‑question answering workloads showing consistent gains across different model sizes and hardware configurations.
Methodology
- Profiling the MLLM pipeline – The authors first break down the three‑stage workflow (multimodal preprocessing → vision encoder → LLM inference) and measure where latency spikes occur.
- FlashCodec design (a minimal sketch follows this list)
- Splits video frames across multiple GPUs.
- Uses a lightweight inter‑GPU communication layer to stitch the decoded frames back together in order.
- Keeps the decoder on‑GPU to avoid costly CPU‑GPU data transfers.
- UnifiedServe scheduler (a scheduling sketch follows this list)
- Introduces a logical decoupling: the vision encoder and LLM inference are treated as independent tasks in a dependency graph.
- Implements physical sharing: both tasks run on the same GPU, with fine‑grained time‑slicing and memory partitioning so that idle resources from one stage can be reclaimed by the other.
- Employs a lightweight priority scheme to guarantee that the latency‑critical LLM decoding step is never starved.
- Integration & evaluation – The two components are combined into a single serving stack and benchmarked on a cluster of NVIDIA A100 GPUs using popular video‑QA datasets (e.g., MS‑VQA, ActivityNet‑QA).
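To make the FlashCodec design above concrete, here is a minimal Python/PyTorch sketch of collaborative multi‑GPU decoding. The `decode_segment` helper is a hypothetical stand‑in for an on‑GPU hardware decoder (e.g., NVDEC); the paper's actual kernels, keyframe‑aware splitting, and stitching layer are more involved.

```python
# Minimal sketch of collaborative multi-GPU video decoding in the spirit of
# FlashCodec. `decode_segment` is a hypothetical stand-in for an on-GPU
# hardware decoder; the real system is considerably more sophisticated.
from concurrent.futures import ThreadPoolExecutor
import torch

def decode_segment(video_path: str, frame_ids: list, device: torch.device) -> torch.Tensor:
    """Hypothetical on-GPU decode of the given frames; returns [N, C, H, W]."""
    # Real code would drive the GPU's hardware decoder directly so the decoded
    # frames never leave device memory; here we only allocate the output.
    return torch.empty(len(frame_ids), 3, 224, 224, device=device)

def flashcodec_decode(video_path: str, num_frames: int, devices: list) -> torch.Tensor:
    # 1. Split the frame range into contiguous chunks, one per GPU
    #    (the real system would align splits with keyframe/GOP boundaries).
    bounds = [round(i * num_frames / len(devices)) for i in range(len(devices) + 1)]
    chunks = [list(range(bounds[i], bounds[i + 1])) for i in range(len(devices))]
    # 2. Decode all chunks in parallel, one worker thread per GPU.
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        parts = list(pool.map(decode_segment,
                              [video_path] * len(devices), chunks, devices))
    # 3. Stitch the decoded segments back together on the first GPU, in frame order.
    return torch.cat([p.to(devices[0]) for p in parts], dim=0)

# Example: decode 256 frames cooperatively across two GPUs.
# frames = flashcodec_decode("clip.mp4", 256,
#                            [torch.device("cuda:0"), torch.device("cuda:1")])
```

Keeping every tensor on a GPU device throughout mirrors the paper's goal of eliminating CPU‑GPU round trips during decoding.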
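The scheduling idea behind UnifiedServe can likewise be sketched in a few lines. The sketch below assumes a hypothetical `Task` interface whose `run_step` executes one fine‑grained step on the shared GPU and reports whether work remains; each scheduling slice drains pending LLM prefill/decode steps first, then back‑fills idle budget with vision‑encoder chunks. This illustrates the priority rule only, not the paper's actual scheduler or its memory‑partitioning logic.

```python
# Illustrative sketch of UnifiedServe-style scheduling: vision-encoder work and
# LLM inference are logically separate queues but share one physical GPU.
from collections import deque
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    # Executes one fine-grained step on the shared GPU and returns True
    # while more steps remain for this request. (Hypothetical interface.)
    run_step: Callable[[], bool]

@dataclass
class UnifiedServeScheduler:
    llm_tasks: deque = field(default_factory=deque)      # prefill/decode steps
    encoder_tasks: deque = field(default_factory=deque)  # vision-encoder chunks

    def submit_llm(self, task: Task) -> None:
        self.llm_tasks.append(task)

    def submit_encoder(self, task: Task) -> None:
        self.encoder_tasks.append(task)

    def run_slice(self, encoder_budget: int = 1) -> None:
        """One scheduling slice on the shared GPU."""
        # Priority rule: latency-critical LLM steps always run first, so the
        # decoding stage is never starved by encoder work.
        for _ in range(len(self.llm_tasks)):
            task = self.llm_tasks.popleft()
            if task.run_step():
                self.llm_tasks.append(task)   # more prefill/decode steps remain
        # Back-fill the remainder of the slice with vision-encoder chunks,
        # reclaiming GPU time that would otherwise sit idle.
        for _ in range(min(encoder_budget, len(self.encoder_tasks))):
            task = self.encoder_tasks.popleft()
            if task.run_step():
                self.encoder_tasks.append(task)
```

In the real system the slice budget would presumably be derived from profiled kernel times and available memory rather than a fixed task count.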
Results & Findings
| Metric | Baseline (CPU decode + separate GPUs) | FlashCodec + UnifiedServe |
|---|---|---|
| TTFT (first token latency) | 1.8 s | 0.9 s (≈ 2× faster) |
| Throughput (queries / s) | 12 | 52 (≈ 4.4×) |
| Max concurrent requests under 2 s SLO | 30 | 90 (≈ 3×) |
| GPU utilization (average) | 38 % | 78 % |
The gains come primarily from:
- Eliminating CPU‑GPU transfer overhead during video decoding.
- Overlapping vision‑encoder compute with LLM prefill/decoding via UnifiedServe’s shared‑GPU scheduling (see the stream‑overlap sketch after this list).
- Better memory packing, allowing larger batches of visual embeddings to stay resident on‑GPU.
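For intuition on the second point, a rough single‑GPU approximation of such overlap uses separate CUDA streams for the two stages, as in the PyTorch sketch below. The `vision_encoder` and `llm` modules and the input tensors are placeholders; UnifiedServe’s actual mechanism combines fine‑grained time‑slicing with memory partitioning rather than relying on stream concurrency alone.

```python
# Rough approximation of vision-encoder / LLM overlap on one GPU using two
# CUDA streams. All modules and tensors here are placeholders.
import torch

device = torch.device("cuda")
encoder_stream = torch.cuda.Stream(device=device)
llm_stream = torch.cuda.Stream(device=device)

vision_encoder = torch.nn.Linear(1024, 1024).to(device)  # placeholder modules
llm = torch.nn.Linear(1024, 1024).to(device)

frames = torch.randn(64, 1024, device=device)        # decoded frames (placeholder)
prompt_hidden = torch.randn(8, 1024, device=device)  # LLM prefill input (placeholder)

# Launch the vision encoder and the LLM prefill on independent streams so
# their kernels can interleave on the same GPU instead of serializing.
with torch.cuda.stream(encoder_stream):
    visual_embeddings = vision_encoder(frames)
with torch.cuda.stream(llm_stream):
    prefill_out = llm(prompt_hidden)

# Before the LLM consumes the visual embeddings, make its stream wait for the
# encoder stream to finish producing them.
llm_stream.wait_stream(encoder_stream)
with torch.cuda.stream(llm_stream):
    fused_out = llm(visual_embeddings)

torch.cuda.synchronize()
```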
Practical Implications
- Lower latency for interactive AI assistants that need to process video clips on‑the‑fly (e.g., real‑time video chat, AR/VR guidance).
- Higher request density per GPU, meaning cloud providers can serve more customers with the same hardware budget, reducing cost per token.
- Simplified deployment: developers no longer need separate CPU‑heavy decoding services; a single GPU node can handle the full MLLM stack.
- Scalable to larger models – because UnifiedServe dynamically reallocates GPU memory, it can accommodate future vision encoders that are even more compute‑intensive without redesigning the serving infrastructure.
Limitations & Future Work
- Hardware dependence: FlashCodec assumes multiple GPUs with high‑speed NVLink or PCIe interconnects; performance may degrade on single‑GPU or low‑bandwidth setups.
- Video codec support: The current implementation focuses on H.264/H.265; extending to newer codecs (AV1, VVC) will require additional engineering.
- Scheduler overhead: While lightweight, the fine‑grained time‑slicing adds a small constant overhead that could become noticeable for ultra‑low‑latency (< 100 ms) use cases.
- Future directions suggested by the authors include integrating on‑GPU video compression to further reduce memory traffic, adaptive batch sizing based on runtime load, and generalizing UnifiedServe to other heterogeneous pipelines (e.g., audio‑to‑text models).
Authors
- Lingxiao Zhao
- Haoran Zhou
- Yuezhi Che
- Dazhao Cheng
Paper Information
- arXiv ID: 2512.17574v1
- Categories: cs.DC, cs.LG
- Published: December 19, 2025