[Paper] Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing
Source: arXiv - 2512.17574v1
Overview
This paper tackles the hidden performance bottlenecks that appear when deploying multimodal large language models (MLLMs) – LLMs that can also understand images and video. By redesigning how video decoding and the vision‑encoder stage are scheduled on GPUs, the authors serve up to 3× more concurrent requests under the same latency SLO and achieve up to 4.4× higher throughput than existing serving pipelines, making latency‑sensitive MLLM services far more practical for real‑world deployment.
Key Contributions
- FlashCodec – a collaborative multi‑GPU video decoder that keeps decoding latency low while still delivering high throughput, eliminating the CPU‑bound bottleneck that dominates Time‑to‑First‑Token (TTFT).
- UnifiedServe – a GPU‑internal scheduler that logically separates the vision‑encoder and LLM inference stages but physically shares GPU compute and memory, removing inter‑stage blocking and improving overall utilization.
- End‑to‑end stack that combines both techniques, delivering up to 3.0× more concurrent requests or 1.5× tighter SLOs with 4.4× higher throughput versus the best prior systems.
- Comprehensive evaluation on real video‑question answering workloads showing consistent gains across different model sizes and hardware configurations.
Methodology
- Profiling the MLLM pipeline – The authors first break down the three‑stage workflow (multimodal preprocessing → vision encoder → LLM inference) and measure where latency spikes occur.
- FlashCodec design (a minimal sketch follows this list)
- Splits video frames across multiple GPUs.
- Uses a lightweight inter‑GPU communication layer to stitch the decoded frames back together in order.
- Keeps the decoder on‑GPU to avoid costly CPU‑GPU data transfers.
- UnifiedServe scheduler (a scheduling sketch follows this list)
- Introduces a logical decoupling: the vision encoder and LLM inference are treated as independent tasks in a dependency graph.
- Implements physical sharing: both tasks run on the same GPU, with fine‑grained time‑slicing and memory partitioning so that idle resources from one stage can be reclaimed by the other.
- Employs a lightweight priority scheme to guarantee that the latency‑critical LLM decoding step is never starved.
- Integration & evaluation – The two components are combined into a single serving stack and benchmarked on a cluster of NVIDIA A100 GPUs using popular video‑QA datasets (e.g., MS‑VQA, ActivityNet‑QA).
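To make the FlashCodec design above concrete, here is a minimal Python/PyTorch sketch of collaborative multi‑GPU decoding. The `decode_segment` helper is a hypothetical stand‑in for an on‑GPU hardware decoder (e.g., NVDEC); the paper's actual kernels, keyframe‑aware splitting, and stitching layer are more involved.

```python
# Minimal sketch of collaborative multi-GPU video decoding in the spirit of
# FlashCodec. `decode_segment` is a hypothetical stand-in for an on-GPU
# hardware decoder; the real system is considerably more sophisticated.
from concurrent.futures import ThreadPoolExecutor
import torch

def decode_segment(video_path: str, frame_ids: list, device: torch.device) -> torch.Tensor:
    """Hypothetical on-GPU decode of the given frames; returns [N, C, H, W]."""
    # Real code would drive the GPU's hardware decoder directly so the decoded
    # frames never leave device memory; here we only allocate the output.
    return torch.empty(len(frame_ids), 3, 224, 224, device=device)

def flashcodec_decode(video_path: str, num_frames: int, devices: list) -> torch.Tensor:
    # 1. Split the frame range into contiguous chunks, one per GPU
    #    (the real system would align splits with keyframe/GOP boundaries).
    bounds = [round(i * num_frames / len(devices)) for i in range(len(devices) + 1)]
    chunks = [list(range(bounds[i], bounds[i + 1])) for i in range(len(devices))]
    # 2. Decode all chunks in parallel, one worker thread per GPU.
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        parts = list(pool.map(decode_segment,
                              [video_path] * len(devices), chunks, devices))
    # 3. Stitch the decoded segments back together on the first GPU, in frame order.
    return torch.cat([p.to(devices[0]) for p in parts], dim=0)

# Example: decode 256 frames cooperatively across two GPUs.
# frames = flashcodec_decode("clip.mp4", 256,
#                            [torch.device("cuda:0"), torch.device("cuda:1")])
```

Keeping every tensor on a GPU device throughout mirrors the paper's goal of eliminating CPU‑GPU round trips during decoding.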
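The scheduling idea behind UnifiedServe can likewise be sketched in a few lines. The sketch below assumes a hypothetical `Task` interface whose `run_step` executes one fine‑grained step on the shared GPU and reports whether work remains; each scheduling slice drains pending LLM prefill/decode steps first, then back‑fills idle budget with vision‑encoder chunks. This illustrates the priority rule only, not the paper's actual scheduler or its memory‑partitioning logic.

```python
# Illustrative sketch of UnifiedServe-style scheduling: vision-encoder work and
# LLM inference are logically separate queues but share one physical GPU.
from collections import deque
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    # Executes one fine-grained step on the shared GPU and returns True
    # while more steps remain for this request. (Hypothetical interface.)
    run_step: Callable[[], bool]

@dataclass
class UnifiedServeScheduler:
    llm_tasks: deque = field(default_factory=deque)      # prefill/decode steps
    encoder_tasks: deque = field(default_factory=deque)  # vision-encoder chunks

    def submit_llm(self, task: Task) -> None:
        self.llm_tasks.append(task)

    def submit_encoder(self, task: Task) -> None:
        self.encoder_tasks.append(task)

    def run_slice(self, encoder_budget: int = 1) -> None:
        """One scheduling slice on the shared GPU."""
        # Priority rule: latency-critical LLM steps always run first, so the
        # decoding stage is never starved by encoder work.
        for _ in range(len(self.llm_tasks)):
            task = self.llm_tasks.popleft()
            if task.run_step():
                self.llm_tasks.append(task)   # more prefill/decode steps remain
        # Back-fill the remainder of the slice with vision-encoder chunks,
        # reclaiming GPU time that would otherwise sit idle.
        for _ in range(min(encoder_budget, len(self.encoder_tasks))):
            task = self.encoder_tasks.popleft()
            if task.run_step():
                self.encoder_tasks.append(task)
```

In the real system the slice budget would presumably be derived from profiled kernel times and available memory rather than a fixed task count.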
Results & Findings
| Metric | Baseline (CPU decode + separate GPUs) | FlashCodec + UnifiedServe |
|---|---|---|
| TTFT (first token latency) | 1.8 s | 0.9 s (≈ 2× faster) |
| Throughput (queries / s) | 12 | 52 (≈ 4.4×) |
| Max concurrent requests under 2 s SLO | 30 | 90 (≈ 3×) |
| GPU utilization (average) | 38 % | 78 % |
The gains come primarily from:
- Eliminating CPU‑GPU transfer overhead during video decoding.
- Overlapping vision‑encoder compute with LLM prefill/decoding via UnifiedServe’s shared‑GPU scheduling (see the stream‑overlap sketch after this list).
- Better memory packing, allowing larger batches of visual embeddings to stay resident on‑GPU.
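For intuition on the second point, a rough single‑GPU approximation of such overlap uses separate CUDA streams for the two stages, as in the PyTorch sketch below. The `vision_encoder` and `llm` modules and the input tensors are placeholders; UnifiedServe’s actual mechanism combines fine‑grained time‑slicing with memory partitioning rather than relying on stream concurrency alone.

```python
# Rough approximation of vision-encoder / LLM overlap on one GPU using two
# CUDA streams. All modules and tensors here are placeholders.
import torch

device = torch.device("cuda")
encoder_stream = torch.cuda.Stream(device=device)
llm_stream = torch.cuda.Stream(device=device)

vision_encoder = torch.nn.Linear(1024, 1024).to(device)  # placeholder modules
llm = torch.nn.Linear(1024, 1024).to(device)

frames = torch.randn(64, 1024, device=device)        # decoded frames (placeholder)
prompt_hidden = torch.randn(8, 1024, device=device)  # LLM prefill input (placeholder)

# Launch the vision encoder and the LLM prefill on independent streams so
# their kernels can interleave on the same GPU instead of serializing.
with torch.cuda.stream(encoder_stream):
    visual_embeddings = vision_encoder(frames)
with torch.cuda.stream(llm_stream):
    prefill_out = llm(prompt_hidden)

# Before the LLM consumes the visual embeddings, make its stream wait for the
# encoder stream to finish producing them.
llm_stream.wait_stream(encoder_stream)
with torch.cuda.stream(llm_stream):
    fused_out = llm(visual_embeddings)

torch.cuda.synchronize()
```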
Practical Implications
- Lower latency for interactive AI assistants that need to process video clips on‑the‑fly (e.g., real‑time video chat, AR/VR guidance).
- Higher request density per GPU, meaning cloud providers can serve more customers with the same hardware budget, reducing cost per token.
- Simplified deployment: developers no longer need separate CPU‑heavy decoding services; a single GPU node can handle the full MLLM stack.
- Scalable to larger models – because UnifiedServe dynamically reallocates GPU memory, it can accommodate future vision encoders that are even more compute‑intensive without redesigning the serving infrastructure.
Limitations & Future Work
- Hardware dependence: FlashCodec assumes multiple GPUs with high‑speed NVLink or PCIe interconnects; performance may degrade on single‑GPU or low‑bandwidth setups.
- Video codec support: The current implementation focuses on H.264/H.265; extending to newer codecs (AV1, VVC) will require additional engineering.
- Scheduler overhead: While lightweight, the fine‑grained time‑slicing adds a small constant overhead that could become noticeable for ultra‑low‑latency (< 100 ms) use cases.
- Future directions suggested by the authors include integrating on‑GPU video compression to further reduce memory traffic, adaptive batch sizing based on runtime load, and generalizing UnifiedServe to other heterogeneous pipelines (e.g., audio‑to‑text models).
Authors
- Lingxiao Zhao
- Haoran Zhou
- Yuezhi Che
- Dazhao Cheng
Paper Information
- arXiv ID: 2512.17574v1
- Categories: cs.DC, cs.LG
- Published: December 19, 2025