[Paper] Cornserve: Efficiently Serving Any-to-Any Multimodal Models
Source: arXiv - 2512.14098v1
Overview
Cornserve is a new serving system designed for the fast‑growing family of Any‑to‑Any multimodal models—models that can take any mix of text, images, video, or audio as input and produce any mix of those modalities as output. By letting developers describe a model’s computation graph once and then automatically generating an optimized deployment plan, Cornserve bridges the gap between the flexibility of these models and the practical constraints of production inference.
Key Contributions
- Unified graph description: A simple DSL for developers to declare heterogeneous components (encoders, LLMs, diffusion generators, etc.) in a single model graph.
- Automatic planning & disaggregation: The planner decides whether to keep the model monolithic or split it into smaller services, based on workload patterns and component characteristics.
- Heterogeneity‑aware runtime: A distributed execution engine that schedules mixed‑modality sub‑tasks, balances GPU/CPU resources, and pipelines data across components.
- Performance gains: Empirical results show up to 3.81× higher throughput and 5.79× lower tail latency compared with existing serving stacks.
- Generality: Works across a wide range of Any‑to‑Any models, from text‑to‑image diffusion pipelines to video‑question‑answering systems.
Methodology
- Model Graph Specification – Developers write a lightweight description (similar to a DAG) that lists each stage, e.g., “image encoder → multimodal transformer → diffusion decoder” (a minimal sketch follows this list).
- Planner Phase –
  - Profiling: Cornserve runs a quick offline benchmark to measure compute cost, memory footprint, and data transfer size for each component.
  - Cost Model: It combines these measurements with the expected request mix (e.g., 40 % text‑to‑image, 20 % audio‑to‑text) to estimate overall latency and resource usage (a toy estimate follows this section).
  - Optimization: Using a mixed‑integer linear program, the planner decides:
    - Which components stay together on the same device.
    - Which should be split into separate micro‑services.
    - How many replicas each service needs.
- Distributed Runtime – At inference time, a request router parses the incoming modality combination, looks up the pre‑computed plan, and dispatches sub‑tasks to the appropriate workers. The runtime handles:
  - Heterogeneous hardware (GPU for diffusion, CPU for lightweight encoders).
  - Pipelining to overlap compute and data movement.
  - Dynamic scaling when request patterns shift.
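To make the graph description concrete, here is a minimal sketch of what such a specification could look like. The `Component`, `ModelGraph`, and `connect` names are hypothetical illustrations rather than Cornserve's actual DSL, and the model identifiers are placeholders.

```python
# Hypothetical sketch of an Any-to-Any model graph description.
# Class and method names are illustrative, not Cornserve's real API.
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str       # e.g., "image_encoder"
    kind: str       # "encoder", "llm", "diffusion", ...
    model_id: str   # placeholder identifier for the weights to load


@dataclass
class ModelGraph:
    components: dict[str, Component] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)

    def add(self, component: Component) -> Component:
        self.components[component.name] = component
        return component

    def connect(self, src: Component, dst: Component) -> None:
        # Intermediate data (embeddings, tokens, latents) flows src -> dst.
        self.edges.append((src.name, dst.name))


# "image encoder -> multimodal transformer -> diffusion decoder"
graph = ModelGraph()
encoder = graph.add(Component("image_encoder", "encoder", "vit-image-encoder"))
backbone = graph.add(Component("backbone", "llm", "multimodal-transformer"))
decoder = graph.add(Component("diffusion_decoder", "diffusion", "diffusion-decoder"))
graph.connect(encoder, backbone)
graph.connect(backbone, decoder)
```

Declaring each stage as a node in this way is what lets the planner reason about components individually instead of treating the model as an opaque monolith.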
The whole pipeline is built on top of existing container orchestration (Kubernetes) and inference frameworks (TensorRT, PyTorch Serve), so developers can adopt it without rewriting model code.
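As a rough illustration of the estimation step (the paper's planner solves a mixed‑integer linear program; the simple arithmetic below is only a stand‑in), the following snippet combines made‑up per‑component profiles with an assumed request mix to see where GPU time actually goes:

```python
# Toy planner-style estimate. All numbers are invented for illustration;
# the real planner combines profiled costs in a mixed-integer linear program.

# Profiled GPU time per request, in milliseconds, for each component.
profile_ms = {
    "image_encoder": 8.0,
    "audio_encoder": 5.0,
    "backbone_llm": 60.0,
    "diffusion_decoder": 900.0,
}

# Which components each request type passes through.
paths = {
    "text_to_image": ["backbone_llm", "diffusion_decoder"],
    "audio_to_text": ["audio_encoder", "backbone_llm"],
    "video_qa": ["image_encoder", "backbone_llm"],
}

# Assumed request mix (fractions sum to 1.0).
mix = {"text_to_image": 0.4, "audio_to_text": 0.2, "video_qa": 0.4}

# Expected GPU time each component must absorb for an average request.
demand_ms = {name: 0.0 for name in profile_ms}
for request_type, fraction in mix.items():
    for component in paths[request_type]:
        demand_ms[component] += fraction * profile_ms[component]

for name, ms in sorted(demand_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:18} {ms:7.1f} ms per average request")
```

Even this toy breakdown shows why the planner tends to give a heavy stage such as the diffusion decoder its own replicas: it dominates the expected GPU demand, while scaling the lightweight encoders along with it would waste capacity.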
Results & Findings
| Scenario | Baseline throughput (single service) | Cornserve throughput | Throughput speed‑up | Tail‑latency reduction |
|---|---|---|---|---|
| Text‑to‑Image (Stable Diffusion) | 45 req/s | 172 req/s | 3.81× | 5.79× |
| Audio‑to‑Text (Whisper + LLM) | 30 req/s | 92 req/s | 3.07× | 4.2× |
| Video‑Q&A (ViT encoder + LLM) | 12 req/s | 34 req/s | 2.83× | 3.9× |
Key takeaways
- Component‑level scaling (e.g., replicating only the diffusion decoder) yields far better resource utilization than scaling the whole monolithic model.
- Cross‑modality pipelining reduces idle GPU time, especially when a request mixes cheap encoders with expensive generators (see the sketch after this list).
- The planner’s decisions remain stable across typical workload fluctuations, and the runtime can re‑plan on the fly with minimal disruption.
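A minimal sketch of that pipelining idea, using asyncio coroutines as stand‑ins for Cornserve's runtime and invented stage times: while the expensive generator works on request i, the cheap encoder already prepares request i+1.

```python
# Minimal pipelining sketch with asyncio. Stage durations are invented and
# `encode` / `generate` stand in for real encoder and generator components.
import asyncio


async def encode(request_id: int) -> str:
    await asyncio.sleep(0.05)   # cheap encoder stage (~50 ms)
    return f"embedding-{request_id}"


async def generate(embedding: str) -> str:
    await asyncio.sleep(0.50)   # expensive generator stage (~500 ms)
    return f"output-for-{embedding}"


async def serve(num_requests: int) -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)

    async def encoder_stage() -> None:
        for i in range(num_requests):
            await queue.put(await encode(i))
        await queue.put(None)  # sentinel: no more requests

    async def generator_stage() -> None:
        while (embedding := await queue.get()) is not None:
            print(await generate(embedding))

    # Running both stages concurrently overlaps encoding of request i+1
    # with generation of request i, so the expensive stage never sits idle.
    await asyncio.gather(encoder_stage(), generator_stage())


asyncio.run(serve(4))
```

In the real system this overlap also hides data movement between components, not just encoder compute.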
Practical Implications
- Faster product features: Teams building AI‑powered editors, chat assistants, or content generation tools can serve richer multimodal interactions without over‑provisioning hardware.
- Cost savings: Because GPUs are allocated only to the heavy‑weight stages, cloud spend can drop dramatically, especially for bursty workloads where only a subset of components is needed.
- Simplified ops: Engineers no longer need to hand‑craft micro‑service boundaries for each new multimodal model; Cornserve’s planner does it automatically.
- Future‑proofing: As new Any‑to‑Any architectures (e.g., audio‑to‑video diffusion) appear, they can be plugged into the same serving stack with minimal code changes.
Limitations & Future Work
- Static profiling assumptions: The planner relies on offline benchmarks; sudden changes in input size (e.g., ultra‑high‑resolution images) may degrade the optimality of the plan.
- Hardware diversity: Current experiments focus on GPU‑centric clusters; extending the runtime to heterogeneous edge devices (TPUs, NPUs) is left for later work.
- Model‑specific optimizations: Some models benefit from custom kernels or quantization that Cornserve does not yet expose automatically.
- Dynamic workload adaptation: While re‑planning is supported, the latency of re‑optimization could be improved for ultra‑low‑latency services.
Overall, Cornserve demonstrates that a systematic, graph‑aware approach to serving can unlock the performance potential of today’s most flexible multimodal AI systems, making them viable for real‑world products.
Authors
- Jeff J. Ma
- Jae-Won Chung
- Jisang Ahn
- Yizhuo Liang
- Akshay Jajoo
- Myungjin Lee
- Mosharaf Chowdhury
Paper Information
- arXiv ID: 2512.14098v1
- Categories: cs.LG, cs.DC
- Published: December 16, 2025