[Paper] Cornserve: Efficiently Serving Any-to-Any Multimodal Models
Source: arXiv - 2512.14098v1
Overview
Cornserve is a new serving system designed for the fast‑growing family of Any‑to‑Any multimodal models—models that can take any mix of text, images, video, or audio as input and produce any mix of those modalities as output. By letting developers describe a model’s computation graph once and then automatically generating an optimized deployment plan, Cornserve bridges the gap between the flexibility of these models and the practical constraints of production inference.
Key Contributions
- Unified graph description: A simple DSL for developers to declare heterogeneous components (encoders, LLMs, diffusion generators, etc.) in a single model graph.
- Automatic planning & disaggregation: The planner decides whether to keep the model monolithic or split it into smaller services, based on workload patterns and component characteristics.
- Heterogeneity‑aware runtime: A distributed execution engine that schedules mixed‑modality sub‑tasks, balances GPU/CPU resources, and pipelines data across components.
- Performance gains: Empirical results show up to 3.81× higher throughput and 5.79× lower tail latency compared with existing serving stacks.
- Generality: Works across a wide range of Any‑to‑Any models, from text‑to‑image diffusion pipelines to video‑question‑answering systems.
Methodology
- Model Graph Specification – Developers write a lightweight description (similar to a DAG) that lists each stage, e.g., “image encoder → multimodal transformer → diffusion decoder” (a minimal sketch follows this list).
- Planner Phase –
  - Profiling: Cornserve runs a quick offline benchmark to measure compute cost, memory footprint, and data transfer size for each component.
  - Cost Model: It combines these measurements with the expected request mix (e.g., 40 % text‑to‑image, 20 % audio‑to‑text) to estimate overall latency and resource usage (a toy estimate follows this section).
  - Optimization: Using a mixed‑integer linear program, the planner decides:
    - Which components stay together on the same device.
    - Which should be split into separate micro‑services.
    - How many replicas each service needs.
- Distributed Runtime – At inference time, a request router parses the incoming modality combination, looks up the pre‑computed plan, and dispatches sub‑tasks to the appropriate workers. The runtime handles:
  - Heterogeneous hardware (GPU for diffusion, CPU for lightweight encoders).
  - Pipelining to overlap compute and data movement.
  - Dynamic scaling when request patterns shift.
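To make the graph description concrete, here is a minimal sketch of what such a specification could look like. The `Component`, `ModelGraph`, and `connect` names are hypothetical illustrations rather than Cornserve's actual DSL, and the model identifiers are placeholders.

```python
# Hypothetical sketch of an Any-to-Any model graph description.
# Class and method names are illustrative, not Cornserve's real API.
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str       # e.g., "image_encoder"
    kind: str       # "encoder", "llm", "diffusion", ...
    model_id: str   # placeholder identifier for the weights to load


@dataclass
class ModelGraph:
    components: dict[str, Component] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)

    def add(self, component: Component) -> Component:
        self.components[component.name] = component
        return component

    def connect(self, src: Component, dst: Component) -> None:
        # Intermediate data (embeddings, tokens, latents) flows src -> dst.
        self.edges.append((src.name, dst.name))


# "image encoder -> multimodal transformer -> diffusion decoder"
graph = ModelGraph()
encoder = graph.add(Component("image_encoder", "encoder", "vit-image-encoder"))
backbone = graph.add(Component("backbone", "llm", "multimodal-transformer"))
decoder = graph.add(Component("diffusion_decoder", "diffusion", "diffusion-decoder"))
graph.connect(encoder, backbone)
graph.connect(backbone, decoder)
```

Declaring each stage as a node in this way is what lets the planner reason about components individually instead of treating the model as an opaque monolith.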
The whole pipeline is built on top of existing container orchestration (Kubernetes) and inference frameworks (TensorRT, PyTorch Serve), so developers can adopt it without rewriting model code.
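As a rough illustration of the estimation step (the paper's planner solves a mixed‑integer linear program; the simple arithmetic below is only a stand‑in), the following snippet combines made‑up per‑component profiles with an assumed request mix to see where GPU time actually goes:

```python
# Toy planner-style estimate. All numbers are invented for illustration;
# the real planner combines profiled costs in a mixed-integer linear program.

# Profiled GPU time per request, in milliseconds, for each component.
profile_ms = {
    "image_encoder": 8.0,
    "audio_encoder": 5.0,
    "backbone_llm": 60.0,
    "diffusion_decoder": 900.0,
}

# Which components each request type passes through.
paths = {
    "text_to_image": ["backbone_llm", "diffusion_decoder"],
    "audio_to_text": ["audio_encoder", "backbone_llm"],
    "video_qa": ["image_encoder", "backbone_llm"],
}

# Assumed request mix (fractions sum to 1.0).
mix = {"text_to_image": 0.4, "audio_to_text": 0.2, "video_qa": 0.4}

# Expected GPU time each component must absorb for an average request.
demand_ms = {name: 0.0 for name in profile_ms}
for request_type, fraction in mix.items():
    for component in paths[request_type]:
        demand_ms[component] += fraction * profile_ms[component]

for name, ms in sorted(demand_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name:18} {ms:7.1f} ms per average request")
```

Even this toy breakdown shows why the planner tends to give a heavy stage such as the diffusion decoder its own replicas: it dominates the expected GPU demand, while scaling the lightweight encoders along with it would waste capacity.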
Results & Findings
| Scenario | Baseline throughput (single service) | Cornserve throughput | Throughput speed‑up | Tail‑latency reduction |
|---|---|---|---|---|
| Text‑to‑Image (Stable Diffusion) | 45 req/s | 172 req/s | 3.81× | 5.79× |
| Audio‑to‑Text (Whisper + LLM) | 30 req/s | 92 req/s | 3.07× | 4.2× |
| Video‑Q&A (ViT encoder + LLM) | 12 req/s | 34 req/s | 2.83× | 3.9× |
Key takeaways
- Component‑level scaling (e.g., replicating only the diffusion decoder) yields far better resource utilization than scaling the whole monolithic model.
- Cross‑modality pipelining reduces idle GPU time, especially when a request mixes cheap encoders with expensive generators (see the sketch after this list).
- The planner’s decisions remain stable across typical workload fluctuations, and the runtime can re‑plan on the fly with minimal disruption.
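A minimal sketch of that pipelining idea, using asyncio coroutines as stand‑ins for Cornserve's runtime and invented stage times: while the expensive generator works on request i, the cheap encoder already prepares request i+1.

```python
# Minimal pipelining sketch with asyncio. Stage durations are invented and
# `encode` / `generate` stand in for real encoder and generator components.
import asyncio


async def encode(request_id: int) -> str:
    await asyncio.sleep(0.05)   # cheap encoder stage (~50 ms)
    return f"embedding-{request_id}"


async def generate(embedding: str) -> str:
    await asyncio.sleep(0.50)   # expensive generator stage (~500 ms)
    return f"output-for-{embedding}"


async def serve(num_requests: int) -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)

    async def encoder_stage() -> None:
        for i in range(num_requests):
            await queue.put(await encode(i))
        await queue.put(None)  # sentinel: no more requests

    async def generator_stage() -> None:
        while (embedding := await queue.get()) is not None:
            print(await generate(embedding))

    # Running both stages concurrently overlaps encoding of request i+1
    # with generation of request i, so the expensive stage never sits idle.
    await asyncio.gather(encoder_stage(), generator_stage())


asyncio.run(serve(4))
```

In the real system this overlap also hides data movement between components, not just encoder compute.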
Practical Implications
- Faster product features: Teams building AI‑powered editors, chat assistants, or content generation tools can serve richer multimodal interactions without over‑provisioning hardware.
- Cost savings: Because GPUs are allocated only to the heavy‑weight stages, cloud spend can drop dramatically, especially for bursty workloads where only a subset of components is needed.
- Simplified ops: Engineers no longer need to hand‑craft micro‑service boundaries for each new multimodal model; Cornserve’s planner does it automatically.
- Future‑proofing: As new Any‑to‑Any architectures (e.g., audio‑to‑video diffusion) appear, they can be plugged into the same serving stack with minimal code changes.
Limitations & Future Work
- Static profiling assumptions: The planner relies on offline benchmarks; sudden changes in input size (e.g., ultra‑high‑resolution images) may degrade the optimality of the plan.
- Hardware diversity: Current experiments focus on GPU‑centric clusters; extending the runtime to heterogeneous edge devices (TPUs, NPUs) is left for later work.
- Model‑specific optimizations: Some models benefit from custom kernels or quantization that Cornserve does not yet expose automatically.
- Dynamic workload adaptation: While re‑planning is supported, the latency of re‑optimization could be improved for ultra‑low‑latency services.
Overall, Cornserve demonstrates that a systematic, graph‑aware approach to serving can unlock the performance potential of today’s most flexible multimodal AI systems, making them viable for real‑world products.
Authors
- Jeff J. Ma
- Jae-Won Chung
- Jisang Ahn
- Yizhuo Liang
- Akshay Jajoo
- Myungjin Lee
- Mosharaf Chowdhury
Paper Information
- arXiv ID: 2512.14098v1
- Categories: cs.LG, cs.DC
- Published: December 16, 2025