[Paper] Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

Published: May 5, 2026 at 07:25 PM EDT

Source: arXiv - 2605.04357v1

Overview

The paper presents Coral, a system that lets cloud operators serve many large language models (LLMs) at once while exploiting the diverse mix of GPUs that cloud providers expose (older generations, mid‑tier parts, and the latest accelerators). By jointly deciding where each model replica runs and how it is served, Coral cuts the cost of multi‑LLM inference by up to 2.79× and raises usable throughput (goodput) by up to 2.39× when GPU resources are scarce.

Key Contributions

Heterogeneity‑aware scheduling: A unified optimizer that simultaneously selects GPU types and placement for every model replica, rather than treating each model or hardware class in isolation.
Two‑stage lossless decomposition: The original mixed‑integer problem is split into a fast offline “capacity planning” phase and a lightweight online “allocation” phase, preserving optimality while shrinking solve time from hours to seconds.
Adaptive runtime engine: Coral continuously monitors demand and resource availability, re‑optimizing on‑the‑fly without disrupting in‑flight requests.
Comprehensive evaluation: Experiments on six popular LLMs (e.g., Llama‑2, Falcon) across 20 distinct GPU configurations demonstrate up to 2.79× cost reduction and 2.39× goodput improvement versus strong baselines.
Open‑source prototype: The authors release a prototype implementation that can be plugged into existing inference serving stacks (e.g., Triton, vLLM).

Methodology

Model & Hardware Profiling – Each LLM is benchmarked on every GPU type to build a performance‑per‑dollar matrix (tokens‑per‑second vs. cost).
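As a rough illustration of the profiling step, the sketch below turns measured throughput and an hourly price into a tokens-per-dollar matrix. The model names, GPU names, throughput figures, and prices are made-up placeholders, not values from the paper, and the paper's actual profiling procedure may record more than this.

```python
# Hypothetical profiling data: every name and number here is illustrative only.
THROUGHPUT_TOK_PER_S = {
    "model-small": {"gpu-old": 900.0, "gpu-new": 2400.0},
    "model-large": {"gpu-old": 150.0, "gpu-new": 1200.0},
}
PRICE_USD_PER_HOUR = {"gpu-old": 0.60, "gpu-new": 3.00}


def tokens_per_dollar(throughput, price_per_hour):
    """Return {model: {gpu: tokens served per dollar of GPU time}}."""
    matrix = {}
    for model, per_gpu in throughput.items():
        matrix[model] = {
            # tokens/s * 3600 s/h divided by $/h gives tokens per dollar
            gpu: tok_s * 3600.0 / price_per_hour[gpu]
            for gpu, tok_s in per_gpu.items()
        }
    return matrix


def best_gpu(matrix, model):
    """GPU class with the highest tokens-per-dollar for one model."""
    return max(matrix[model], key=matrix[model].get)


MATRIX = tokens_per_dollar(THROUGHPUT_TOK_PER_S, PRICE_USD_PER_HOUR)
```

With these placeholder numbers the small model is most cost-efficient on the older GPU and the large model on the newer one, which is the kind of mismatch a heterogeneity-aware scheduler can exploit.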
Joint Optimization Formulation – The serving problem is expressed as a mixed‑integer program that decides:
- How many replicas of each model to run,
- Which GPU class each replica should occupy,
- The request routing policy that maps incoming queries to replicas.
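To make the three decisions concrete, here is a minimal sketch that replaces the mixed-integer solver with exhaustive search over a toy instance. It is not the paper's formulation: the inputs, the cost model (price per GPU, one GPU per replica), and the capacity-proportional routing rule are all assumptions chosen to keep the example small.

```python
from itertools import product

# Hypothetical inputs: names and numbers are illustrative, not from the paper.
MODELS = ["model-a", "model-b"]
GPUS = ["gpu-old", "gpu-new"]
THROUGHPUT = {  # tokens/s delivered by one replica of a model on a GPU class
    ("model-a", "gpu-old"): 400, ("model-a", "gpu-new"): 1000,
    ("model-b", "gpu-old"): 100, ("model-b", "gpu-new"): 800,
}
PRICE = {"gpu-old": 1, "gpu-new": 4}       # cost units per GPU
AVAILABLE = {"gpu-old": 3, "gpu-new": 2}   # GPUs obtainable per class
DEMAND = {"model-a": 800, "model-b": 700}  # required tokens/s per model


def solve(models, gpus, throughput, price, available, demand):
    """Exhaustively search replica counts; return (cost, plan) or None."""
    keys = [(m, g) for m in models for g in gpus]
    ranges = [range(available[g] + 1) for _, g in keys]
    best = None
    for counts in product(*ranges):
        plan = dict(zip(keys, counts))
        # Capacity constraint: replicas of all models share each GPU class.
        if any(sum(plan[m, g] for m in models) > available[g] for g in gpus):
            continue
        # Demand constraint: each model needs enough aggregate throughput.
        if any(sum(plan[m, g] * throughput[m, g] for g in gpus) < demand[m]
               for m in models):
            continue
        cost = sum(n * price[g] for (_, g), n in plan.items())
        if best is None or cost < best[0]:
            best = (cost, plan)
    return best


def routing_weights(plan, throughput, model, gpus):
    """Split a model's traffic across GPU classes in proportion to capacity."""
    cap = {g: plan[model, g] * throughput[model, g] for g in gpus}
    total = sum(cap.values())
    return {g: c / total for g, c in cap.items()}
```

On this toy instance the cheapest plan puts two replicas of model-a on the older GPUs and one replica of model-b on a newer GPU. Exhaustive search only works at this scale; a real deployment would hand the same constraints to a MILP solver.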
Two‑Stage Decomposition
- Stage 1 (Offline): Compute a Pareto frontier of feasible replica‑count vectors that satisfy any possible demand pattern. This step is done once per deployment and takes minutes.
- Stage 2 (Online): Given the current demand snapshot, pick the best point on the pre‑computed frontier and instantly generate the concrete placement and routing plan.
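The offline/online split can be illustrated with a toy version, under the assumption that the frontier trades total cost against per-model throughput capacity; the paper's exact frontier definition is not given in this summary. Stage 1 enumerates replica-count vectors and discards dominated ones; Stage 2 picks the cheapest surviving point that covers the current demand. All inputs are hypothetical.

```python
from itertools import product

# Hypothetical inputs: names and numbers are illustrative, not from the paper.
KEYS = [("model-a", "gpu-old"), ("model-a", "gpu-new"),
        ("model-b", "gpu-old"), ("model-b", "gpu-new")]
MODELS = ["model-a", "model-b"]
THROUGHPUT = {KEYS[0]: 400, KEYS[1]: 1000, KEYS[2]: 100, KEYS[3]: 800}
PRICE = {"gpu-old": 1, "gpu-new": 4}
AVAILABLE = {"gpu-old": 3, "gpu-new": 2}


def enumerate_points():
    """All (cost, capacity, counts) triples that fit the available GPUs."""
    points = []
    for counts in product(*(range(AVAILABLE[g] + 1) for _, g in KEYS)):
        used = {g: 0 for g in AVAILABLE}
        for (_, g), n in zip(KEYS, counts):
            used[g] += n
        if any(used[g] > AVAILABLE[g] for g in AVAILABLE):
            continue
        cost = sum(n * PRICE[g] for (_, g), n in zip(KEYS, counts))
        capacity = tuple(
            sum(n * THROUGHPUT[k] for k, n in zip(KEYS, counts) if k[0] == m)
            for m in MODELS)
        points.append((cost, capacity, counts))
    return points


def dominates(q, p):
    """q costs no more and serves no less than p, and differs in one of them."""
    return (q[0] <= p[0] and all(a >= b for a, b in zip(q[1], p[1]))
            and (q[0], q[1]) != (p[0], p[1]))


def offline_frontier(points):
    """Stage 1: keep only non-dominated (cost, capacity) points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def online_pick(points, demand):
    """Stage 2: cheapest point whose capacity covers the demand snapshot."""
    feasible = [p for p in points
                if all(c >= d for c, d in zip(p[1], demand))]
    return min(feasible, key=lambda p: p[0]) if feasible else None
```

In this toy setting the split loses nothing: any dominated point can be swapped for a frontier point that is no more expensive and serves at least as much, so searching the frontier yields the same minimum cost as searching every candidate.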
Adaptive Loop – A lightweight controller watches request rates and GPU health; when a significant shift is detected, it triggers Stage 2 re‑optimization.
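One simple way to decide what counts as a "significant shift" is a relative-change threshold on per-model request rates, sketched below. The class name, the 20% default threshold, and the callback interface are assumptions for illustration; the summary does not state the paper's trigger criteria, and GPU health monitoring is left out.

```python
class DemandWatcher:
    """Triggers re-planning when any model's rate drifts past a threshold."""

    def __init__(self, replan, threshold=0.2):
        self.replan = replan          # callback standing in for Stage 2
        self.threshold = threshold    # relative change that counts as a shift
        self.planned_for = None       # demand snapshot used by the last plan

    def observe(self, rates):
        """Feed the latest {model: requests/s}; return True if re-planned."""
        if self.planned_for is None or self._shifted(rates):
            self.planned_for = dict(rates)
            self.replan(self.planned_for)
            return True
        return False

    def _shifted(self, rates):
        # Compare against the snapshot of the last plan, not the previous
        # sample, so slow drift still accumulates into a trigger.
        for model in set(rates) | set(self.planned_for):
            old = self.planned_for.get(model, 0.0)
            new = rates.get(model, 0.0)
            base = max(old, 1e-9)  # avoid division by zero for new models
            if abs(new - old) / base > self.threshold:
                return True
        return False
```

A caller would construct `DemandWatcher(replan=run_stage_two)` (where `run_stage_two` is a hypothetical function) and call `observe` on each metrics scrape; small fluctuations are ignored, so Stage 2 runs only when the plan is likely stale.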
Implementation – Coral sits atop a container‑orchestrated inference service, using Kubernetes custom resources to spin up/tear down GPU pods as dictated by the optimizer.

Results & Findings

Metric	Baseline (static best‑fit)	Coral
Cost per 1 M tokens	$0.48	$0.17 (2.79× cheaper)
Goodput (tokens/s) under 30 % GPU headroom	1.2 M	2.9 M (2.39× higher)
Optimization latency	~2 h (full MILP)	≈ 15 s (online)
Scalability	Handles up to 4 models	Handles 6+ models, 20 GPU types without degradation
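As a quick arithmetic check, the ratios can be recomputed from the rounded table entries. They come out slightly higher than the quoted 2.79× and 2.39×, which may simply mean the quoted factors were derived from unrounded measurements.

```python
# Values copied from the table above.
baseline_cost, coral_cost = 0.48, 0.17          # USD per 1 M tokens
baseline_goodput, coral_goodput = 1.2e6, 2.9e6  # tokens/s

cost_ratio = baseline_cost / coral_cost            # about 2.82
goodput_ratio = coral_goodput / baseline_goodput   # about 2.42
```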

Key takeaways:

Heterogeneous GPUs are not a liability – older GPUs can still be profitably used when the optimizer matches them to models that are less memory‑intensive.
Joint planning beats per‑model heuristics – treating each model independently leads to up to 40 % higher cost because of fragmented GPU utilization.
Fast re‑optimization enables elasticity – Coral re‑plans within seconds of a demand spike (the online stage takes about 15 s), which helps keep latency within SLA targets.

Practical Implications

Cloud‑native inference platforms (e.g., AWS SageMaker, Azure ML, GCP Vertex AI) can integrate Coral’s scheduler to automatically lower operating expenses without manual GPU selection.
Start‑ups and SaaS providers that need to expose multiple LLM APIs (chat, summarization, code generation) can run them on cheaper, mixed‑generation GPU fleets, freeing budget for product development.
Edge‑to‑cloud hybrid deployments can adopt the same principle: allocate newer, high‑throughput GPUs to latency‑critical models, and push batch‑oriented models to older hardware.
DevOps tooling – The two‑stage decomposition can be wrapped into a CI/CD pipeline that re‑generates the offline frontier whenever a new model version or GPU type is added, keeping the system future‑proof.

Limitations & Future Work

Static profiling assumption: Coral relies on accurate per‑GPU performance numbers; sudden driver updates or model quantization changes could invalidate the matrix and require re‑profiling.
Limited to token‑level throughput metrics: The current optimizer does not directly model tail‑latency guarantees, which may be needed for latency‑critical, real‑time chat use cases.
GPU memory fragmentation: While the system handles heterogeneous capacities, it does not yet support dynamic memory partitioning (e.g., model offloading) that could further increase packing density.
Future directions suggested by the authors include extending the framework to multi‑node GPU clusters, incorporating CPU‑offload strategies, and adding SLA‑aware latency constraints to the optimization model.

Authors

Yixuan Mei
Zikun Li
Zixuan Chen
Shiqi Pan
Mengdi Wu
Xupeng Miao
Zhihao Jia
K. V. Rashmi

Paper Information

arXiv ID: 2605.04357v1
Categories: cs.DC, cs.AI, cs.CL, cs.LG
Published: May 5, 2026
