[Paper] Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs

Published: May 5, 2026 at 07:25 PM EDT
Source: arXiv - 2605.04357v1

Overview

The paper presents Coral, a system that lets cloud operators serve many large language models (LLMs) at once while exploiting the diverse mix of GPUs that cloud providers expose (older generations, mid‑tier parts, and the latest accelerators). By jointly deciding where each model replica runs and how it is served, Coral cuts the cost of multi‑LLM inference by up to 2.8× and, when GPU resources are scarce, raises usable throughput (goodput) by up to 2.39×.

Key Contributions

  • Heterogeneity‑aware scheduling: A unified optimizer that simultaneously selects GPU types and placement for every model replica, rather than treating each model or hardware class in isolation.
  • Two‑stage lossless decomposition: The original mixed‑integer problem is split into a fast offline “capacity planning” phase and a lightweight online “allocation” phase, preserving optimality while shrinking solve time from hours to seconds.
  • Adaptive runtime engine: Coral continuously monitors demand and resource availability, re‑optimizing on‑the‑fly without disrupting in‑flight requests.
  • Comprehensive evaluation: Experiments on six popular LLMs (e.g., Llama‑2, Falcon) across 20 distinct GPU configurations demonstrate up to 2.79× cost reduction and 2.39× goodput improvement versus strong baselines.
  • Open‑source prototype: The authors release a prototype implementation that can be plugged into existing inference serving stacks (e.g., Triton, vLLM).

Methodology

  1. Model & Hardware Profiling – Each LLM is benchmarked on every GPU type to build a performance‑per‑dollar matrix (tokens‑per‑second vs. cost).
  2. Joint Optimization Formulation – The serving problem is expressed as a mixed‑integer program that decides:
    • How many replicas of each model to run,
    • Which GPU class each replica should occupy,
    • The request routing policy that maps incoming queries to replicas.
  3. Two‑Stage Decomposition
    • Stage 1 (Offline): Compute a Pareto frontier of feasible replica‑count vectors that satisfy any possible demand pattern. This step is done once per deployment and takes minutes.
    • Stage 2 (Online): Given the current demand snapshot, pick the best point on the pre‑computed frontier and instantly generate the concrete placement and routing plan.
  4. Adaptive Loop – A lightweight controller watches request rates and GPU health; when a significant shift is detected, it triggers Stage 2 re‑optimization.
  5. Implementation – Coral sits atop a container‑orchestrated inference service, using Kubernetes custom resources to spin up/tear down GPU pods as dictated by the optimizer.
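
To make the two‑stage idea in steps 2 and 3 concrete, here is a toy sketch in Python. It is not Coral's MILP or its actual decomposition: it enumerates every plan by brute force, assumes each replica occupies exactly one GPU, and ignores routing. All model names, GPU classes, prices, and throughput figures are invented for illustration.

```python
from itertools import product

# All names and numbers below are hypothetical, for illustration only.
THROUGHPUT = {  # tokens/s of one replica of a model on one GPU of a class
    ("large-llm", "new-gpu"): 600.0,   # assume large-llm does not fit on old-gpu
    ("small-llm", "new-gpu"): 1500.0,
    ("small-llm", "old-gpu"): 400.0,
}
PRICE = {"old-gpu": 0.5, "new-gpu": 3.0}   # $/hour per GPU
AVAILABLE = {"old-gpu": 4, "new-gpu": 2}   # GPUs on offer per class
MODELS = ("large-llm", "small-llm")


def enumerate_plans():
    """Yield (cost, capacity, counts) for every plan that fits the GPU supply."""
    pairs = sorted(THROUGHPUT)
    ranges = [range(AVAILABLE[gpu] + 1) for _, gpu in pairs]
    for counts in product(*ranges):
        used = {gpu: 0 for gpu in AVAILABLE}
        for (_, gpu), n in zip(pairs, counts):
            used[gpu] += n
        if any(used[gpu] > AVAILABLE[gpu] for gpu in AVAILABLE):
            continue  # plan needs more GPUs of some class than exist
        cost = sum(n * PRICE[gpu] for (_, gpu), n in zip(pairs, counts))
        capacity = tuple(
            sum(n * THROUGHPUT[(m, g)] for (m, g), n in zip(pairs, counts) if m == model)
            for model in MODELS
        )
        yield cost, capacity, dict(zip(pairs, counts))


def dominates(a, b):
    """Plan a dominates b if it is no costlier, serves at least as much, and differs."""
    (cost_a, cap_a, _), (cost_b, cap_b, _) = a, b
    no_worse = cost_a <= cost_b and all(x >= y for x, y in zip(cap_a, cap_b))
    return no_worse and (cost_a, cap_a) != (cost_b, cap_b)


def offline_frontier():
    """Stage 1 (sketch): keep only the non-dominated plans."""
    plans = list(enumerate_plans())
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


def online_pick(frontier, demand):
    """Stage 2 (sketch): cheapest frontier plan whose capacity covers the demand."""
    feasible = [
        p for p in frontier
        if all(cap >= demand[model] for cap, model in zip(p[1], MODELS))
    ]
    return min(feasible, key=lambda p: p[0]) if feasible else None
```

Discarding dominated plans offline never changes the online answer, because a dominated plan is never cheaper than the plan that dominates it while covering the same demand. That is the sense in which a decomposition like this can be lossless. With the made‑up numbers above, a demand of 500 tokens/s for large-llm and 1,000 tokens/s for small-llm is served most cheaply by one new-gpu replica of large-llm plus three old-gpu replicas of small-llm.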

Results & Findings

Metric | Baseline (static best‑fit) | Coral
Cost per 1 M tokens | $0.48 | $0.17 (2.79× cheaper)
Goodput (tokens/s) under 30 % GPU headroom | 1.2 M | 2.9 M (2.39× higher)
Optimization latency | ~2 h (full MILP) | ≈ 15 s (online)
Scalability | Handles up to 4 models | Handles 6+ models, 20 GPU types without degradation
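
For readers who want to relate the cost‑per‑token metric to raw profiling data, the conversion is simple arithmetic. The helper below is a generic sketch, not code from the paper, and its name and parameters are hypothetical.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Convert an hourly GPU price and a sustained throughput into $ per 1 M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000
```

As an example with invented numbers, a GPU priced at $3.60 per hour that sustains 2,000 tokens/s works out to about $0.50 per 1 M tokens.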

Key takeaways:

  • Heterogeneous GPUs are not a liability – older GPUs can still be profitably used when the optimizer matches them to models that are less memory‑intensive.
  • Joint planning beats per‑model heuristics – treating each model independently leads to up to 40 % higher cost because of fragmented GPU utilization.
  • Fast re‑optimization enables elasticity – Coral reacts to demand spikes within seconds, keeping latency SLAs intact.
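
Reacting within seconds presupposes some rule for deciding that demand has moved enough to justify re‑planning. This summary does not say how Coral detects a significant shift, so the following is only a minimal sketch that assumes a relative‑change threshold; the function name and the 20 % default are hypothetical.

```python
def demand_shifted(planned, observed, rel_threshold=0.2):
    """Return True if any model's observed request rate deviates from the
    rate the current plan was built for by more than rel_threshold."""
    for model in set(planned) | set(observed):
        planned_rate = planned.get(model, 0.0)
        observed_rate = observed.get(model, 0.0)
        if planned_rate == 0.0:
            # A model with no planned capacity that now receives traffic
            # always counts as a shift.
            if observed_rate > 0.0:
                return True
            continue
        if abs(observed_rate - planned_rate) / planned_rate > rel_threshold:
            return True
    return False
```

A controller could call this on each monitoring tick and invoke the online stage only when it returns True, which avoids re‑planning on ordinary noise.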

Practical Implications

  • Cloud‑native inference platforms (e.g., AWS SageMaker, Azure ML, GCP Vertex AI) can integrate Coral’s scheduler to automatically lower operating expenses without manual GPU selection.
  • Start‑ups and SaaS providers that need to expose multiple LLM APIs (chat, summarization, code generation) can run them on cheaper, mixed‑generation GPU fleets, freeing budget for product development.
  • Edge‑to‑cloud hybrid deployments can adopt the same principle: allocate newer, high‑throughput GPUs to latency‑critical models, and push batch‑oriented models to older hardware.
  • DevOps tooling – The two‑stage decomposition can be wrapped into a CI/CD pipeline that re‑generates the offline frontier whenever a new model version or GPU type is added, keeping the system future‑proof.
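
One way a pipeline could decide when to regenerate the offline frontier is to fingerprint the profiling inputs and rebuild only when the fingerprint changes. This is a sketch of that idea under assumed data shapes, not something the paper describes; the nested‑dict layout and the function names are hypothetical.

```python
import hashlib
import json


def profile_fingerprint(profile: dict) -> str:
    """Hash a profiling matrix of the assumed form {model: {gpu_type: tokens_per_second}}."""
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_rebuild(profile: dict, cached_fingerprint: str) -> bool:
    """True when the profiling inputs differ from those used for the cached frontier."""
    return profile_fingerprint(profile) != cached_fingerprint
```

Adding a model version or a GPU type changes the serialized matrix and therefore the hash, so the pipeline step reduces to comparing two strings.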

Limitations & Future Work

  • Static profiling assumption: Coral relies on accurate per‑GPU performance numbers; sudden driver updates or model quantization changes could invalidate the matrix and require re‑profiling.
  • Limited to token‑level throughput metrics: The current optimizer does not directly model tail‑latency guarantees, which real‑time chat use cases may need.
  • GPU memory fragmentation: While the system handles heterogeneous capacities, it does not yet support dynamic memory partitioning (e.g., model offloading) that could further increase packing density.
  • Future directions suggested by the authors include extending the framework to multi‑node GPU clusters, incorporating CPU‑offload strategies, and adding SLA‑aware latency constraints to the optimization model.

Authors

  • Yixuan Mei
  • Zikun Li
  • Zixuan Chen
  • Shiqi Pan
  • Mengdi Wu
  • Xupeng Miao
  • Zhihao Jia
  • K. V. Rashmi

Paper Information

  • arXiv ID: 2605.04357v1
  • Categories: cs.DC, cs.AI, cs.CL, cs.LG
  • Published: May 5, 2026