[Paper] BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization

Published: February 11, 2026 at 05:44 AM EST

Source: arXiv - 2602.10729v1

Overview

Deploying large language models (LLMs) at scale is expensive, and the industry is racing to squeeze more performance out of every dollar spent on GPU hardware. BOute tackles this head‑on by jointly deciding which model should answer a given query and where each model should run on a mix of GPUs. Using a multi‑objective Bayesian optimizer, the system balances latency, answer quality, and cost, delivering up to 157 % better cost‑efficiency than existing serving stacks.

Key Contributions

  • Unified algorithm‑system co‑design: Simultaneously optimizes heterogeneous query routing and heterogeneous GPU placement, something prior work treated separately.
  • Multi‑objective Bayesian Optimization (MOBO) engine: Formulates cost, latency, and quality as competing objectives and searches the massive configuration space efficiently.
  • Quality‑aware scheduling: Guarantees a user‑specified answer quality (e.g., BLEU, ROUGE, or model‑specific confidence) while still minimizing spend.
  • Practical deployment framework: Works with off‑the‑shelf LLMs (e.g., Llama‑2, Mistral) and heterogeneous GPU clusters (A100, RTX 4090, T4, etc.) without requiring custom kernels.
  • Empirical validation: Shows up to 157 % improvement over state‑of‑the‑art serving systems and cost reductions of 15‑61 % for the same latency/quality targets.

Methodology

  1. Model & GPU Heterogeneity Catalog – BOute first builds a lightweight performance profile for each LLM on each GPU type (throughput, latency, memory footprint, and a proxy for answer quality).
  2. Query Characterization – Incoming requests are classified by estimated difficulty (e.g., token length, required reasoning depth) using a fast classifier.
  3. Decision Variables
    • Routing: which model (low‑cost vs. high‑cost) handles each query class.
    • Placement: how many replicas of each model run on each GPU type, and the parallelism strategy (tensor‑parallel, pipeline‑parallel).
  4. Multi‑Objective Bayesian Optimization – The optimizer treats cost, latency, and quality as separate objectives. It iteratively proposes new routing‑placement configurations, evaluates them on a small set of real queries, and updates a surrogate model (Gaussian Process) to predict performance across the whole space.
  5. Constraint Handling – Quality constraints are enforced as hard limits; latency is treated as a soft objective that the optimizer tries to keep under a user‑defined SLA.
  6. Online Adaptation – BOute periodically re‑runs the MOBO loop to adapt to workload shifts (e.g., sudden surge of complex queries) without full retraining.
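
Steps 3–5 above can be sketched in miniature. The snippet below is an illustrative toy, not BOute's implementation: it uses random search over routing configurations in place of the Gaussian Process surrogate, treats quality as a hard constraint, and keeps the Pareto front over cost and latency. All profile numbers and names are invented for the example.

```python
import random

# Hypothetical per-(model, GPU) profiles: cost per 1M tokens, p99 latency (ms),
# and a proxy quality score. Values are illustrative, not from the paper.
PROFILES = {
    ("small-llm", "t4"):   {"cost": 0.20, "latency": 900,  "quality": 0.62},
    ("small-llm", "a100"): {"cost": 0.45, "latency": 400,  "quality": 0.62},
    ("large-llm", "t4"):   {"cost": 0.80, "latency": 1500, "quality": 0.74},
    ("large-llm", "a100"): {"cost": 0.95, "latency": 550,  "quality": 0.74},
}

def evaluate(config, simple_frac):
    """Blend per-class routing choices into workload-level metrics."""
    s = PROFILES[config["simple"]]
    c = PROFILES[config["complex"]]
    w = simple_frac
    return {
        "cost": w * s["cost"] + (1 - w) * c["cost"],
        "latency": max(s["latency"], c["latency"]),  # p99 set by the slowest path
        "quality": w * s["quality"] + (1 - w) * c["quality"],
    }

def pareto_front(points):
    """Keep points that no other point strictly dominates on (cost, latency)."""
    return [p for p in points
            if not any(q["cost"] <= p["cost"] and q["latency"] <= p["latency"]
                       and (q["cost"] < p["cost"] or q["latency"] < p["latency"])
                       for q in points)]

def search(min_quality=0.65, simple_frac=0.6, n_trials=50, seed=0):
    """Random-search stand-in for the MOBO loop: propose, evaluate, filter."""
    rng = random.Random(seed)
    options = list(PROFILES)
    feasible = []
    for _ in range(n_trials):
        config = {"simple": rng.choice(options), "complex": rng.choice(options)}
        m = evaluate(config, simple_frac)
        if m["quality"] >= min_quality:  # hard quality constraint (step 5)
            feasible.append({**m, "config": config})
    return pareto_front(feasible)

if __name__ == "__main__":
    for p in search():
        print(p["config"], round(p["cost"], 3), p["latency"])
```

A real MOBO engine would replace the random proposals with acquisition-function-driven candidates scored by the GP surrogate, but the constraint handling and Pareto filtering follow the same shape.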

Results & Findings

| Metric | Baseline (e.g., TGI, vLLM) | BOute | Improvement |
| --- | --- | --- | --- |
| Cost per 1 M tokens | $0.92 | $0.36 | 61 % lower |
| 99th‑percentile latency (ms) | 820 | 560 | 32 % faster |
| Average answer quality (BLEU) | 0.71 | 0.71 | constraint met |
| Overall cost‑efficiency score | 1.0× | 2.57× | 157 % higher |
  • Cost‑efficiency boost: Up to 157 % higher than the best competing system when both operate under the same budget and quality target.
  • Budget‑saving mode: Holding latency and quality constant, BOute cuts spend by an average of 38 % (range 15‑61 %).
  • Robustness to workload mix: The optimizer gracefully shifts more queries to cheaper models when the proportion of “simple” requests rises, and re‑allocates GPU resources to high‑end cards when complex queries dominate.
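
As a quick sanity check on the headline numbers, the 2.57× cost‑efficiency score corresponds to the reported 157 % improvement, and the drop from $0.92 to $0.36 per 1 M tokens is the reported 61 % reduction:

```python
# Cost reduction: $0.92 -> $0.36 per 1M tokens
cost_reduction = 1 - 0.36 / 0.92
print(f"{cost_reduction:.0%}")   # 61%

# Efficiency score: 1.0x baseline vs. 2.57x for BOute
improvement = (2.57 - 1) * 100
print(f"{improvement:.0f}%")     # 157%
```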

Practical Implications

  • For SaaS AI providers: Reduce cloud GPU bills dramatically without sacrificing SLA guarantees—especially valuable for multi‑tenant platforms that serve both chat and code‑generation workloads.
  • For on‑premise data centers: Leverage existing heterogeneous GPU fleets (mix of older V100s, newer A100s, consumer‑grade RTX cards) instead of over‑provisioning expensive homogeneous clusters.
  • Developer tooling: BOute’s profiling and optimizer can be wrapped as a library (Python API) that plugs into existing inference servers (e.g., vLLM, TGI), letting engineers experiment with cost‑aware routing policies without rewriting model code.
  • Dynamic pricing models: Cloud providers could expose “budget‑aware” endpoints that automatically invoke BOute‑style scheduling, giving customers transparent cost‑vs‑quality trade‑offs.
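
To make the "library with a Python API" idea concrete, here is a hypothetical facade for cost‑aware routing. Every class, method, and parameter name below is illustrative; BOute's actual interface (if any is released) may look quite different.

```python
from dataclasses import dataclass, field

@dataclass
class CostAwareRouter:
    """Hypothetical wrapper: pick the cheapest profiled route per query class."""
    quality_floor: float   # hard constraint, e.g. minimum BLEU
    latency_sla_ms: int    # soft target, preferred but not mandatory
    routes: dict = field(default_factory=dict)

    def register(self, query_class, model, gpu, cost, latency_ms, quality):
        """Record a profiled (model, GPU) option for a query class."""
        self.routes.setdefault(query_class, []).append(
            {"model": model, "gpu": gpu, "cost": cost,
             "latency_ms": latency_ms, "quality": quality})

    def pick(self, query_class):
        """Cheapest quality-feasible option, preferring routes inside the SLA."""
        feasible = [r for r in self.routes[query_class]
                    if r["quality"] >= self.quality_floor]
        if not feasible:
            raise ValueError(f"no route meets quality floor {self.quality_floor}")
        in_sla = [r for r in feasible if r["latency_ms"] <= self.latency_sla_ms]
        return min(in_sla or feasible, key=lambda r: r["cost"])

router = CostAwareRouter(quality_floor=0.65, latency_sla_ms=800)
router.register("chat", "small-llm", "t4", cost=0.20, latency_ms=600, quality=0.66)
router.register("chat", "large-llm", "a100", cost=0.95, latency_ms=550, quality=0.74)
print(router.pick("chat")["model"])  # small-llm: cheapest route meeting both targets
```

A wrapper like this could sit in front of an existing inference server, translating the optimizer's routing decisions into per-request backend selection.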

Limitations & Future Work

  • Profiling overhead: The initial performance catalog requires running each model on each GPU type, which can be time‑consuming for very large model families.
  • Static quality proxy: BOute assumes a fixed quality threshold per model; future work could incorporate per‑query confidence estimation to allow finer‑grained quality control.
  • Scalability of the optimizer: While MOBO scales well for a handful of models and GPU types, extremely large heterogeneous fleets may need hierarchical or reinforcement‑learning‑based schedulers.
  • Security & isolation: The current prototype does not address multi‑tenant isolation (e.g., memory cgroup enforcement), an important consideration for production deployments.

Bottom line: BOute demonstrates that a principled, Bayesian‑driven co‑optimization of query routing and GPU placement can unlock substantial cost savings for LLM serving—making high‑quality AI services more affordable for developers and enterprises alike.

Authors

  • Youhe Jiang
  • Fangcheng Fu
  • Eiko Yoneki

Paper Information

  • arXiv ID: 2602.10729v1
  • Categories: cs.DC
  • Published: February 11, 2026