[Paper] BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization

Published: February 11, 2026 at 05:44 AM EST

Source: arXiv - 2602.10729v1

Overview

Deploying large language models (LLMs) at scale is expensive, and the industry is racing to squeeze more performance out of every dollar spent on GPU hardware. BOute tackles this head‑on by jointly deciding which model should answer a given query and where each model should run on a mix of GPUs. Using a multi‑objective Bayesian optimizer, the system balances latency, answer quality, and cost, delivering up to 157 % better cost‑efficiency than existing serving stacks.

Key Contributions

  • Unified algorithm‑system co‑design: Simultaneously optimizes heterogeneous query routing and heterogeneous GPU placement, something prior work treated separately.
  • Multi‑objective Bayesian Optimization (MOBO) engine: Formulates cost, latency, and quality as competing objectives and searches the massive configuration space efficiently.
  • Quality‑aware scheduling: Guarantees a user‑specified answer quality (e.g., BLEU, ROUGE, or model‑specific confidence) while still minimizing spend.
  • Practical deployment framework: Works with off‑the‑shelf LLMs (e.g., Llama‑2, Mistral) and heterogeneous GPU clusters (A100, RTX 4090, T4, etc.) without requiring custom kernels.
  • Empirical validation: Shows up to 157 % improvement over state‑of‑the‑art serving systems and cost reductions of 15‑61 % for the same latency/quality targets.

Methodology

  1. Model & GPU Heterogeneity Catalog – BOute first builds a lightweight performance profile for each LLM on each GPU type (throughput, latency, memory footprint, and a proxy for answer quality).
  2. Query Characterization – Incoming requests are classified by estimated difficulty (e.g., token length, required reasoning depth) using a fast classifier.
  3. Decision Variables
    • Routing: which model (low‑cost vs. high‑cost) handles each query class.
    • Placement: how many replicas of each model run on each GPU type, and the parallelism strategy (tensor‑parallel, pipeline‑parallel).
  4. Multi‑Objective Bayesian Optimization – The optimizer treats cost, latency, and quality as separate objectives. It iteratively proposes new routing‑placement configurations, evaluates them on a small set of real queries, and updates a surrogate model (Gaussian Process) to predict performance across the whole space.
  5. Constraint Handling – Quality constraints are enforced as hard limits; latency is treated as a soft objective that the optimizer tries to keep under a user‑defined SLA.
  6. Online Adaptation – BOute periodically re‑runs the MOBO loop to adapt to workload shifts (e.g., sudden surge of complex queries) without full retraining.
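
Steps 3–5 above can be sketched in miniature. The snippet below is an illustrative toy, not BOute's implementation: it uses random search over routing configurations in place of the Gaussian Process surrogate, treats quality as a hard constraint, and keeps the Pareto front over cost and latency. All profile numbers and names are invented for the example.

```python
import random

# Hypothetical per-(model, GPU) profiles: cost per 1M tokens, p99 latency (ms),
# and a proxy quality score. Values are illustrative, not from the paper.
PROFILES = {
    ("small-llm", "t4"):   {"cost": 0.20, "latency": 900,  "quality": 0.62},
    ("small-llm", "a100"): {"cost": 0.45, "latency": 400,  "quality": 0.62},
    ("large-llm", "t4"):   {"cost": 0.80, "latency": 1500, "quality": 0.74},
    ("large-llm", "a100"): {"cost": 0.95, "latency": 550,  "quality": 0.74},
}

def evaluate(config, simple_frac):
    """Blend per-class routing choices into workload-level metrics."""
    s = PROFILES[config["simple"]]
    c = PROFILES[config["complex"]]
    w = simple_frac
    return {
        "cost": w * s["cost"] + (1 - w) * c["cost"],
        "latency": max(s["latency"], c["latency"]),  # p99 set by the slowest path
        "quality": w * s["quality"] + (1 - w) * c["quality"],
    }

def pareto_front(points):
    """Keep points that no other point strictly dominates on (cost, latency)."""
    return [p for p in points
            if not any(q["cost"] <= p["cost"] and q["latency"] <= p["latency"]
                       and (q["cost"] < p["cost"] or q["latency"] < p["latency"])
                       for q in points)]

def search(min_quality=0.65, simple_frac=0.6, n_trials=50, seed=0):
    """Random-search stand-in for the MOBO loop: propose, evaluate, filter."""
    rng = random.Random(seed)
    options = list(PROFILES)
    feasible = []
    for _ in range(n_trials):
        config = {"simple": rng.choice(options), "complex": rng.choice(options)}
        m = evaluate(config, simple_frac)
        if m["quality"] >= min_quality:  # hard quality constraint (step 5)
            feasible.append({**m, "config": config})
    return pareto_front(feasible)

if __name__ == "__main__":
    for p in search():
        print(p["config"], round(p["cost"], 3), p["latency"])
```

A real MOBO engine would replace the random proposals with acquisition-function-driven candidates scored by the GP surrogate, but the constraint handling and Pareto filtering follow the same shape.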

Results & Findings

| Metric | Baseline (e.g., TGI, vLLM) | BOute | Improvement |
| --- | --- | --- | --- |
| Cost per 1 M tokens | $0.92 | $0.36 | 61 % lower |
| 99th‑percentile latency (ms) | 820 | 560 | 32 % faster |
| Average answer quality (BLEU) | 0.71 | 0.71 | constraint met |
| Overall cost‑efficiency score | 1.0× | 2.57× | 157 % higher |
  • Cost‑efficiency boost: Up to 157 % higher than the best competing system when both operate under the same budget and quality target.
  • Budget‑saving mode: Holding latency and quality constant, BOute cuts spend by an average of 38 % (range 15‑61 %).
  • Robustness to workload mix: The optimizer gracefully shifts more queries to cheaper models when the proportion of “simple” requests rises, and re‑allocates GPU resources to high‑end cards when complex queries dominate.
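
As a quick sanity check on the headline numbers, the 2.57× cost‑efficiency score corresponds to the reported 157 % improvement, and the drop from $0.92 to $0.36 per 1 M tokens is the reported 61 % reduction:

```python
# Cost reduction: $0.92 -> $0.36 per 1M tokens
cost_reduction = 1 - 0.36 / 0.92
print(f"{cost_reduction:.0%}")   # 61%

# Efficiency score: 1.0x baseline vs. 2.57x for BOute
improvement = (2.57 - 1) * 100
print(f"{improvement:.0f}%")     # 157%
```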

Practical Implications

  • For SaaS AI providers: Reduce cloud GPU bills dramatically without sacrificing SLA guarantees—especially valuable for multi‑tenant platforms that serve both chat and code‑generation workloads.
  • For on‑premise data centers: Leverage existing heterogeneous GPU fleets (mix of older V100s, newer A100s, consumer‑grade RTX cards) instead of over‑provisioning expensive homogeneous clusters.
  • Developer tooling: BOute’s profiling and optimizer can be wrapped as a library (Python API) that plugs into existing inference servers (e.g., vLLM, TGI), letting engineers experiment with cost‑aware routing policies without rewriting model code.
  • Dynamic pricing models: Cloud providers could expose “budget‑aware” endpoints that automatically invoke BOute‑style scheduling, giving customers transparent cost‑vs‑quality trade‑offs.
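
To make the "library with a Python API" idea concrete, here is a hypothetical facade for cost‑aware routing. Every class, method, and parameter name below is illustrative; BOute's actual interface (if any is released) may look quite different.

```python
from dataclasses import dataclass, field

@dataclass
class CostAwareRouter:
    """Hypothetical wrapper: pick the cheapest profiled route per query class."""
    quality_floor: float   # hard constraint, e.g. minimum BLEU
    latency_sla_ms: int    # soft target, preferred but not mandatory
    routes: dict = field(default_factory=dict)

    def register(self, query_class, model, gpu, cost, latency_ms, quality):
        """Record a profiled (model, GPU) option for a query class."""
        self.routes.setdefault(query_class, []).append(
            {"model": model, "gpu": gpu, "cost": cost,
             "latency_ms": latency_ms, "quality": quality})

    def pick(self, query_class):
        """Cheapest quality-feasible option, preferring routes inside the SLA."""
        feasible = [r for r in self.routes[query_class]
                    if r["quality"] >= self.quality_floor]
        if not feasible:
            raise ValueError(f"no route meets quality floor {self.quality_floor}")
        in_sla = [r for r in feasible if r["latency_ms"] <= self.latency_sla_ms]
        return min(in_sla or feasible, key=lambda r: r["cost"])

router = CostAwareRouter(quality_floor=0.65, latency_sla_ms=800)
router.register("chat", "small-llm", "t4", cost=0.20, latency_ms=600, quality=0.66)
router.register("chat", "large-llm", "a100", cost=0.95, latency_ms=550, quality=0.74)
print(router.pick("chat")["model"])  # small-llm: cheapest route meeting both targets
```

A wrapper like this could sit in front of an existing inference server, translating the optimizer's routing decisions into per-request backend selection.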

Limitations & Future Work

  • Profiling overhead: The initial performance catalog requires running each model on each GPU type, which can be time‑consuming for very large model families.
  • Static quality proxy: BOute assumes a fixed quality threshold per model; future work could incorporate per‑query confidence estimation to allow finer‑grained quality control.
  • Scalability of the optimizer: While MOBO scales well for a handful of models and GPU types, extremely large heterogeneous fleets may need hierarchical or reinforcement‑learning‑based schedulers.
  • Security & isolation: The current prototype does not address multi‑tenant isolation (e.g., memory cgroup enforcement), an important consideration for production deployments.

Bottom line: BOute demonstrates that a principled, Bayesian‑driven co‑optimization of query routing and GPU placement can unlock substantial cost savings for LLM serving—making high‑quality AI services more affordable for developers and enterprises alike.

Authors

  • Youhe Jiang
  • Fangcheng Fu
  • Eiko Yoneki

Paper Information

  • arXiv ID: 2602.10729v1
  • Categories: cs.DC
  • Published: February 11, 2026