[Paper] CALM: A Self-Adaptive Orchestration Approach for QoS-Aware Routing in Small Language Model based Systems

Published: February 3, 2026 at 10:20 AM EST
4 min read
Source: arXiv - 2602.03632v1

Overview

The paper presents CALM, a self‑adaptive orchestration layer that dynamically selects and routes requests to the most suitable small language model (SLM) from a fleet of specialized models. By continuously monitoring workload characteristics and QoS metrics (latency, energy, response quality), CALM can cut inference latency by ~40 % and halve energy use while keeping task performance on par with single‑model deployments.

Key Contributions

  • QoS‑aware multi‑model orchestration – Introduces a MAPE‑K (Monitor‑Analyze‑Plan‑Execute‑Knowledge) loop that decides, per request, which SLM should handle the query.
  • Dynamic caching & scheduling – A lightweight scheduler keeps the most promising SLMs resident in memory, reducing cold‑start overhead.
  • Empirical validation – Experiments on several domain‑specific benchmarks show up to 40 % latency reduction and 50 % energy savings versus the best single‑SLM baseline.
  • Open‑source reference implementation – The authors release a prototype that can be plugged into existing inference pipelines (e.g., Hugging Face Transformers, FastAPI).
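The dynamic caching idea above can be sketched as a small resident-set manager. This is a minimal illustration, assuming a simple least-recently-used eviction policy; the class and method names (`ResidentModelCache`, `loader`) are hypothetical and the paper's actual scheduler policy may differ.

```python
from collections import OrderedDict

class ResidentModelCache:
    """Keeps the most recently routed SLMs loaded in memory; evicts the
    least recently used model when capacity is exceeded (illustrative
    LRU policy, not necessarily the paper's)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader            # callable: model name -> loaded model
        self._resident = OrderedDict()  # insertion order tracks recency
        self.hits = 0
        self.misses = 0

    def get(self, name):
        if name in self._resident:
            self.hits += 1
            self._resident.move_to_end(name)    # mark as most recently used
            return self._resident[name]
        self.misses += 1
        model = self.loader(name)               # cold start: load on demand
        self._resident[name] = model
        if len(self._resident) > self.capacity:
            self._resident.popitem(last=False)  # evict least recently used
        return model
```

With a capacity of two, routing `tiny → medium → tiny → large` yields one hit and evicts `medium`, which is the kind of behavior behind the 68 % cache-hit rate reported below.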

Methodology

  1. Monitoring – Every incoming user query is logged together with runtime signals (token length, request rate, hardware utilization).
  2. Analysis – A lightweight predictor estimates the expected latency, energy cost, and quality for each SLM in the fleet given the current context.
  3. Planning – The system ranks SLMs based on a weighted QoS score (configurable by the operator).
  4. Execution – The top‑ranked model receives the request; if the model is not already loaded, CALM triggers a pre‑fetch based on the scheduler’s cache policy.
  5. Knowledge base – Historical performance data continuously updates the predictor, allowing the loop to adapt to model drift or hardware changes.

The orchestration logic is implemented as a thin middleware layer that can sit in front of any inference server, requiring only standard REST/gRPC hooks.
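The Analyze and Plan steps can be sketched as a weighted scoring function over per-model QoS estimates. The weights, normalization bounds, and the `SLMProfile` fields are illustrative assumptions (the paper only states that the score is operator-configurable), and the predictor that produces the estimates is elided here.

```python
from dataclasses import dataclass

@dataclass
class SLMProfile:
    """Per-model estimates produced by the Analyze-step predictor."""
    name: str
    est_latency_ms: float  # predicted end-to-end latency
    est_energy_j: float    # predicted energy per query
    est_quality: float     # predicted task quality in [0, 1]

def qos_score(p, w_latency=0.4, w_energy=0.2, w_quality=0.4,
              max_latency_ms=500.0, max_energy_j=5.0):
    """Plan step: weighted QoS score (higher is better). Latency and
    energy are normalized against operator-set budgets, then inverted
    so that cheaper/faster models score higher."""
    latency_term = 1.0 - min(p.est_latency_ms / max_latency_ms, 1.0)
    energy_term = 1.0 - min(p.est_energy_j / max_energy_j, 1.0)
    return (w_latency * latency_term
            + w_energy * energy_term
            + w_quality * p.est_quality)

def plan(fleet):
    """Rank the fleet; the top-ranked model receives the request."""
    return max(fleet, key=qos_score)
```

With the default weights a latency-sensitive query routes to the lightweight model, while raising `w_quality` tips the same ranking toward the larger, more capable one — the routing behavior the takeaways below describe.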

Results & Findings

| Metric | Single‑SLM Baseline | CALM (multi‑SLM) |
| --- | --- | --- |
| End‑to‑end latency (ms) | 210 | 124 (≈ 40 % reduction) |
| Energy per query (J) | 1.8 | 0.9 (≈ 50 % reduction) |
| Task accuracy (BLEU/F1) | 0.84 | 0.83 (no statistically significant drop) |
| Cache hit rate | N/A | 68 % (thanks to smart pre‑loading) |

Key takeaways

  • Latency gains stem mainly from routing short, latency‑sensitive queries to ultra‑lightweight SLMs, while delegating complex, high‑quality‑required queries to larger, more capable models.
  • Energy savings arise from keeping only a subset of models resident and avoiding unnecessary heavyweight inference.
  • Quality preservation is achieved by the QoS‑aware scoring that never sacrifices a model’s domain expertise when the task demands it.

Practical Implications

  • Edge & on‑prem deployments – Companies can run a mixed fleet of tiny (e.g., 80M‑parameter) and medium (e.g., 300M‑parameter) models on a single GPU/CPU box, delivering fast responses without the cloud‑API cost or data‑privacy concerns.
  • Cost‑effective scaling – Cloud providers can charge per‑model instance; CALM’s ability to keep only the needed models warm reduces VM/instance usage, translating into lower operational spend.
  • Developer ergonomics – The middleware abstracts away model‑selection logic; developers simply register new SLMs with a metadata file and let CALM handle routing.
  • Adaptive compliance – In regulated environments where certain data must stay on‑prem, CALM can enforce policies that route sensitive queries to locally hosted SLMs while sending non‑sensitive ones to cheaper cloud APIs.
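The registration and compliance points can be combined in a short sketch: models are registered with a metadata record, and a policy filter restricts which backends are eligible for a given query. The schema and field names (`hosting`, `params`) are hypothetical; the paper does not specify the metadata format.

```python
# Hypothetical per-model metadata a developer would register with CALM;
# the field names are illustrative, not taken from the paper.
FLEET_METADATA = [
    {"name": "tiny-80m",    "params": "80M",  "hosting": "on-prem"},
    {"name": "medium-300m", "params": "300M", "hosting": "on-prem"},
    {"name": "cloud-api",   "params": "n/a",  "hosting": "cloud"},
]

def eligible_models(query_is_sensitive, fleet=FLEET_METADATA):
    """Compliance filter applied before QoS ranking: sensitive queries
    may only be routed to on-prem models; non-sensitive queries may use
    any registered backend, including cheaper cloud APIs."""
    if query_is_sensitive:
        return [m for m in fleet if m["hosting"] == "on-prem"]
    return list(fleet)
```

Running the QoS ranking only over the filtered set keeps policy enforcement orthogonal to performance optimization, which is what makes the routing both adaptive and auditable.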

Limitations & Future Work

  • Model heterogeneity overhead – The current prototype assumes all SLMs share the same tokenizer and input format; extending to truly heterogeneous architectures (e.g., encoder‑decoder vs. decoder‑only) requires additional plumbing.
  • Cold‑start latency – Although caching mitigates it, the first request to a rarely used model still incurs a noticeable load time; future work could explore predictive warm‑up based on workload forecasting.
  • QoS metric weighting – The scoring function is manually tuned; learning optimal weights automatically from SLAs or business objectives remains an open challenge.
  • Security & isolation – Running multiple models in the same process may raise isolation concerns; container‑level sandboxing is a potential direction.

Overall, CALM demonstrates that a smart, self‑adaptive orchestration layer can unlock the efficiency of small language models without sacrificing the performance that users expect—an insight that could reshape how AI services are deployed at scale.

Authors

  • Hemang Jain
  • Divyansh Pandey
  • Karthik Vaidhyanathan

Paper Information

  • arXiv ID: 2602.03632v1
  • Categories: cs.SE
  • Published: February 3, 2026