[Paper] CALM: A Self-Adaptive Orchestration Approach for QoS-Aware Routing in Small Language Model based Systems
Source: arXiv - 2602.03632v1
Overview
The paper presents CALM, a self‑adaptive orchestration layer that dynamically selects and routes requests to the most suitable small language model (SLM) from a fleet of specialized models. By continuously monitoring workload characteristics and QoS metrics (latency, energy, response quality), CALM can cut inference latency by ~40 % and halve energy use while keeping task performance on par with single‑model deployments.
Key Contributions
- QoS‑aware multi‑model orchestration – Introduces a MAPE‑K (Monitor‑Analyze‑Plan‑Execute‑Knowledge) loop that decides, per request, which SLM should handle the query.
- Dynamic caching & scheduling – A lightweight scheduler keeps the most promising SLMs resident in memory, reducing cold‑start overhead.
- Empirical validation – Experiments on several domain‑specific benchmarks show up to 40 % latency reduction and 50 % energy savings versus the best single‑SLM baseline.
- Open‑source reference implementation – The authors release a prototype that can be plugged into existing inference pipelines (e.g., Hugging Face Transformers, FastAPI).
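The per‑request MAPE‑K cycle described above can be sketched as a small control loop. The class and method names below are illustrative assumptions, not taken from the released prototype; the `plan` step here uses a latency‑only stand‑in policy (the full weighted QoS score is covered in the Methodology section).

```python
from dataclasses import dataclass, field

@dataclass
class RequestContext:
    """Runtime signals captured by the Monitor step."""
    query: str
    token_length: int
    request_rate: float  # recent requests/sec


@dataclass
class MapeKLoop:
    """Illustrative per-request MAPE-K cycle (names are hypothetical)."""
    knowledge: dict = field(default_factory=dict)  # historical perf data per model

    def monitor(self, query: str, request_rate: float) -> RequestContext:
        # Log the incoming query together with simple runtime signals.
        return RequestContext(query, len(query.split()), request_rate)

    def analyze(self, ctx: RequestContext, fleet: list) -> dict:
        # Predict latency/energy/quality per model from the knowledge base,
        # falling back to a default prior for unseen models.
        default = {"latency": 100.0, "energy": 1.0, "quality": 0.8}
        return {m: self.knowledge.get(m, default) for m in fleet}

    def plan(self, predictions: dict) -> str:
        # Stand-in policy: pick the model with the lowest predicted latency.
        return min(predictions, key=lambda m: predictions[m]["latency"])

    def execute(self, model: str, ctx: RequestContext) -> str:
        # In a real deployment this would forward to an inference server.
        return f"{model} handled: {ctx.query}"
```

A single pass through the loop then reads `monitor → analyze → plan → execute`, with `knowledge` updated from observed outcomes between passes.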
Methodology
- Monitoring – Every incoming user query is logged together with runtime signals (token length, request rate, hardware utilization).
- Analysis – A lightweight predictor estimates the expected latency, energy cost, and quality for each SLM in the fleet given the current context.
- Planning – The system ranks SLMs based on a weighted QoS score (configurable by the operator).
- Execution – The top‑ranked model receives the request; if the model is not already loaded, CALM triggers a pre‑fetch based on the scheduler’s cache policy.
- Knowledge base – Historical performance data continuously updates the predictor, allowing the loop to adapt to model drift or hardware changes.
The orchestration logic is implemented as a thin middleware layer that can sit in front of any inference server, requiring only standard REST/gRPC hooks.
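The weighted QoS ranking in the Planning step can be approximated as below. The weight values, units, and normalization are operator‑configurable assumptions for illustration, not numbers from the paper.

```python
def qos_score(pred: dict, weights: dict) -> float:
    """Weighted QoS score: higher is better.

    Latency (ms) and energy (J) are costs and enter negatively;
    predicted quality (0..1) is a benefit. Latency is rescaled to
    seconds so the three terms are of comparable magnitude.
    """
    return (weights["quality"] * pred["quality"]
            - weights["latency"] * pred["latency"] / 1000.0
            - weights["energy"] * pred["energy"])


def rank_models(predictions: dict, weights: dict) -> list:
    """Return model names ordered best-first by QoS score."""
    return sorted(predictions,
                  key=lambda m: qos_score(predictions[m], weights),
                  reverse=True)
```

With latency weighted heavily, a short query would rank an ultra‑lightweight SLM first; shifting weight onto quality flips the ranking toward the larger model, which is exactly the trade‑off the operator‑configurable score exposes.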
Results & Findings
| Metric | Single‑SLM Baseline | CALM (multi‑SLM) |
|---|---|---|
| End‑to‑end latency (ms) | 210 | 124 (≈ 40 % reduction) |
| Energy per query (J) | 1.8 | 0.9 (≈ 50 % reduction) |
| Task accuracy (BLEU/F1) | 0.84 | 0.83 (no statistically significant drop) |
| Cache hit rate | N/A | 68 % (enabled by predictive pre‑loading) |
Key takeaways
- Latency gains stem mainly from routing short, latency‑sensitive queries to ultra‑lightweight SLMs, while delegating complex queries with stringent quality requirements to larger, more capable models.
- Energy savings arise from keeping only a subset of models resident and avoiding unnecessary heavyweight inference.
- Quality preservation is achieved by the QoS‑aware scoring, which still routes a query to a domain‑expert model whenever the task demands that expertise.
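The "keep only a subset of models resident" behavior behind the energy savings and the 68 % cache hit rate can be illustrated with a small LRU model cache. The eviction policy and capacity below are assumptions for illustration; the paper's scheduler uses its own cache policy.

```python
from collections import OrderedDict


class ModelCache:
    """Illustrative LRU cache of loaded models.

    Evicts the least-recently-used model when capacity is exceeded;
    a miss corresponds to a cold start (model load)."""

    def __init__(self, capacity: int, loader):
        self.capacity = capacity
        self.loader = loader          # callable: name -> loaded model object
        self._resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, name: str):
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as recently used
            self.hits += 1
        else:
            self.misses += 1                  # cold start: load the model
            if len(self._resident) >= self.capacity:
                self._resident.popitem(last=False)  # evict LRU model
            self._resident[name] = self.loader(name)
        return self._resident[name]

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In this sketch, a skewed workload that mostly reuses one or two models keeps the hit rate high while the rest of the fleet stays unloaded, which is where the memory and energy savings come from.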
Practical Implications
- Edge & on‑prem deployments – Companies can run a mixed fleet of tiny (e.g., 80M‑parameter) and medium (e.g., 300M‑parameter) models on a single GPU/CPU box, delivering fast responses without the cloud‑API cost or data‑privacy concerns.
- Cost‑effective scaling – Cloud providers can charge per‑model instance; CALM’s ability to keep only the needed models warm reduces VM/instance usage, translating into lower operational spend.
- Developer ergonomics – The middleware abstracts away model‑selection logic; developers simply register new SLMs with a metadata file and let CALM handle routing.
- Adaptive compliance – In regulated environments where certain data must stay on‑prem, CALM can enforce policies that route sensitive queries to locally hosted SLMs while sending non‑sensitive ones to cheaper cloud APIs.
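Registering a new SLM with a metadata file, as described above, might look like the following. The schema, field names, and `data_policy` value are hypothetical, not the prototype's actual format.

```python
import json

# Hypothetical metadata for a newly registered SLM; all fields are illustrative.
METADATA = """
{
  "name": "legal-slm-80m",
  "parameters": 80000000,
  "domains": ["legal"],
  "endpoint": "http://localhost:8001/generate",
  "data_policy": "on_prem_only"
}
"""


def register_model(registry: dict, metadata_json: str) -> dict:
    """Validate required fields and add the model to an in-memory registry."""
    meta = json.loads(metadata_json)
    required = {"name", "endpoint", "domains"}
    missing = required - meta.keys()
    if missing:
        raise ValueError(f"metadata missing fields: {sorted(missing)}")
    registry[meta["name"]] = meta
    return meta
```

A policy field like the assumed `data_policy` is also where the compliance routing above would hook in: the planner can filter the fleet to on‑prem models before scoring whenever a query is flagged as sensitive.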
Limitations & Future Work
- Model heterogeneity overhead – The current prototype assumes all SLMs share the same tokenizer and input format; extending to truly heterogeneous architectures (e.g., encoder‑decoder vs. decoder‑only) requires additional plumbing.
- Cold‑start latency – Although caching mitigates it, the first request to a rarely used model still incurs a noticeable load time; future work could explore predictive warm‑up based on workload forecasting.
- QoS metric weighting – The scoring function is manually tuned; learning optimal weights automatically from SLAs or business objectives remains an open challenge.
- Security & isolation – Running multiple models in the same process may raise isolation concerns; container‑level sandboxing is a potential direction.
Overall, CALM demonstrates that a smart, self‑adaptive orchestration layer can unlock the efficiency of small language models without sacrificing the performance that users expect—an insight that could reshape how AI services are deployed at scale.
Authors
- Hemang Jain
- Divyansh Pandey
- Karthik Vaidhyanathan
Paper Information
- arXiv ID: 2602.03632v1
- Categories: cs.SE
- Published: February 3, 2026