[Paper] CALM: A Self-Adaptive Orchestration Approach for QoS-Aware Routing in Small Language Model based Systems
Source: arXiv - 2602.03632v1
Overview
The paper presents CALM, a self‑adaptive orchestration layer that dynamically selects and routes requests to the most suitable small language model (SLM) from a fleet of specialized models. By continuously monitoring workload characteristics and QoS metrics (latency, energy, response quality), CALM can cut inference latency by ~40 % and halve energy use while keeping task performance on par with single‑model deployments.
Key Contributions
- QoS‑aware multi‑model orchestration – Introduces a MAPE‑K (Monitor‑Analyze‑Plan‑Execute‑Knowledge) loop that decides, per request, which SLM should handle the query.
- Dynamic caching & scheduling – A lightweight scheduler keeps the most promising SLMs resident in memory, reducing cold‑start overhead.
- Empirical validation – Experiments on several domain‑specific benchmarks show up to 40 % latency reduction and 50 % energy savings versus the best single‑SLM baseline.
- Open‑source reference implementation – The authors release a prototype that can be plugged into existing inference pipelines (e.g., Hugging Face Transformers, FastAPI).
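The per‑request MAPE‑K cycle described above can be sketched as a small control loop. The class and method names below are illustrative assumptions, not taken from the released prototype; the `plan` step here uses a latency‑only stand‑in policy (the full weighted QoS score is covered in the Methodology section).

```python
from dataclasses import dataclass, field

@dataclass
class RequestContext:
    """Runtime signals captured by the Monitor step."""
    query: str
    token_length: int
    request_rate: float  # recent requests/sec


@dataclass
class MapeKLoop:
    """Illustrative per-request MAPE-K cycle (names are hypothetical)."""
    knowledge: dict = field(default_factory=dict)  # historical perf data per model

    def monitor(self, query: str, request_rate: float) -> RequestContext:
        # Log the incoming query together with simple runtime signals.
        return RequestContext(query, len(query.split()), request_rate)

    def analyze(self, ctx: RequestContext, fleet: list) -> dict:
        # Predict latency/energy/quality per model from the knowledge base,
        # falling back to a default prior for unseen models.
        default = {"latency": 100.0, "energy": 1.0, "quality": 0.8}
        return {m: self.knowledge.get(m, default) for m in fleet}

    def plan(self, predictions: dict) -> str:
        # Stand-in policy: pick the model with the lowest predicted latency.
        return min(predictions, key=lambda m: predictions[m]["latency"])

    def execute(self, model: str, ctx: RequestContext) -> str:
        # In a real deployment this would forward to an inference server.
        return f"{model} handled: {ctx.query}"
```

A single pass through the loop then reads `monitor → analyze → plan → execute`, with `knowledge` updated from observed outcomes between passes.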
Methodology
- Monitoring – Every incoming user query is logged together with runtime signals (token length, request rate, hardware utilization).
- Analysis – A lightweight predictor estimates the expected latency, energy cost, and quality for each SLM in the fleet given the current context.
- Planning – The system ranks SLMs based on a weighted QoS score (configurable by the operator).
- Execution – The top‑ranked model receives the request; if the model is not already loaded, CALM triggers a pre‑fetch based on the scheduler’s cache policy.
- Knowledge base – Historical performance data continuously updates the predictor, allowing the loop to adapt to model drift or hardware changes.
The orchestration logic is implemented as a thin middleware layer that can sit in front of any inference server, requiring only standard REST/gRPC hooks.
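The weighted QoS ranking in the Planning step can be approximated as below. The weight values, units, and normalization are operator‑configurable assumptions for illustration, not numbers from the paper.

```python
def qos_score(pred: dict, weights: dict) -> float:
    """Weighted QoS score: higher is better.

    Latency (ms) and energy (J) are costs and enter negatively;
    predicted quality (0..1) is a benefit. Latency is rescaled to
    seconds so the three terms are of comparable magnitude.
    """
    return (weights["quality"] * pred["quality"]
            - weights["latency"] * pred["latency"] / 1000.0
            - weights["energy"] * pred["energy"])


def rank_models(predictions: dict, weights: dict) -> list:
    """Return model names ordered best-first by QoS score."""
    return sorted(predictions,
                  key=lambda m: qos_score(predictions[m], weights),
                  reverse=True)
```

With latency weighted heavily, a short query would rank an ultra‑lightweight SLM first; shifting weight onto quality flips the ranking toward the larger model, which is exactly the trade‑off the operator‑configurable score exposes.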
Results & Findings
| Metric | Single‑SLM Baseline | CALM (multi‑SLM) |
|---|---|---|
| End‑to‑end latency (ms) | 210 | 124 (≈ 40 % reduction) |
| Energy per query (J) | 1.8 | 0.9 (≈ 50 % reduction) |
| Task accuracy (BLEU/F1) | 0.84 | 0.83 (no statistically significant drop) |
| Cache hit rate | N/A | 68 % (enabled by predictive pre‑loading) |
Key takeaways
- Latency gains stem mainly from routing short, latency‑sensitive queries to ultra‑lightweight SLMs, while delegating complex queries with stringent quality requirements to larger, more capable models.
- Energy savings arise from keeping only a subset of models resident and avoiding unnecessary heavyweight inference.
- Quality preservation is achieved by the QoS‑aware scoring, which still routes a query to a domain‑expert model whenever the task demands that expertise.
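The "keep only a subset of models resident" behavior behind the energy savings and the 68 % cache hit rate can be illustrated with a small LRU model cache. The eviction policy and capacity below are assumptions for illustration; the paper's scheduler uses its own cache policy.

```python
from collections import OrderedDict


class ModelCache:
    """Illustrative LRU cache of loaded models.

    Evicts the least-recently-used model when capacity is exceeded;
    a miss corresponds to a cold start (model load)."""

    def __init__(self, capacity: int, loader):
        self.capacity = capacity
        self.loader = loader          # callable: name -> loaded model object
        self._resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, name: str):
        if name in self._resident:
            self._resident.move_to_end(name)  # mark as recently used
            self.hits += 1
        else:
            self.misses += 1                  # cold start: load the model
            if len(self._resident) >= self.capacity:
                self._resident.popitem(last=False)  # evict LRU model
            self._resident[name] = self.loader(name)
        return self._resident[name]

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In this sketch, a skewed workload that mostly reuses one or two models keeps the hit rate high while the rest of the fleet stays unloaded, which is where the memory and energy savings come from.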
Practical Implications
- Edge & on‑prem deployments – Companies can run a mixed fleet of tiny (e.g., 80M‑parameter) and medium (e.g., 300M‑parameter) models on a single GPU/CPU box, delivering fast responses without the cloud‑API cost or data‑privacy concerns.
- Cost‑effective scaling – Cloud providers can charge per‑model instance; CALM’s ability to keep only the needed models warm reduces VM/instance usage, translating into lower operational spend.
- Developer ergonomics – The middleware abstracts away model‑selection logic; developers simply register new SLMs with a metadata file and let CALM handle routing.
- Adaptive compliance – In regulated environments where certain data must stay on‑prem, CALM can enforce policies that route sensitive queries to locally hosted SLMs while sending non‑sensitive ones to cheaper cloud APIs.
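Registering a new SLM with a metadata file, as described above, might look like the following. The schema, field names, and `data_policy` value are hypothetical, not the prototype's actual format.

```python
import json

# Hypothetical metadata for a newly registered SLM; all fields are illustrative.
METADATA = """
{
  "name": "legal-slm-80m",
  "parameters": 80000000,
  "domains": ["legal"],
  "endpoint": "http://localhost:8001/generate",
  "data_policy": "on_prem_only"
}
"""


def register_model(registry: dict, metadata_json: str) -> dict:
    """Validate required fields and add the model to an in-memory registry."""
    meta = json.loads(metadata_json)
    required = {"name", "endpoint", "domains"}
    missing = required - meta.keys()
    if missing:
        raise ValueError(f"metadata missing fields: {sorted(missing)}")
    registry[meta["name"]] = meta
    return meta
```

A policy field like the assumed `data_policy` is also where the compliance routing above would hook in: the planner can filter the fleet to on‑prem models before scoring whenever a query is flagged as sensitive.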
Limitations & Future Work
- Model heterogeneity overhead – The current prototype assumes all SLMs share the same tokenizer and input format; extending to truly heterogeneous architectures (e.g., encoder‑decoder vs. decoder‑only) requires additional plumbing.
- Cold‑start latency – Although caching mitigates it, the first request to a rarely used model still incurs a noticeable load time; future work could explore predictive warm‑up based on workload forecasting.
- QoS metric weighting – The scoring function is manually tuned; learning optimal weights automatically from SLAs or business objectives remains an open challenge.
- Security & isolation – Running multiple models in the same process may raise isolation concerns; container‑level sandboxing is a potential direction.
Overall, CALM demonstrates that a smart, self‑adaptive orchestration layer can unlock the efficiency of small language models without sacrificing the performance that users expect—an insight that could reshape how AI services are deployed at scale.
Authors
- Hemang Jain
- Divyansh Pandey
- Karthik Vaidhyanathan
Paper Information
- arXiv ID: 2602.03632v1
- Categories: cs.SE
- Published: February 3, 2026