[Paper] ORACL: Optimized Reasoning for Autoscaling via Chain of Thought with LLMs for Microservices

Published: February 4, 2026 at 11:27 PM EST
4 min read

Source: arXiv - 2602.05292v1

Overview

The paper introduces ORACL, a novel framework that uses large language models (LLMs) to automate autoscaling decisions for microservice‑based applications. By turning raw telemetry into natural‑language descriptions and prompting an LLM to “think out loud,” ORACL can diagnose performance problems and recommend safe resource adjustments without the need for per‑deployment training.

Key Contributions

  • LLM‑driven autoscaling: Demonstrates that a single, pre‑trained LLM can act as a universal, few‑shot resource allocator across diverse microservice workloads.
  • Chain‑of‑thought prompting: Introduces a structured prompting technique that forces the LLM to produce an interpretable reasoning trace (root‑cause hypothesis → action pruning → allocation decision).
  • Semantic telemetry translation: Converts low‑level metrics (CPU, memory, latency, replica counts, fault signals) into concise natural‑language state descriptions for the LLM.
  • Policy‑aware decision making: Embeds safety constraints (e.g., max/min replicas, budget caps) directly into the LLM’s output validation step.
  • Empirical gains: Shows a 15 % boost in root‑cause identification accuracy, up to 24× faster “training” (i.e., prompt tuning) compared with traditional RL‑based autoscalers, and a 6 % QoS improvement in short‑term bursts.

Methodology

  1. Telemetry Collection – ORACL continuously gathers runtime data from Kubernetes (pods, replica counts, CPU/memory usage, request latency, SLO violations, and fault events).

  2. Natural‑Language Encoding – A lightweight transformer converts the raw metrics into a short paragraph such as:

    “Service A is running 3 replicas, CPU at 78 %, memory at 62 %; latency avg = 210 ms (SLO = 200 ms); recent pod restarts observed.”
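The paper performs this encoding with a lightweight transformer; as a rough illustration of the target output format only, a template-based sketch (the metric field names here are assumptions, not the paper's schema):

```python
def encode_state(metrics: dict) -> str:
    """Render raw telemetry as a short natural-language state description.

    Template-based stand-in for the paper's learned encoder; field names
    (cpu_pct, slo_ms, ...) are illustrative assumptions.
    """
    slo_note = "SLO met" if metrics["latency_ms"] <= metrics["slo_ms"] else "SLO violated"
    restart_note = "; recent pod restarts observed" if metrics.get("restarts", 0) > 0 else ""
    return (
        f"Service {metrics['service']} is running {metrics['replicas']} replicas, "
        f"CPU at {metrics['cpu_pct']} %, memory at {metrics['mem_pct']} %; "
        f"latency avg = {metrics['latency_ms']} ms "
        f"(SLO = {metrics['slo_ms']} ms, {slo_note}){restart_note}."
    )
```

A learned encoder can compress and prioritize signals in ways a fixed template cannot, but the template makes the input/output contract of this stage concrete.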

  3. Chain‑of‑Thought Prompt – The encoded state is fed to a pre‑trained LLM (e.g., GPT‑4) with a prompt that asks it to:

    • List possible root causes (e.g., CPU saturation, memory pressure, network throttling).
    • Rank them based on the evidence.
    • Suggest a minimal set of scaling actions that would resolve the top‑ranked cause while respecting policy limits.
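A minimal sketch of such a prompt template, assuming a JSON answer format for easy parsing (the paper does not publish its exact prompt wording or output schema):

```python
COT_PROMPT = """You are an autoscaling assistant for Kubernetes microservices.

Current state:
{state}

Policy limits: min_replicas={min_replicas}, max_replicas={max_replicas}.

Think step by step:
1. List possible root causes (e.g., CPU saturation, memory pressure, network throttling).
2. Rank them by how well the evidence above supports each.
3. Suggest the minimal scaling action that resolves the top-ranked cause
   while staying within the policy limits.

End with a JSON object on the last line:
{{"root_cause": ..., "action": ..., "target": ...}}"""


def build_prompt(state: str, min_replicas: int = 1, max_replicas: int = 10) -> str:
    """Fill the chain-of-thought template with the encoded state and policy limits."""
    return COT_PROMPT.format(state=state, min_replicas=min_replicas,
                             max_replicas=max_replicas)
```

Embedding the policy limits directly in the prompt lets the model prune infeasible actions during reasoning, before the downstream validator ever sees them.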

  4. Reasoning Trace Extraction – The LLM’s output includes an explicit step‑by‑step trace, which is parsed and logged for human auditability.
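If the prompt asks the model to end with a JSON decision line (an assumption; the paper's exact output schema is not reproduced here), separating the auditable trace from the machine-readable decision is straightforward:

```python
import json


def split_trace_and_decision(output: str) -> tuple[str, dict]:
    """Separate the step-by-step reasoning trace from the final JSON decision.

    Assumes the prompt instructed the model to emit the decision as a JSON
    object on the last line; everything above it is logged as the trace.
    """
    *trace_lines, decision_line = output.strip().splitlines()
    return "\n".join(trace_lines), json.loads(decision_line)
```

The trace string is what operators see in the audit log; the decision dict is what flows on to action pruning and validation.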

  5. Action Pruning & Enforcement – The trace is used to narrow the action space (e.g., only increase CPU limits, not replicas) and a final validator checks that the suggested allocation obeys safety constraints before applying it via the Kubernetes autoscaler API.
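The final validation step above can be sketched as a simple allow-list check; the action names and policy keys below are illustrative assumptions, not the paper's interface:

```python
def validate_action(action: dict, policy: dict) -> bool:
    """Reject any LLM-proposed allocation that breaches policy limits.

    Unknown action types fail closed: they are never applied. Action and
    policy field names here are hypothetical examples.
    """
    kind = action.get("action")
    if kind == "scale_replicas":
        return policy["min_replicas"] <= action["target"] <= policy["max_replicas"]
    if kind == "set_cpu_limit_millicores":
        return action["target"] <= policy["max_cpu_millicores"]
    return False  # fail closed on anything the validator does not recognize
```

Failing closed is the key design choice: a hallucinated or malformed action is dropped rather than forwarded to the Kubernetes autoscaler API.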

The whole pipeline runs in near‑real‑time (sub‑second latency) and requires only a few example prompts to adapt to a new microservice deployment.

Results & Findings

| Metric | Baseline (hand‑tuned / RL) | ORACL |
| --- | --- | --- |
| Root‑cause identification accuracy | 68 % | 83 % (+15 %) |
| Training / prompt‑tuning time* | ~12 h (RL) | ≈30 min (≈24× faster) |
| QoS (SLO compliance) under burst load | 92 % | 98 % (+6 %) |
| Scaling decision latency | 1.2 s | 0.4 s |

*Training here refers to the time needed to collect enough data for a reinforcement‑learning autoscaler to converge; ORACL only needs a handful of few‑shot examples.

The authors also report that the reasoning traces are human‑readable and helped operators quickly verify why a scaling action was taken, reducing debugging time.

Practical Implications

  • Universal Autoscaler: Teams can drop a single ORACL agent into any Kubernetes cluster and get competent autoscaling without custom model training.
  • Reduced Ops Overhead: The chain‑of‑thought trace serves as documentation, making it easier for SREs to audit and trust automated decisions.
  • Faster Incident Response: By pinpointing root causes in real time, ORACL can trigger targeted scaling (e.g., only increase CPU for a hot service) instead of blanket replica spikes, saving cloud spend.
  • Portability: Because the LLM is pre‑trained, the same prompt library works across languages, frameworks, and cloud providers, easing multi‑cloud strategies.
  • Safety Guarantees: Policy constraints baked into the validation step prevent runaway scaling that could breach budgets or violate capacity caps.

Developers can integrate ORACL via a lightweight sidecar or as a custom controller in the Kubernetes control plane, leveraging existing observability stacks (Prometheus, OpenTelemetry) for telemetry ingestion.
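For the Prometheus side of that integration, telemetry ingestion reduces to instant queries against the `/api/v1/query` HTTP endpoint. A minimal sketch (the base URL and PromQL expressions are deployment-specific; the paper does not prescribe particular queries):

```python
import urllib.parse


def build_query_url(base_url: str, promql: str) -> str:
    """URL for a Prometheus instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def extract_instant_value(body: dict) -> float:
    """Pull the scalar out of a Prometheus instant-query response.

    Instant queries return a list of series, each carrying a
    [timestamp, "value"] pair; this takes the first series' value.
    """
    return float(body["data"]["result"][0]["value"][1])
```

An agent would fetch each URL with any HTTP client, pass the decoded JSON to `extract_instant_value`, and feed the resulting numbers into the natural-language encoding step.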

Limitations & Future Work

  • LLM Hallucination Risk: Although the chain‑of‑thought prompt reduces nonsense, the LLM can still propose irrelevant causes; a fallback verifier is needed for production safety.
  • Prompt Engineering Overhead: Crafting optimal prompts for highly specialized services may require domain expertise.
  • Scalability of the LLM Service: Running a large model (e.g., GPT‑4) for every scaling decision can be costly; future work could explore distilled or on‑edge models.
  • Broader Workload Diversity: Experiments were limited to a few open‑source microservice benchmarks; testing on large‑scale, heterogeneous enterprise workloads remains an open step.

The authors plan to explore automated prompt refinement, integrate model‑distillation pipelines, and evaluate ORACL in multi‑tenant SaaS platforms.

Authors

  • Haoyu Bai
  • Muhammed Tawfiqul Islam
  • Minxian Xu
  • Rajkumar Buyya

Paper Information

  • arXiv ID: 2602.05292v1
  • Categories: cs.DC
  • Published: February 5, 2026
  • PDF: Download PDF