Your AI SRE Doesn't Need One Model — It Needs the Right Model for Each Job
Source: Dev.to
Introduction
We built our first AI SRE integration with a single model: Opus for everything — incident triage, Kubernetes debugging, IAM policy review, cost‑anomaly detection. The idea was to use the best‑available model and not overthink it.
Three months in, the cost was real. And honestly, most of the tasks didn’t need Opus‑grade reasoning. Checking if a pod is in CrashLoopBackOff doesn’t require the same cognitive load as parsing a complex cross‑account IAM policy trust relationship.
Rootly published benchmark results this week that put actual numbers on a hunch most of us have been carrying. If you’re building AI SRE tooling — or about to — the findings are worth sitting with.
What the Benchmarks Found
Rootly ran Claude Sonnet 4.6 and Opus across four infrastructure task types:
| Task Type | Model Tested |
|---|---|
| Kubernetes debugging | Sonnet 4.6, Opus |
| Compute anomaly detection | Sonnet 4.6, Opus |
| IAM / S3 policy review | Sonnet 4.6, Opus |
| General infra work | Sonnet 4.6, Opus |
Key takeaway: Sonnet 4.6 performs comparably to Opus on Kubernetes and compute tasks. The gap widens on complex IAM and policy reasoning — that’s where Opus pulls ahead noticeably.
Why this makes sense
- K8s debugging is largely pattern-matching plus log interpretation (e.g., `OOMKilled`, `CrashLoopBackOff`). The model only needs to recognize a known pattern and suggest a known fix; a smaller, faster model handles it well.
- IAM is different. Cross-account trust policies, condition keys, SCPs interacting with permission boundaries, and AssumeRole chains require deep dependency reasoning. One wrong inference can change the security posture of an entire account, so higher reasoning capacity matters.
What Model Routing Looks Like in Practice
You don’t need a fancy framework to start. The simplest version is a routing function that maps task type → model at the entry point:
```python
# Mapping of task types to the model that should handle them
TASK_MODEL_MAP = {
    "k8s_debug": "claude-sonnet-4-6",
    "compute_anomaly": "claude-sonnet-4-6",
    "cost_analysis": "claude-sonnet-4-6",
    "iam_policy_review": "claude-opus-4-6",
    "security_audit": "claude-opus-4-6",
    "incident_triage": "claude-sonnet-4-6",  # fast first pass
    "incident_rca": "claude-opus-4-6",       # deep analysis on escalation
}

def route_task(task_type: str, payload: dict) -> str:
    """Choose the appropriate model based on task_type and invoke the LLM."""
    # call_llm is whatever provider client wrapper your stack already has
    model = TASK_MODEL_MAP.get(task_type, "claude-sonnet-4-6")
    return call_llm(model, payload)
```
You classify the task type at the entry point — from alert metadata, the PagerDuty service name, or a lightweight pre‑routing call — and route accordingly.
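That pre-routing step doesn't have to be an LLM call at all. Here's a minimal keyword-matching sketch; the keyword table, the alert-summary input, and the `general_infra` fallback are illustrative assumptions, not any real alerting schema:

```python
# Keyword-based task classifier (illustrative keywords, not a real schema)
ALERT_KEYWORDS = {
    "k8s_debug": ("crashloopbackoff", "oomkilled", "imagepullbackoff", "pod"),
    "iam_policy_review": ("accessdenied", "assumerole", "iam", "policy"),
    "compute_anomaly": ("cpu", "memory", "throttl"),
}

def classify_task(alert_summary: str) -> str:
    """Map raw alert text to a task type; fall back to general infra work."""
    text = alert_summary.lower()
    for task_type, keywords in ALERT_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return task_type
    return "general_infra"
```

A crude matcher like this is often good enough to start, and its misclassifications show up quickly once you have per-task metrics.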
Two‑stage routing for incidents
- Fast first‑pass triage (Sonnet): “Is this P1? What’s the likely cause?”
- Deep RCA (Opus): Triggered if the incident exceeds 15 minutes or the initial assessment is inconclusive.
Most incidents never need the second stage, saving cost while still providing depth when required.
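Sketched in code, the escalation logic is only a few lines. The `call_llm` callable, the `confidence` field, and the 0.7 threshold are assumptions for illustration:

```python
import time

SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-6"
ESCALATION_SECONDS = 15 * 60  # escalate to deep RCA after 15 minutes

def handle_incident(incident: dict, call_llm) -> dict:
    """Two-stage routing: Sonnet triage first, Opus RCA only on escalation.

    `call_llm(model, payload)` returns a dict; the `confidence` field and
    `started_at` timestamp are hypothetical contract details.
    """
    triage = call_llm(SONNET, {"task": "triage", "incident": incident})
    elapsed = time.time() - incident["started_at"]
    if triage.get("confidence", 0.0) < 0.7 or elapsed > ESCALATION_SECONDS:
        return call_llm(OPUS, {"task": "rca", "incident": incident,
                               "triage": triage})
    return triage
```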
The Cost Math
At Anthropic’s current pricing, Opus is roughly 3–5× the cost of Sonnet per output token.
- Scenario: 200 alerts/day processed with Opus on everything.
- Routing 70% of those to Sonnet (the predictable K8s and compute tasks) saves 60–70% of monthly LLM spend without sacrificing quality where it matters.
At scale — a platform team handling 500+ alerts/day across 20 services — that’s a meaningful reduction. The principle mirrors classic FinOps: right‑size resources to the workload.
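The arithmetic is easy to sanity-check. The sketch below assumes a flat 4× Opus-to-Sonnet price ratio and uniform token volume per request; both are simplifications, and real savings depend on per-task token counts:

```python
def monthly_savings_pct(routed_to_sonnet: float,
                        opus_cost_ratio: float = 4.0) -> float:
    """Fraction of LLM spend saved vs. running everything on Opus.

    Assumes every request costs the same number of tokens; only the
    per-token price differs by the given ratio (an illustrative 4x here).
    """
    all_opus = 1.0 * opus_cost_ratio
    mixed = routed_to_sonnet * 1.0 + (1 - routed_to_sonnet) * opus_cost_ratio
    return 1 - mixed / all_opus
```

Under these flat assumptions, routing 70% of traffic to Sonnet saves about 52%; the higher 60–70% figure is plausible when the Sonnet-routed tasks are also the token-light ones.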
What Teams Get Wrong When They Start Doing This
| Pitfall | Why It Matters | Quick Fix |
|---|---|---|
| Routing confidence | Mis‑classifying a security‑adjacent task can be costly. | Default to Opus when the classifier is uncertain. |
| Skipping caching | Even with a cheaper model, repeated context inflates token usage. | Cache system prompts and stable user‑message portions → 40–60 % token savings. |
| Missing observability | Without per-model and per-task metrics you can't tell why spend spikes. | Add a Prometheus counter with `model` and `task_type` labels, e.g. `llm_requests_total{model="claude-sonnet-4-6", task_type="k8s_debug"}` and `llm_requests_total{model="claude-opus-4-6", task_type="iam_policy_review"}`. |
A few minutes of instrumentation saves hours of confusion when the bill arrives.
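For the observability piece, even a stdlib counter keyed the same way as that Prometheus metric gets you started; in production you'd swap it for a `prometheus_client` `Counter` with `model` and `task_type` labels:

```python
from collections import Counter

# In-process stand-in for llm_requests_total{model=..., task_type=...};
# replace with a prometheus_client Counter in production.
llm_requests_total = Counter()

def record_request(model: str, task_type: str) -> None:
    """Increment the request counter for one (model, task_type) pair."""
    llm_requests_total[(model, task_type)] += 1

record_request("claude-sonnet-4-6", "k8s_debug")
record_request("claude-opus-4-6", "iam_policy_review")
record_request("claude-sonnet-4-6", "k8s_debug")
```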
What I’d Do Differently
- Log task type and model from day 0 — even before you build routing logic.
- Run a single model for a month, tagging every request with its task type.
- By month‑end you’ll have real data on task distribution (e.g., % K8s debugging vs. IAM work).
- You’ll know where the long‑tail reasoning tasks actually appear.
- Build the router with data, not intuition.
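Day-0 logging can be one structured line per request. A minimal sketch, with field names that are arbitrary placeholders for whatever your log pipeline expects:

```python
import json
import time

def log_llm_request(task_type: str, model: str,
                    tokens_in: int, tokens_out: int) -> str:
    """Emit one JSON log line per LLM call so task distribution is queryable.

    Field names here are illustrative, not a required schema.
    """
    record = {
        "ts": time.time(),
        "task_type": task_type,
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
    }
    line = json.dumps(record)
    print(line)  # or hand off to your structured logger
    return line
```

A month of these lines is exactly the task-distribution dataset the steps above call for.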
We’re still iterating on classification as new alert types emerge, and we don’t claim our routing is perfect. But having observability in place from the start makes it far easier to refine.
Feel free to adapt the snippets and ideas to your own stack. The key is simple: match the right model to the right job, and you’ll save money without compromising reliability.
Where This Goes
This is the beginning of LLMOps as a real discipline. Most teams right now are at “pick a model and use it everywhere” — which is fine for experimentation, and honestly fine for small scale. But as AI SRE moves from pilot to production, the operational concerns show up: cost, reliability, latency, quality by task type.
The teams that treat LLM infrastructure the way they treat compute infrastructure — with cost visibility, right‑sizing, and observability — will have a meaningful advantage over the ones still paying Opus rates to classify pod restarts.
Rootly’s benchmarks are one data point. Your production data is a better one. Start collecting it.
If you’re building AI SRE tooling and hitting interesting edge cases in model routing or task classification, reach out — I’m genuinely curious what patterns other teams are finding.