Your AI SRE Doesn't Need One Model — It Needs the Right Model for Each Job
Source: Dev.to
Introduction
We built our first AI SRE integration with a single model: Opus for everything — incident triage, Kubernetes debugging, IAM policy review, cost‑anomaly detection. The idea was to use the best‑available model and not overthink it.
Three months in, the cost was real. And honestly, most of the tasks didn’t need Opus‑grade reasoning. Checking if a pod is in CrashLoopBackOff doesn’t require the same cognitive load as parsing a complex cross‑account IAM policy trust relationship.
Rootly published benchmark results this week that put actual numbers on a hunch most of us have been carrying. If you’re building AI SRE tooling — or about to — the findings are worth sitting with.
What the Benchmarks Found
Rootly ran Claude Sonnet 4.6 and Opus across four infrastructure task types:
| Task Type | Model Tested |
|---|---|
| Kubernetes debugging | Sonnet 4.6, Opus |
| Compute anomaly detection | Sonnet 4.6, Opus |
| IAM / S3 policy review | Sonnet 4.6, Opus |
| General infra work | Sonnet 4.6, Opus |
Key takeaway: Sonnet 4.6 performs comparably to Opus on Kubernetes and compute tasks. The gap widens on complex IAM and policy reasoning — that’s where Opus pulls ahead noticeably.
Why this makes sense
- K8s debugging is largely pattern-matching plus log interpretation (e.g., `OOMKilled`, `CrashLoopBackOff`). The model only needs to recognize a known pattern and suggest a known fix; a smaller, faster model handles it well.
- IAM is different. Cross-account trust policies, condition keys, SCPs interacting with permission boundaries, and AssumeRole chains require deep dependency reasoning. One wrong inference can change the security posture of an entire account, so higher reasoning capacity matters.
What Model Routing Looks Like in Practice
You don’t need a fancy framework to start. The simplest version is a routing function that maps task type → model at the entry point:
```python
# Mapping of task types to the model that should handle them
TASK_MODEL_MAP = {
    "k8s_debug": "claude-sonnet-4-6",
    "compute_anomaly": "claude-sonnet-4-6",
    "cost_analysis": "claude-sonnet-4-6",
    "iam_policy_review": "claude-opus-4-6",
    "security_audit": "claude-opus-4-6",
    "incident_triage": "claude-sonnet-4-6",  # fast first pass
    "incident_rca": "claude-opus-4-6",       # deep analysis on escalation
}

def route_task(task_type: str, payload: dict) -> str:
    """Choose the appropriate model based on task_type and invoke the LLM."""
    # call_llm is whatever provider client wrapper your stack already has
    model = TASK_MODEL_MAP.get(task_type, "claude-sonnet-4-6")
    return call_llm(model, payload)
```
You classify the task type at the entry point — from alert metadata, the PagerDuty service name, or a lightweight pre‑routing call — and route accordingly.
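That pre-routing step doesn't have to be an LLM call at all. Here's a minimal keyword-matching sketch; the keyword table, the alert-summary input, and the `general_infra` fallback are illustrative assumptions, not any real alerting schema:

```python
# Keyword-based task classifier (illustrative keywords, not a real schema)
ALERT_KEYWORDS = {
    "k8s_debug": ("crashloopbackoff", "oomkilled", "imagepullbackoff", "pod"),
    "iam_policy_review": ("accessdenied", "assumerole", "iam", "policy"),
    "compute_anomaly": ("cpu", "memory", "throttl"),
}

def classify_task(alert_summary: str) -> str:
    """Map raw alert text to a task type; fall back to general infra work."""
    text = alert_summary.lower()
    for task_type, keywords in ALERT_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return task_type
    return "general_infra"
```

A crude matcher like this is often good enough to start, and its misclassifications show up quickly once you have per-task metrics.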
Two‑stage routing for incidents
- Fast first‑pass triage (Sonnet): “Is this P1? What’s the likely cause?”
- Deep RCA (Opus): Triggered if the incident exceeds 15 minutes or the initial assessment is inconclusive.
Most incidents never need the second stage, saving cost while still providing depth when required.
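Sketched in code, the escalation logic is only a few lines. The `call_llm` callable, the `confidence` field, and the 0.7 threshold are assumptions for illustration:

```python
import time

SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-6"
ESCALATION_SECONDS = 15 * 60  # escalate to deep RCA after 15 minutes

def handle_incident(incident: dict, call_llm) -> dict:
    """Two-stage routing: Sonnet triage first, Opus RCA only on escalation.

    `call_llm(model, payload)` returns a dict; the `confidence` field and
    `started_at` timestamp are hypothetical contract details.
    """
    triage = call_llm(SONNET, {"task": "triage", "incident": incident})
    elapsed = time.time() - incident["started_at"]
    if triage.get("confidence", 0.0) < 0.7 or elapsed > ESCALATION_SECONDS:
        return call_llm(OPUS, {"task": "rca", "incident": incident,
                               "triage": triage})
    return triage
```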
The Cost Math
At Anthropic’s current pricing, Opus is roughly 3–5× the cost of Sonnet per output token.
- Scenario: 200 alerts/day processed with Opus on everything.
- Routing 70% of those to Sonnet (the predictable K8s and compute tasks) saves 60–70% of monthly LLM spend without sacrificing quality where it matters.
At scale — a platform team handling 500+ alerts/day across 20 services — that’s a meaningful reduction. The principle mirrors classic FinOps: right‑size resources to the workload.
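The arithmetic is easy to sanity-check. The sketch below assumes a flat 4× Opus-to-Sonnet price ratio and uniform token volume per request; both are simplifications, and real savings depend on per-task token counts:

```python
def monthly_savings_pct(routed_to_sonnet: float,
                        opus_cost_ratio: float = 4.0) -> float:
    """Fraction of LLM spend saved vs. running everything on Opus.

    Assumes every request costs the same number of tokens; only the
    per-token price differs by the given ratio (an illustrative 4x here).
    """
    all_opus = 1.0 * opus_cost_ratio
    mixed = routed_to_sonnet * 1.0 + (1 - routed_to_sonnet) * opus_cost_ratio
    return 1 - mixed / all_opus
```

Under these flat assumptions, routing 70% of traffic to Sonnet saves about 52%; the higher 60–70% figure is plausible when the Sonnet-routed tasks are also the token-light ones.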
What Teams Get Wrong When They Start Doing This
| Pitfall | Why It Matters | Quick Fix |
|---|---|---|
| Routing confidence | Mis‑classifying a security‑adjacent task can be costly. | Default to Opus when the classifier is uncertain. |
| Skipping caching | Even with a cheaper model, repeated context inflates token usage. | Cache system prompts and stable user‑message portions → 40–60 % token savings. |
| Missing observability | Without per-model and per-task metrics you can't tell why spend spikes. | Add a Prometheus counter with `model` and `task_type` labels, e.g. `llm_requests_total{model="claude-sonnet-4-6", task_type="k8s_debug"}` and `llm_requests_total{model="claude-opus-4-6", task_type="iam_policy_review"}`. |
A few minutes of instrumentation saves hours of confusion when the bill arrives.
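For the observability piece, even a stdlib counter keyed the same way as that Prometheus metric gets you started; in production you'd swap it for a `prometheus_client` `Counter` with `model` and `task_type` labels:

```python
from collections import Counter

# In-process stand-in for llm_requests_total{model=..., task_type=...};
# replace with a prometheus_client Counter in production.
llm_requests_total = Counter()

def record_request(model: str, task_type: str) -> None:
    """Increment the request counter for one (model, task_type) pair."""
    llm_requests_total[(model, task_type)] += 1

record_request("claude-sonnet-4-6", "k8s_debug")
record_request("claude-opus-4-6", "iam_policy_review")
record_request("claude-sonnet-4-6", "k8s_debug")
```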
What I’d Do Differently
- Log task type and model from day 0 — even before you build routing logic.
- Run a single model for a month, tagging every request with its task type.
- By month‑end you’ll have real data on task distribution (e.g., % K8s debugging vs. IAM work).
- You’ll know where the long‑tail reasoning tasks actually appear.
- Build the router with data, not intuition.
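Day-0 logging can be one structured line per request. A minimal sketch, with field names that are arbitrary placeholders for whatever your log pipeline expects:

```python
import json
import time

def log_llm_request(task_type: str, model: str,
                    tokens_in: int, tokens_out: int) -> str:
    """Emit one JSON log line per LLM call so task distribution is queryable.

    Field names here are illustrative, not a required schema.
    """
    record = {
        "ts": time.time(),
        "task_type": task_type,
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
    }
    line = json.dumps(record)
    print(line)  # or hand off to your structured logger
    return line
```

A month of these lines is exactly the task-distribution dataset the steps above call for.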
We’re still iterating on classification as new alert types emerge, and we don’t claim our routing is perfect. But having observability in place from the start makes it far easier to refine.
Feel free to adapt the snippets and ideas to your own stack. The key is simple: match the right model to the right job, and you’ll save money without compromising reliability.
Where This Goes
This is the beginning of LLMOps as a real discipline. Most teams right now are at “pick a model and use it everywhere” — which is fine for experimentation, and honestly fine for small scale. But as AI SRE moves from pilot to production, the operational concerns show up: cost, reliability, latency, quality by task type.
The teams that treat LLM infrastructure the way they treat compute infrastructure — with cost visibility, right‑sizing, and observability — will have a meaningful advantage over the ones still paying Opus rates to classify pod restarts.
Rootly’s benchmarks are one data point. Your production data is a better one. Start collecting it.
If you’re building AI SRE tooling and hitting interesting edge cases in model routing or task classification, reach out — I’m genuinely curious what patterns other teams are finding.