Kubernetes rollouts: promote on SLOs, not on 'pods are Ready'
Source: Dev.to
Readiness is a local signal. Production impact is global.
Pods can be Ready while your SLO window is already burning.
The failure chain
- Rollout shifts traffic fast.
- New pods saturate before HPA reacts.
- The HPA reacts on a delay: its control loop runs every 15 seconds by default, and metrics collection adds further lag.
- P95 latency climbs.
- Error rate ticks up.
- SLI degrades.
Everything looks healthy, but the error budget is draining quietly.
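To make "draining quietly" concrete, here is a minimal sketch of burn-rate arithmetic in Python. The 99.9% target, the 30-day window, and the 0.5% error rate are illustrative assumptions, not figures from a real service.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is spent.

    A burn rate of 1.0 uses exactly the whole budget over the SLO window.
    """
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the budget is gone if the current burn rate persists."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


# Assumed numbers: 99.9% availability SLO, 0.5% of requests failing mid-rollout.
rate = burn_rate(error_rate=0.005, slo_target=0.999)
print(f"burn rate: {rate:.1f}x, budget gone in {hours_to_exhaustion(rate):.0f} h")
```

Under these assumptions the budget burns about five times faster than sustainable, so a 30-day budget lasts roughly six days, and every pod can report Ready the whole time.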
Why “pods are Ready” lies to you
- Ready only means the container started and passed its readiness probe.
- It says nothing about P95 latency, error rate, or whether your SLO slice is holding.
- A canary can stay “green” because the metrics are too coarse to show the regression.
- No labels, no slices → blast radius stays invisible.
Three fixes
1. Pre‑scale before the first canary step
- Bump replicas before traffic shifts.
- HPA catches up from a safe baseline instead of a saturated one.
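One rough way to pick the pre-scale number is to size the canary for the traffic share it is about to receive, plus headroom. The sketch below assumes you know total request rate and a per-pod capacity from load tests; the function name, the 1.3 headroom factor, and the example figures are all hypothetical.

```python
import math


def prescaled_replicas(
    total_rps: float,
    canary_weight: float,
    per_pod_rps: float,
    headroom: float = 1.3,  # assumption: 30% spare capacity
    min_replicas: int = 1,
) -> int:
    """Estimate canary replicas needed before traffic shifts to it."""
    if per_pod_rps <= 0:
        raise ValueError("per_pod_rps must be positive")
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0 and 1")
    needed = total_rps * canary_weight * headroom / per_pod_rps
    return max(min_replicas, math.ceil(needed))


# Assumed numbers: 2000 req/s overall, first step sends 20% to the canary,
# each pod sustains about 100 req/s within latency bounds.
print(prescaled_replicas(total_rps=2000, canary_weight=0.2, per_pod_rps=100))
```

How the number gets applied depends on your setup; one option is raising the HPA's `minReplicas` before the rollout starts and lowering it again afterwards.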
2. Match step interval to your HPA scale‑up window
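- The default stabilization window is 0 seconds for scale-up and 300 seconds (5 minutes) for scale-down; the effective scale-up window also includes the HPA sync period, metrics lag, and pod startup time.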
- Check yours with:
kubectl get hpa -o yaml
- Promoting before that window closes is promoting blind.
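As a rough way to size the step interval, the sketch below adds up the delays between load arriving and new pods serving traffic. The class, the four component values, and the safety factor are illustrative assumptions; substitute what your cluster and `kubectl get hpa -o yaml` actually report.

```python
import math
from dataclasses import dataclass


@dataclass
class ScaleUpTiming:
    """Delays (in seconds) between added load and extra capacity being ready."""

    hpa_sync_period_s: float       # how often the HPA re-evaluates
    metrics_lag_s: float           # time for load to show up in metrics
    stabilization_window_s: float  # scale-up stabilization from your HPA spec
    pod_startup_s: float           # scheduling, image pull, readiness probe

    def reaction_time_s(self) -> float:
        return (
            self.hpa_sync_period_s
            + self.metrics_lag_s
            + self.stabilization_window_s
            + self.pod_startup_s
        )


def min_step_interval_s(timing: ScaleUpTiming, safety_factor: float = 2.0) -> int:
    """Shortest canary pause that still lets the autoscaler react."""
    if safety_factor < 1.0:
        raise ValueError("safety_factor must be at least 1.0")
    return math.ceil(timing.reaction_time_s() * safety_factor)


# Assumed values for illustration only.
timing = ScaleUpTiming(
    hpa_sync_period_s=15, metrics_lag_s=30, stabilization_window_s=0, pod_startup_s=45
)
print(min_step_interval_s(timing))  # seconds to pause between canary steps
```

A step pause shorter than this estimate means the next traffic shift lands before the capacity from the previous one exists.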
3. Gate steps on SLI health
- Wire an AnalysisRun in Argo Rollouts that checks error rate and P95 latency are within SLO bounds before promoting.
- If the SLI is still recovering, promotion waits.
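An AnalysisRun would normally pull these numbers from a metrics provider such as Prometheus. The sketch below only illustrates the pass/fail logic behind such a gate, in plain Python; the `Request` type, the function names, and the thresholds are hypothetical and not part of Argo Rollouts.

```python
from typing import NamedTuple, Sequence


class Request(NamedTuple):
    latency_ms: float
    ok: bool  # False for responses counted as errors by the SLI


def p95(latencies: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, using integer math to avoid float drift."""
    if not latencies:
        raise ValueError("no latency samples")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def sli_within_slo(
    requests: Sequence[Request], max_error_rate: float, max_p95_ms: float
) -> bool:
    """True only if both error rate and P95 latency are inside SLO bounds."""
    if not requests:
        return False  # no traffic observed is not evidence of health
    error_rate = sum(1 for r in requests if not r.ok) / len(requests)
    latency = p95([r.latency_ms for r in requests])
    return error_rate <= max_error_rate and latency <= max_p95_ms


# Assumed sample: 100 canary requests, latencies 1..100 ms, two failures.
sample = [Request(float(i), ok=(i > 2)) for i in range(1, 101)]
print(sli_within_slo(sample, max_error_rate=0.01, max_p95_ms=250.0))
```

Treating an empty sample as a failure is a deliberate choice here: a canary that received no traffic has not demonstrated anything.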
The rule
Promote only when the canary holds the SLO slice that matters for a fixed window.
Any breach of the SLO bounds during that window triggers an automatic rollback.
Rollout speed and autoscaler reaction time are tuned independently. That gap is where the error budget burns before anyone pages.
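The promotion rule above can be sketched as a small decision function. This is one strict reading of it, assuming periodic SLO checks that each return pass or fail; the names and the window size are hypothetical, and a real controller may tolerate a limited number of failures instead of zero.

```python
from enum import Enum
from typing import Sequence


class Decision(Enum):
    PROMOTE = "promote"
    WAIT = "wait"
    ROLLBACK = "rollback"


def decide(checks: Sequence[bool], window_size: int) -> Decision:
    """Apply the rule to the SLO check results collected so far.

    checks: one entry per evaluation interval, True if the SLO held.
    window_size: number of consecutive passing checks required to promote.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not all(checks):
        return Decision.ROLLBACK  # any breach inside the window ends the canary
    if len(checks) < window_size:
        return Decision.WAIT  # window not yet complete
    return Decision.PROMOTE


# Assumed window: five one-minute checks.
print(decide([True, True, True], window_size=5))         # still waiting
print(decide([True] * 5, window_size=5))                 # held for the window
print(decide([True, True, False, True], window_size=5))  # breached
```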
What is the step interval on your rollouts right now?