Canary vs Rolling Update
Source: Dev.to
Traffic Routing in Kubernetes
Kubernetes routes traffic by pod IPs, not by percentage.
A Service (virtual IP) forwards requests to any Ready pod; it has no knowledge of:
- application versions
- risk levels
- traffic percentages
```
User
  ↓
Service (virtual IP)
  ↓
Random Pod IP
```
If you have 10 pods and a new v2 pod is added, you now have 11 pods. The Service may send all of a single user's requests to the same pod (connection reuse), so that user could hit v2 for every one of their requests. In practice, a single pod's share of traffic can vary widely, anywhere from 0% to 80%.
Kubernetes makes no promise about which pods receive new code or how many users are affected.
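A sketch of why the Service is version-blind: its selector typically matches only an app label, which both v1 and v2 pods carry (all names here are illustrative):

```yaml
# A Service that selects pods by app label only.
# Pods from both the v1 and v2 Deployments carry app: myapp,
# so the Service treats them as interchangeable endpoints.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp        # no version label here
  ports:
    - port: 80
      targetPort: 8080
```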
Rolling Updates
Rolling updates guarantee that:
- Pods are not terminated all at once.
- Overall capacity stays up.
They do not guarantee:
- Which users get the new version.
- How many users are affected.
- That errors are limited.
Thus, rolling updates are availability‑safe, not user‑safe.
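The availability guarantees come from the Deployment's rolling-update parameters; a minimal sketch (replica count and surge values illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 extra pod during the rollout
      maxUnavailable: 0    # never drop below 10 Ready pods
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:v2
```

Note that nothing here controls *which users* reach the new pods; the settings only bound how many pods churn at once.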
Example scenario
- v1 works, v2 introduces a token‑validation bug.
- One v2 pod starts.
- The first real user hits that pod → login fails.
- The user retries → the same pod (session stickiness) → user locked out.
- A support ticket is opened, the deployment is rolled back, but damage is already done.
Even a single failure can be catastrophic for authentication or payment systems.
Canary Deployments
Canary deployments do not rely on randomness. You explicitly specify the traffic share for the new version, e.g.:
- “Only 5% of traffic goes to v2.”
Out of 1,000 requests:
- 950 → v1
- 50 → v2
This is enforced routing, not “maybe” or “if lucky”.
How it is implemented
- Ingress rules with weighted paths
- Load‑balancer weight settings
- Service‑mesh traffic splitting
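As one concrete option, the NGINX Ingress controller expresses a weighted canary through annotations; a sketch matching the 5% example above (host and service names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # ~5% of requests go to v2
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-v2   # canary Service; the main Ingress keeps pointing at v1
                port:
                  number: 80
```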
Replica‑based canary (not true percentage)
```
4 pods running v1
1 pod running v2   # ≈20% of traffic, but only on average
```
Using replicas alone only reduces probability; it does not guarantee traffic limits. Production environments therefore use explicit traffic weighting.
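A service mesh enforces exact weights regardless of replica counts; a sketch using an Istio VirtualService (names illustrative, and a matching DestinationRule defining the v1/v2 subsets is assumed):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp
  http:
    - route:
        - destination:
            host: myapp
            subset: v1
          weight: 95   # enforced share for the stable version
        - destination:
            host: myapp
            subset: v2
          weight: 5    # enforced share for the canary
```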
Benefits
- If a v2 pod hits an external timeout (e.g., the Stripe API), only the small canary slice of traffic is affected.
- Latency spikes are detected early; the canary can be stopped, keeping the remaining 95% of customers on the stable version.
Comparison: Rolling Update vs. Canary
| Aspect | Rolling Update | Canary |
|---|---|---|
| Who gets new version? | Anyone (random) | Only selected traffic (controlled) |
| Traffic share | Random distribution | Controlled percentage |
| Risk size | Unknown | Known & limited |
| Rollback damage | Already happened (may be large) | Minimal (limited exposure) |
| Typical use case | Safe, low‑risk changes | Risky or high‑impact changes |
Readiness Probes
A readiness probe answers a single question:
“Should Kubernetes send traffic to this pod?”
It checks only technical availability (process alive, port open, HTTP 200, etc.). It does not verify business logic, latency, external dependencies, or correctness.
Typical probe definitions
```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
```

or

```yaml
readinessProbe:
  tcpSocket:
    port: 8080
```
- Probe passes → pod is marked Ready and added to the Service.
- Probe fails → pod is removed from the Service.
A pod can return 200 OK on /health while its core functionality (e.g., payments, Kafka consumption) is broken.
How a Canary Deployment Works
- Readiness – Pod passes the readiness probe → becomes eligible for traffic.
- Canary traffic (e.g., 5 %) – Requests are routed to the new version.
- Monitoring – Real‑user behavior, latency, error rates, and external system responses are observed.
- Decision
- If metrics are healthy → increase traffic share.
- If errors exceed thresholds → stop the canary, keep the stable version.
Without a canary, 100 % of traffic would be exposed to any defect.
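The staged decision loop above maps directly onto progressive-delivery tooling; a sketch using Argo Rollouts (weights and pause durations are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 5              # start with 5% canary traffic
        - pause: {duration: 10m}    # observe metrics before promoting
        - setWeight: 25
        - pause: {duration: 10m}
        - setWeight: 100            # full promotion if metrics stay healthy
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:v2
```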
Interview‑style Summary
- Readiness probes verify that a pod is technically ready to receive traffic.
- Canary deployments validate that the new version is safe for users by exposing a controlled slice of real traffic and observing behavior.
Both mechanisms are complementary:
- Readiness protects the platform by keeping Kubernetes from routing traffic to a pod that isn’t able to serve yet.
- Canary protects the business and users from functional regressions.
Think of it as:
- Readiness = “Is the engine started?”
- Canary = “Is the car safe to drive at highway speeds?”