Chaos Engineering: Breaking Production on Purpose (So It Never Breaks Again)
Source: Dev.to
The 2 AM Pager Story
It’s 2:07 AM.
🚨 Production Down – High Error Rate
All the right boxes were checked and the system is “highly available,” yet one dead node has taken production down.
That night teaches a painful truth: a system that looks reliable on paper doesn’t necessarily survive real‑world failure.
What is Chaos Engineering?
Chaos Engineering is the practice of intentionally breaking things to learn how your system behaves under failure.
In simple terms: create controlled failures on your terms—preferably during safe windows—so production doesn’t teach you lessons at 2 AM.
Why It Exists
- Modern systems are distributed.
- Failures are inevitable.
- Humans are bad at predicting edge cases.
Chaos Engineering accepts reality instead of fighting it.
Limitations of Traditional Testing
| Traditional testing assumes… | Reality shows… |
|---|---|
| Dependencies behave normally | Databases slow down, not just crash |
| Networks are reliable | Networks lie, latency spikes |
| Latency is predictable | Third‑party APIs time out randomly |
| Partial failures won’t cascade | Distributed systems fail in creative ways |
Most outages stem from unknown unknowns, not code bugs. Chaos Engineering uncovers those unknowns before users do.
Defining a “Healthy” System
- Request success rate
- Latency percentiles
- Error budgets
- Business KPIs
Without clear steady‑state metrics, you’re just breaking stuff blindly.
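These metrics are only useful if they are checked the same way every time, so it pays to script them. A minimal sketch, assuming a Prometheus server reachable in-cluster and hypothetical metric names (swap in whatever your services actually export):

```bash
#!/usr/bin/env bash
# Steady-state check sketch. The Prometheus address and metric names below are
# assumptions; adjust them to your own monitoring setup.
PROM="http://prometheus.monitoring:9090"

# Request success rate over the last 5 minutes (non-5xx responses / all responses)
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | jq -r '.data.result[0].value[1]'

# p99 latency over the last 5 minutes, in seconds
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))' \
  | jq -r '.data.result[0].value[1]'
```

Record these numbers before, during, and after every experiment; if either drifts outside the error budget, abort and roll back.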
Real‑World Failure Types (Not Mocks or Simulations)
- Killing pods
- Adding latency
- Breaking network calls
- Throttling CPU
These are performed in production with real traffic, real data, and real chaos, but gradually: during safe, scheduled windows, with rollback plans ready.
Example Experiments
```bash
# Kill a pod
kubectl delete pod payment-service-xyz
```
- Add 500 ms latency between services
- Drop 10 % of packets
These experiments expose retry storms, timeout misconfigurations, database slowdowns, Redis unavailability, third‑party 500 errors, CPU throttling, memory pressure, full disks, and more.
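For the latency and packet-loss faults specifically, the usual low-level tool on Linux is tc with its netem queueing discipline. A rough sketch, assuming root access on the target node and that eth0 is the interface you want to degrade:

```bash
# Add 500 ms of delay to all outbound traffic on eth0
tc qdisc add dev eth0 root netem delay 500ms

# Or, as a separate experiment, drop 10% of packets:
# tc qdisc add dev eth0 root netem loss 10%

# Rollback: remove the fault as soon as the observation window ends
tc qdisc del dev eth0 root
```

In Kubernetes you rarely run tc by hand; the chaos tools listed below typically wrap the same primitives with scoping, scheduling, and automatic rollback.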
How to Run Chaos Experiments
- Pick a critical service
- Define steady‑state metrics (success rate, latency, error budget)
- Start in non‑prod
- Kill a single pod
- Observe everything
- Fix weaknesses
- Repeat and expand
- Move to prod during low traffic
- Add latency to a dependency
- Simulate DB slowness
Small, controlled chaos beats no chaos at all.
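Wired together, the first few steps can be one short script. A sketch, assuming kubectl access and a hypothetical payment-service deployment in a payments namespace:

```bash
#!/usr/bin/env bash
# Minimal pod-kill experiment: verify steady state, kill one pod, watch recovery.
# The namespace and deployment names are placeholders for your own service.
set -euo pipefail

check_steady_state() {
  # Placeholder: plug in the Prometheus queries from the steady-state sketch
  # and exit non-zero if success rate or latency fall outside the error budget.
  echo "steady-state check: OK (placeholder)"
}

check_steady_state        # never start chaos on an already-unhealthy system

# Pick one pod of the target deployment and delete it
POD=$(kubectl -n payments get pods -l app=payment-service \
        -o jsonpath='{.items[0].metadata.name}')
kubectl -n payments delete pod "$POD"

# Observe recovery: the deployment should return to its desired replica count
kubectl -n payments rollout status deployment/payment-service --timeout=120s

check_steady_state        # confirm metrics are back at baseline
```

If the rollout check or either steady-state check fails, that failure is the experiment's finding: fix it before expanding the blast radius.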
Common Chaos Tools
- Chaos Monkey – Netflix’s original chaos tool
- LitmusChaos – Kubernetes‑native, open source
- Gremlin – controlled, enterprise‑grade chaos
- AWS FIS – native AWS fault injection
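Whichever you pick, the managed tools follow the same loop: a reviewed experiment template, a guarded run, and a stop condition tied to your steady-state metrics. For instance, with the AWS CLI and a pre-created FIS experiment template (the IDs below are placeholders):

```bash
# List the experiment templates your team has defined
aws fis list-experiment-templates

# Start a run of one of them (template ID is a placeholder)
aws fis start-experiment --experiment-template-id EXT123EXAMPLE

# Stop it early if steady-state metrics start to slip (experiment ID is a placeholder)
aws fis stop-experiment --id EXP456EXAMPLE
```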
Tools don’t do chaos engineering; the mindset does.
Example Scenario (Kubernetes + Spring Boot)
- Stack: Java Spring Boot microservice, Kubernetes (EKS), HPA enabled, Redis cache, PostgreSQL DB
- Chaos: Kill 50 % of pods during peak traffic
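One blunt way to express that 50% kill, assuming the pods carry a hypothetical app=payment-service label in a payments namespace:

```bash
# Delete roughly half of the payment-service pods in one shot
PODS=$(kubectl -n payments get pods -l app=payment-service -o name | shuf)
TOTAL=$(echo "$PODS" | wc -l)
echo "$PODS" | head -n $(( TOTAL / 2 )) | xargs kubectl -n payments delete
```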
Observed failures:
- Connection pool exhaustion
- Retries hammering the DB
- Latency spikes beyond SLA
- No circuit breaker, aggressive retries, poor timeout config
Mitigations added:
- Resilience4j (circuit breaker, tuned retries & timeouts)
- Improved readiness probes
Result: when the experiment was rerun, the system survived the same chaos, proof that chaos engineering works when applied responsibly.
Principles of Chaos Engineering
- Hypothesis‑driven – start with a clear hypothesis about steady state.
- Measured – collect metrics before, during, and after the experiment.
- Reversible – ensure you can roll back quickly.
- Randomness with purpose – avoid “random breaking” that is merely bad ops.
You need solid monitoring, easy rollback, error budgets, and an on‑call team that understands the system. Without observability, chaos is just noise.
Benefits
- Fewer production outages
- Faster incident response
- Safer deployments
- Better system design
- More confident on‑call engineers
You stop hoping things work and know they do.
Getting Started
- Select a low‑risk target (e.g., a non‑critical microservice).
- Define steady‑state metrics (success rate, latency, error budget).
- Run a small experiment (kill one pod, add latency).
- Observe, learn, and iterate.
Gradually increase scope and move experiments into production during low‑traffic windows.
Closing Thought
If you killed one thing in your production system today, what would break first?
Drop your thoughts, war stories, or doubts in the comments—let’s learn from each other before the pager rings again.