Question for teams doing chaos engineering: how do you choose experiment targets?
Source: Dev.to
Question
While working on a side project related to service reliability, I ran into a question I'd like to put to people who actually run chaos experiments.
Most chaos engineering discussions focus on the types of experiments (latency injection, pod failure, network faults, etc.).
A less obvious question is how teams decide where to run experiments in the first place. In a system with many microservices, there are many possible targets.
How Do Teams Choose Experiment Targets?
Do teams typically:
- rotate through services over time
- prioritize ones that caused incidents
- focus on critical dependency paths
- rely on platform/SRE intuition
- something else?
I’m interested to hear how this works in real environments.