SRE Weekly Issue #514
Source: SRE Weekly
Benjamin Barton — Datadog
Finally! Someone actually explaining how they test their SRE agent. Having a testing methodology is table stakes. Showing their work helps us decide whether we can trust the tool. With so many SRE agents floating around, it’s quite surprising to me that this kind of article is so rare.
Patrick Reynolds — PlanetScale
An enlightening deep dive into the way this Postgres resource management system evaluates the cost of queries in order to shed resource‑intensive ones.
Art Kondratiev — Uptime Labs
If you’ve ever been in an incident where communication suddenly went quiet and access got restricted, this article explains why. The author breaks down five fundamental ways security incident response diverges from outage response — and why the instincts that make you effective at one can actively work against you in the other.
Oreoluwa Omoike — DZone
Security and reliability are inextricably intertwined. Examples: reliability failures leave security temporarily weak and vulnerable, and security changes have caused a number of recent high‑profile outages.
Ankush Madaan — DZone
Some timely reminders about the realities of how autoscaling actually works in Kubernetes. It’s all about tuning your mental model.
David Iyanu Jonathan — DZone
There’s a limit to how far parallelism can get you, and it’s set by the fraction of your workload that is necessarily serial. In practice, microservices that share a database or coordinate on every request are a distributed monolith with extra latency and a much harder debugging story.
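To put rough numbers on that limit, here is a minimal sketch using Amdahl's law. That framing is my own reading of the summary above and is not confirmed as the article's method. The function names and the 10% serial share are hypothetical, chosen only for illustration.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Best-case speedup under Amdahl's law.

    serial_fraction: share of the work that cannot be parallelized (0.0 to 1.0).
    workers: number of parallel workers (services, cores, replicas).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # Serial work takes the same time no matter how many workers exist;
    # only the parallel share is divided among them.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def speedup_ceiling(serial_fraction: float) -> float:
    """Limit of the speedup as the worker count grows without bound."""
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical workload: 10% of each request is serialized,
    # for example on a shared database.
    serial = 0.10
    for workers in (1, 2, 8, 64, 1024):
        print(f"{workers:>5} workers: {amdahl_speedup(serial, workers):.2f}x")
    print(f"ceiling: {speedup_ceiling(serial):.2f}x")
```

With a 10% serial share, the ceiling is 10x however many workers you add, which is one way to see why services that coordinate on every request stop benefiting from being split up.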
Parveen Saini — DZone
This is a great story, and I really liked the section on why traditional reliability techniques (autoscaling, circuit breakers, and rate limits) weren’t enough.