SRE is the BEST Thing Ever

Published: January 30, 2026 at 09:38 AM EST

5 min read

Source: Dev.to

If you don’t know what SRE is, don’t worry… I’ve got you.

I’m Jairo Jr., a Software Engineer at Mercado Livre, based in Brazil, and over the last few months I’ve been studying SRE.

SRE changed the way I see production.

Before SRE, production for me was:

✅ “deploy is done”
✅ “feature is working”
✅ “let’s move to the next ticket”

After diving deeper into SRE, I started to see production as:

“Ok… but what if this breaks at 3 AM?”

So… what is SRE?

SRE stands for Site Reliability Engineering. It originated at Google around 2003 when they realized a simple truth:

If your product grows, your problems grow too.

The real pain isn’t just having problems (every system does). It’s dealing with:

problems happening every week
broken user experience
stressed people
on‑call turning into a nightmare
the company losing money while you debug logs like a detective 🕵️‍♂️

SRE is an approach to making software:

✅ scalable
✅ reliable
✅ measurable
✅ less “random”

SRE is not just DevOps with a fancy name

Many people think “SRE is DevOps, right?” Not exactly. SRE shares DevOps goals but pursues them with an engineering mindset: reliability is treated as a problem you solve with software and measurement.

Instead of solving problems manually forever, SRE asks:

“Can we automate this?”
“Can we predict this?”
“Can we detect it before the customer does?”
“Can we recover faster?”

It shifts the discipline from reactive to proactive.

The day‑to‑day example (the real one)

Imagine this classic situation:

You deploy a new feature.
Everything looks OK.
Ten minutes later:
- latency spikes 📈
- some requests start failing
- your metrics dashboard looks like a Christmas tree 🎄

You might hear the magic sentence, “It works for me…”, but for users:

❌ it’s not.

SRE helps you answer quickly:

What is broken?
When did it start?
Is it affecting everyone or just some customers?
Is the failure in my service or a dependency?
What changed?
How fast can I roll back?

Without SRE culture, you usually discover issues via:

Slack messages
Customer complaints
A manager asking “what is happening?” 😭

SRE teaches you to measure reliability

SRE is grounded in numbers. Instead of saying:

❌ “Our service is very stable”

you say:

✅ “Our service is stable: we are meeting our 99.9 % SLO and we are still within the error budget.”

This makes reliability a concrete conversation with:

engineers
product teams
managers
business stakeholders

Everyone understands it.
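As a minimal sketch of what "grounded in numbers" can look like, the snippet below computes an availability figure from request counts and compares it with a 99.9 % target. The function names and the sample counts are hypothetical, made up for illustration, and treating "no traffic" as fully available is an assumption.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(good_requests: int, total_requests: int, slo: float = 0.999) -> bool:
    """True when the measured availability is at or above the SLO target."""
    return availability(good_requests, total_requests) >= slo


# Hypothetical month: 1,000,000 requests, 600 of them failed.
measured = availability(999_400, 1_000_000)
print(f"availability: {measured:.4%}, SLO met: {meets_slo(999_400, 1_000_000)}")
```

With a number like this on the table, "is the service stable?" becomes a yes/no question instead of an opinion.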

SLO and Error Budget (the part that hits different)

If your SLO is 99.9 % availability per month, you can afford roughly 43 minutes of downtime per month (0.1 % of a 30‑day month). This is your error budget.
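The arithmetic behind that number is short enough to sketch. The helper name below is hypothetical, and the 30‑day month is an assumption; a 31‑day month gives a slightly larger budget.

```python
def downtime_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime allowed in the window for a given availability SLO."""
    total_minutes = days * 24 * 60      # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo)    # the share of time allowed to fail


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {downtime_budget_minutes(target):.1f} minutes per month")
```

Each extra "nine" cuts the budget by a factor of ten, which is why the target is worth discussing before anyone promises it.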

Rule of thumb:

✅ Inside the budget → you can deploy more.
❌ Budget exhausted → stop pushing risky changes and focus on reliability.

In other words, SRE says: “Move fast… but not stupid fast.”
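One way to turn the rule of thumb into something a pipeline could check is sketched below. It is only an illustration: the function names are hypothetical, the budget is counted in failed requests rather than minutes, and a real policy would probably have more than two states.

```python
def remaining_budget(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1 - slo) * total
    if allowed_failures == 0:
        return 0.0  # no traffic or a 100 % target: nothing left to spend
    actual_failures = total - good
    return 1 - actual_failures / allowed_failures


def can_ship_risky_change(slo: float, good: int, total: int) -> bool:
    """Inside the budget: keep deploying. Budget exhausted: focus on reliability."""
    return remaining_budget(slo, good, total) > 0


# Hypothetical month so far: 1,000,000 requests, 600 failures, 99.9 % SLO.
print(can_ship_risky_change(0.999, 999_400, 1_000_000))
```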

SRE is the reason you stop being a hero

Without SRE, a common culture emerges:

Something breaks.
One person wakes up, fixes everything, and becomes the “hero”.

That trap makes the system dependent on a human. Humans:

get tired
make mistakes
get sick
take vacations
change jobs

SRE pushes you to build systems that don’t need heroes. It’s not about “who can fix it fastest” but about:

✅ Why this happened
✅ What we improve
✅ How we avoid it again
✅ How we reduce impact next time

On‑call is not the problem (bad on‑call is)

On‑call is part of the game, but bad on‑call kills teams. Bad on‑call looks like:

Useless alerts that page you constantly
No runbooks
No clear dashboards
No rollback plan
No ownership
Every incident feels like the first time

Good SRE makes on‑call easier by forcing the team to build:

Clear monitoring
Meaningful alerts
Fast recovery procedures
A solid incident process

Thus the team moves from “panic mode” to “process mode”.
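As a rough sketch of what a "meaningful alert" could look like, the snippet below pages only when the error budget is burning fast over both a long and a short window, so a brief blip or an already recovered incident does not wake anyone up. The function names are hypothetical, and the 14.4 threshold is a commonly cited starting point for a one‑hour window against a 30‑day budget; treat it as an assumption to tune, not a rule.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1 - slo  # e.g. 0.001 for a 99.9 % SLO
    return error_rate / budget


def should_page(error_rate_1h: float, error_rate_5m: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both the 1-hour and the 5-minute windows burn too fast."""
    return (burn_rate(error_rate_1h, slo) >= threshold
            and burn_rate(error_rate_5m, slo) >= threshold)


# Hypothetical readings: 2 % errors over the last hour, 3 % over the last 5 minutes.
print(should_page(error_rate_1h=0.02, error_rate_5m=0.03))
```

Tying the page to budget consumption instead of a raw CPU or latency threshold is what keeps the alert connected to something users actually feel.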

The real goal: protect your users

At the end of the day, users don’t care about:

Kubernetes, Kafka, retries, p95 latency, cache invalidation

They care about:

✅ The app works
✅ Payments go through
✅ Screens load fast
✅ Orders are confirmed

SRE helps you deliver that every day, not just on your local machine.

Why I think SRE is the BEST thing ever

Because it changes your mindset.

SRE makes you stop thinking only about:

🧩 “How to build this feature”

and start thinking about:

🔥 “How to keep this feature alive in production for millions of users”

And that’s a different level of engineering.
