SRE is the BEST Thing Ever

Published: January 30, 2026 at 09:38 AM EST
4 min read
Source: Dev.to

If you don’t know what SRE is, don’t worry… I’ve got you.

I’m Jairo Jr., Software Engineer at Mercado Livre, based in Brazil, and over the last few months I’ve been studying SRE.

SRE changed the way I see production.

Before SRE, production for me was:

  • ✅ “deploy is done”
  • ✅ “feature is working”
  • ✅ “let’s move to the next ticket”

After diving deeper into SRE, I started to see production as:

“Ok… but what if this breaks at 3 AM?”

So… what is SRE?

SRE stands for Site Reliability Engineering. It originated at Google around 2003 when they realized a simple truth:

If your product grows, your problems grow too.

The real pain isn’t just having problems (every system does). It’s dealing with:

  • problems happening every week
  • broken user experience
  • stressed people
  • on‑call turning into a nightmare
  • the company losing money while you debug logs like a detective 🕵️‍♂️

SRE is an approach to make software:

  • ✅ scalable
  • ✅ reliable
  • ✅ measurable
  • ✅ less “random”

SRE is not just DevOps with a fancy name

Many people ask, “SRE is DevOps, right?” Not exactly. SRE shares DevOps goals but approaches them with an engineering mindset.

Instead of solving problems manually forever, SRE asks:

  • “Can we automate this?”
  • “Can we predict this?”
  • “Can we detect it before the customer does?”
  • “Can we recover faster?”

It shifts the discipline from reactive to proactive.

The day‑to‑day example (the real one)

Imagine this classic situation:

  1. You deploy a new feature.
  2. Everything looks OK.
  3. Ten minutes later:
    • latency spikes 📈
    • some requests start failing
    • your metrics dashboard looks like a Christmas tree 🎄

You might hear the magic sentence: “For me it’s working…”, but for users:

❌ it’s not.

SRE helps you answer quickly:

  • What is broken?
  • When did it start?
  • Is it affecting everyone or just some customers?
  • Is the failure in my service or a dependency?
  • What changed?
  • How fast can I roll back?

Without SRE culture, you usually discover issues via:

  • Slack messages
  • Customer complaints
  • A manager asking “what is happening?” 😭

SRE teaches you to measure reliability

SRE is grounded in numbers. Instead of saying:

❌ “Our service is very stable”

you say:

✅ “Our SLO is 99.9% availability, and this month we are still within the error budget.”

This makes reliability a concrete conversation with:

  • engineers
  • product teams
  • managers
  • business stakeholders

Everyone understands it.
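
To make "grounded in numbers" concrete, here is a minimal sketch of how an availability number and the share of error budget spent could be computed from request counts. The function names and the traffic numbers are hypothetical, chosen only for illustration; a real setup would pull these counts from its monitoring system.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (the measured availability)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def budget_consumed(good_requests: int, total_requests: int, slo: float) -> float:
    """Share of the error budget already spent (1.0 means fully spent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The SLO leaves room for (1 - slo) of all requests to fail.
    allowed_failures = total_requests * (1.0 - slo)
    failed = total_requests - good_requests
    return failed / allowed_failures


# Hypothetical numbers for one month of traffic.
sli = availability(999_500, 1_000_000)
spent = budget_consumed(999_500, 1_000_000, slo=0.999)
print(f"availability={sli:.4%}, error budget spent={spent:.0%}")
```

With these made-up numbers, 500 failed requests out of an allowance of about 1,000 means roughly half of the budget is gone, which is a sentence anyone in the room can follow.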

SLO and Error Budget (the part that hits different)

If your SLO is 99.9% availability per month, you can afford roughly 43 minutes of downtime per month (0.1% of a 30-day month). This is your error budget.
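
The arithmetic behind that number is small enough to sketch. This assumes a 30-day month and counts the budget purely as downtime minutes; the function name `error_budget_minutes` is hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    total_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return total_minutes * (1.0 - slo)


print(f"{error_budget_minutes(0.999):.1f}")  # prints 43.2
```

Each extra "nine" shrinks the budget by a factor of ten, so the same calculation for 99.99% leaves only a little over 4 minutes per month.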

Rule of thumb:

  • Inside the budget → you can deploy more.
  • Budget exhausted → stop pushing risky changes and focus on reliability.
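
The rule of thumb above can be written as a tiny check. This is only a sketch of the idea, not a real release policy: `release_decision` and its two return values are hypothetical, and real teams usually add more nuance (for example, slowing down before the budget hits zero).

```python
def release_decision(budget_minutes: float, downtime_minutes: float) -> str:
    """Return "deploy" while error budget remains, "freeze" once it is spent."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    if downtime_minutes < 0:
        raise ValueError("downtime_minutes cannot be negative")
    remaining = budget_minutes - downtime_minutes
    if remaining > 0:
        return "deploy"  # inside the budget: keep shipping
    return "freeze"      # budget exhausted: focus on reliability


# Hypothetical month: 43.2 minutes of budget, 10 minutes of downtime so far.
print(release_decision(43.2, 10.0))
```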

In other words, SRE says: “Move fast… but not stupid fast.”

SRE is the reason you stop being a hero

Without SRE, a common culture emerges:

  • Something breaks.
  • One person wakes up, fixes everything, and becomes the “hero”.

That is a trap: it makes the system depend on a human. Humans:

  • get tired
  • make mistakes
  • get sick
  • take vacations
  • change jobs

SRE pushes you to build systems that don’t need heroes. It’s not about “who can fix faster” but about:

  • ✅ Why this happened
  • ✅ What we improve
  • ✅ How we avoid it again
  • ✅ How we reduce impact next time

On‑call is not the problem (bad on‑call is)

On‑call is part of the game, but bad on‑call kills teams. Bad on‑call looks like:

  • Useless alerts that page you constantly
  • No runbooks
  • No clear dashboards
  • No rollback plan
  • No ownership
  • Every incident feels like the first time

Good SRE makes on‑call easier by forcing the team to build:

  • Clear monitoring
  • Meaningful alerts
  • Fast recovery procedures
  • A solid incident process

Thus the team moves from “panic mode” to “process mode”.

The real goal: protect your users

At the end of the day, users don’t care about:

  • Kubernetes, Kafka, retries, p95 latency, cache invalidation

They care about:

  • ✅ The app works
  • ✅ Payments go through
  • ✅ Screens load fast
  • ✅ Orders are confirmed

SRE helps you deliver that every day, not just on your local machine.

Why I think SRE is the BEST thing ever

Because it changes your mindset.

SRE makes you stop thinking only about:

🧩 “How to build this feature”

and start thinking about:

🔥 “How to keep this feature alive in production for millions of users”

And that’s a different level of engineering.
