SLIs, SLOs, SLAs: A Guide to the SRE's Secret Sauce

Published: February 11, 2026, 10:53 AM GMT+9
3 min read
Original: Dev.to

If you ever want to be an SRE, a real site reliability wizard, you have to speak the language of the trade. It isn’t “install Prometheus” or “deploy Kubernetes.” It’s SLIs, SLOs, and SLAs, the holy trinity of keeping services alive and keeping the boss off your back, plus the error budgets that fall out of them.

Service Level Indicator (SLI)

An SLI is the “street‑level gossip” of your service: it tells you how the service actually behaves from the user’s point of view, not from some nerdy server graph.

  • Latency SLI – How fast does your social‑media feed load for a user?
  • Error‑rate SLI – How many posts fail to load or error out?
  • Availability SLI – What share of requests does your API actually answer successfully?

Users don’t care about CPU load, memory usage, or thread pools. Those metrics help you debug, but they don’t tell you whether users are happy. SLIs are the numbers that matter to humans; they’re your reality check. Think of SLIs as the pulse of your service: when the pulse drops, trouble is coming.
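As a rough sketch of how the three SLIs above could be computed from raw request records: the `Request` type, the `compute_slis` helper, and the 500 ms threshold below are all illustrative assumptions, not something a particular monitoring tool provides.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # how long the user waited for a response
    ok: bool           # True if the request succeeded from the user's side


def compute_slis(requests, latency_threshold_ms=500.0):
    """Return user-facing SLIs as fractions between 0 and 1."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the measurement window")
    # A request counts as "fast" only if it succeeded AND beat the threshold.
    fast = sum(1 for r in requests if r.ok and r.latency_ms < latency_threshold_ms)
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_sli": fast / total,                   # share of good, fast requests
        "error_rate_sli": failed / total,              # share of failed requests
        "availability_sli": (total - failed) / total,  # share served successfully
    }


# Hypothetical sample window: two fast successes, one slow success, one failure.
sample = [Request(120, True), Request(480, True), Request(900, True), Request(50, False)]
slis = compute_slis(sample)
```

Every number here comes from what the user experienced on each request; nothing about CPU or memory enters the calculation.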

Service Level Objective (SLO)

An SLO is the promise you make to yourself (or your team) about what’s acceptable.

  • Example 1: 99.9 % of requests to your checkout API should complete in under 500 ms.
  • Example 2: 99 % of posts in the social‑media feed should load correctly on the first try.

This isn’t about perfection; it’s about “good enough.” Trying to hit 100 % uptime is prohibitively expensive. Nobody cares about perfection; SREs care about manageable reliability.
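An SLO check boils down to comparing an SLI against a target. Here is a minimal sketch using the two example targets above; `meets_slo` and the event counts are made up for illustration.

```python
def meets_slo(good_events, total_events, target):
    """True if the observed share of good events is at or above the SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events >= target


# Example 1: checkout API, target 99.9 % of requests under 500 ms.
# Hypothetical window: 99,950 of 100,000 requests were fast enough (99.95 %).
checkout_ok = meets_slo(good_events=99_950, total_events=100_000, target=0.999)

# Example 2: feed posts, target 99 % loading correctly on the first try.
# Hypothetical window: 9,850 of 10,000 posts loaded (98.5 %), so the SLO is missed.
feed_ok = meets_slo(good_events=9_850, total_events=10_000, target=0.99)
```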

Service Level Agreement (SLA)

An SLA is the legal contract you make with paying users. If you fail, users can demand refunds or penalties.

  • Example 1: “If checkout API availability drops below 99.5 % in a month, we refund the transaction fee.”
  • Example 2: “If social‑media feed errors exceed 0.5 % for the month, we compensate premium users.”

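SLAs are the adult version of your SLOs: now lawyers are watching. Your internal metrics (SLIs, SLOs) are tools to avoid SLA violations, which is why teams usually set their SLOs stricter than the SLA. You want to miss your own target well before you owe anyone money.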
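One way to picture the relationship is to treat the internal SLO as an early-warning line that sits above the contractual SLA line. In this sketch, the 99.5 % SLA comes from the checkout example above, while the 99.9 % availability SLO and the `reliability_status` function are assumptions made for illustration.

```python
def reliability_status(availability, slo_target=0.999, sla_target=0.995):
    """Classify a monthly availability figure against the SLO and the SLA."""
    if availability < sla_target:
        return "sla_breached"  # contractual penalty territory
    if availability < slo_target:
        return "slo_missed"    # internal alarm, but no refund owed yet
    return "healthy"


status_good = reliability_status(0.9995)
status_warning = reliability_status(0.997)
status_breach = reliability_status(0.99)
```

The gap between the two thresholds is the room you have to react before the lawyers get involved.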

Error Budgets

Every SLO comes with an error budget.

  • Example: An SLO of 99.9 % of checkout requests < 500 ms gives you a 0.1 % error budget. That 0.1 % is the amount of failure you can tolerate before you’re in trouble.
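To make the arithmetic concrete, here is a small sketch of turning an SLO target into a count of allowed bad requests. The function names and the traffic figures are hypothetical; `Decimal` is used only to keep the percentages exact.

```python
from decimal import Decimal


def error_budget_events(slo_target, total_events):
    """Number of bad events the SLO allows; slo_target is a string like "0.999"."""
    return (1 - Decimal(slo_target)) * total_events


def budget_consumed(slo_target, total_events, bad_events):
    """Fraction of the error budget already spent (1 means fully spent)."""
    allowed = error_budget_events(slo_target, total_events)
    if allowed == 0:
        raise ValueError("a 100 % target leaves no error budget")
    return Decimal(bad_events) / allowed


# Assumed window: 1,000,000 checkout requests, 250 of them slow or failed.
allowed = error_budget_events("0.999", 1_000_000)   # 1,000 bad requests allowed
consumed = budget_consumed("0.999", 1_000_000, 250)  # a quarter of the budget spent
```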

Error budgets are decision‑making tools:

  • Burned through your error budget? Stop risky deployments and focus on stability.
  • Well within your error budget? Push new features and take calculated risks.

Error budgets let you balance velocity with reliability, turning firefighting into smart deployment decisions.
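The two rules above can be sketched as a tiny release gate. This is an illustrative policy, not a standard one: `deploy_decision` and its return labels are invented here, and real teams usually add intermediate states.

```python
def deploy_decision(budget_consumed_fraction):
    """Map error-budget consumption (1.0 = fully spent) to a release policy."""
    if budget_consumed_fraction < 0:
        raise ValueError("budget consumption cannot be negative")
    if budget_consumed_fraction >= 1.0:
        return "freeze"  # budget spent: stop risky deployments, work on stability
    return "ship"        # budget remaining: push features, take calculated risks


decision_quiet_month = deploy_decision(0.25)
decision_bad_month = deploy_decision(1.3)
```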

Core Truths

Concept       Meaning
SLI           How messed up is it right now?
SLO           How messed up is okay?
Error Budget  How much failure can I tolerate before flipping out?
SLA           How much messing around can get me sued?

Why You Give a Damn as an SRE

  • Measure first, fix second.
  • Focus on user‑visible metrics. CPU spikes only matter when users feel them; latency and error rates are what count.
  • Accept failures. Systems break, but an error budget lets you survive and deploy fast.
  • Automate prevention. Fighting the same fire over and over is for suckers.