SLIs, SLOs, SLAs: A Guide to the SRE Secret Sauce

Published: February 11, 2026, 10:53 AM GMT+9

3 min read

Source: Dev.to

If you ever want to be an SRE, a real site reliability wizard, you have to speak the language of the trade. It isn’t “install Prometheus” or “deploy Kubernetes.” It’s SLIs, SLOs, SLAs, and Error Budgets: the four concepts behind keeping services alive and keeping the boss off your back.

Service Level Indicator (SLI)

An SLI is the “street‑level gossip” of your service: it tells you how the service actually behaves from the user’s point of view, not from some nerdy server graph.

Latency SLI – How fast does your social‑media feed load for a user?
Error‑rate SLI – How many posts fail to load or error out?
Availability SLI – How often is your API completely unavailable?

Users don’t care about CPU load, memory usage, or thread pools. Those metrics help you diagnose a problem, but they don’t tell you whether users are having one. SLIs are the numbers that matter to humans; they’re your reality check. Think of SLIs as the pulse of your service: when the pulse drops, trouble is coming.
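As a rough illustration of the latency and error-rate SLIs above, here is a minimal sketch that computes both from a batch of request records. The Request type, its field names, and the 500 ms threshold are assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # how long the user waited for a response
    ok: bool           # True if the request succeeded from the user's point of view


def latency_sli(requests, threshold_ms=500.0):
    """Fraction of requests answered in under threshold_ms (hypothetical threshold)."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully healthy
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


def error_rate_sli(requests):
    """Fraction of requests that failed."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


sample = [
    Request(120, True),
    Request(480, True),
    Request(900, True),   # succeeded, but too slow
    Request(200, False),  # fast, but failed
]
print(latency_sli(sample))     # 0.75
print(error_rate_sli(sample))  # 0.25
```

Note that both numbers are ratios of user-facing events, which is what makes them comparable against a percentage target later.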

Service Level Objective (SLO)

An SLO is the promise you make to yourself (or your team) about what’s acceptable.

Example 1: 99.9 % of requests to your checkout API should complete in under 500 ms.
Example 2: 99 % of posts in the social‑media feed should load correctly on the first try.

This isn’t about perfection; it’s about “good enough.” Trying to hit 100 % uptime is prohibitively expensive, and each extra nine costs more than the last. Nobody cares about perfection; SREs care about manageable reliability.
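Checking an SLO like Example 1 comes down to comparing an observed ratio against the target. The sketch below assumes you already have counts of "good" and total requests for the measurement window; the function name and the zero-traffic rule are choices made for illustration.

```python
def meets_slo(good_events, total_events, target=0.999):
    """Return True if the observed fraction of good events is at or above the target."""
    if total_events == 0:
        return True  # assumption: a window with no traffic does not violate the SLO
    return good_events / total_events >= target


# 100,000 checkout requests in the window, 99.9 % target
print(meets_slo(99_950, 100_000))  # True  (99.95 % were under 500 ms)
print(meets_slo(99_800, 100_000))  # False (99.80 % were under 500 ms)
```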

Service Level Agreement (SLA)

An SLA is the legal contract you make with paying users. If you fail, users can demand refunds or penalties.

Example 1: “If checkout API availability drops below 99.5 % in a month, we refund the transaction fee.”
Example 2: “If social‑media feed errors exceed 0.5 % for the month, we compensate premium users.”

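SLAs are the adult version of your SLOs: now lawyers are watching. Notice that the SLA target (99.5 %) is looser than the SLO target (99.9 %), so you break your own promise well before you break the contract. Your internal metrics (SLIs, SLOs) are tools to avoid SLA violations.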
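To make the gap between an SLA and an SLO concrete, this sketch converts an availability percentage into allowed downtime per month. It assumes availability is measured by time over a 30-day month; the article does not say how availability is measured, so treat that as an illustrative choice.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumed billing period)


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Minutes of downtime permitted in the period at the given availability target."""
    return period_minutes * (100.0 - availability_pct) / 100.0


def sla_breached(downtime_minutes, sla_pct=99.5):
    """True if downtime pushed availability below the SLA target."""
    return downtime_minutes > allowed_downtime_minutes(sla_pct)


print(allowed_downtime_minutes(99.5))            # 216.0 minutes under the SLA
print(round(allowed_downtime_minutes(99.9), 1))  # 43.2 minutes under the SLO
print(sla_breached(100))  # False: the SLO is blown, but the SLA still holds
print(sla_breached(300))  # True: refunds are due
```

The space between 43.2 and 216 minutes is the safety margin: an internal alarm fires long before a customer can claim compensation.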

Error Budgets

Every SLO comes with an error budget.

Example: An SLO of 99.9 % of checkout requests < 500 ms gives you a 0.1 % error budget. That 0.1 % is the amount of failure you can tolerate before you’re in trouble.
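The arithmetic is easier to see with request counts. A minimal sketch, assuming a window of one million checkout requests (a made-up figure) and the 99.9 % target from the example:

```python
def error_budget(total_requests, slo_target=0.999):
    """Number of requests allowed to miss the objective in the window.

    round() absorbs floating-point noise in (1 - slo_target).
    """
    return round(total_requests * (1 - slo_target))


def budget_remaining(total_requests, bad_requests, slo_target=0.999):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(total_requests, slo_target)
    if budget == 0:
        return 0.0  # assumption: a zero-sized budget has nothing left to spend
    return (budget - bad_requests) / budget


print(error_budget(1_000_000))             # 1000 slow or failed requests allowed
print(budget_remaining(1_000_000, 250))    # 0.75, three quarters of the budget left
print(budget_remaining(1_000_000, 1_200))  # -0.2, budget overspent
```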

Error budgets are decision‑making tools:

Burned through your error budget? Stop risky deployments and focus on stability.
Well within your error budget? Push new features and take calculated risks.

Error budgets let you balance velocity with reliability, turning firefighting into smart deployment decisions.
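The two rules above can be turned into a small policy function. This is a sketch only: the 25 % caution threshold and the middle "caution" state are assumptions added for illustration, and a real team would pick its own cutoffs.

```python
def deployment_policy(budget_remaining_fraction, caution_threshold=0.25):
    """Map the unspent share of the error budget to a deployment stance.

    budget_remaining_fraction: 1.0 means untouched, 0.0 or less means exhausted.
    caution_threshold: hypothetical cutoff below which releases slow down.
    """
    if budget_remaining_fraction <= 0:
        return "freeze"   # budget exhausted: stop risky deployments, fix stability
    if budget_remaining_fraction < caution_threshold:
        return "caution"  # nearly spent: ship only low-risk changes
    return "ship"         # well within budget: push features, take calculated risks


print(deployment_policy(0.75))  # ship
print(deployment_policy(0.10))  # caution
print(deployment_policy(-0.2))  # freeze
```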

Core Truths

Concept	Meaning
SLI	How messed up is it right now?
SLO	How messed up is okay?
Error Budget	How much failure can I tolerate before flipping out?
SLA	How much messing around can get me sued?

Why You Give a Damn as an SRE

Measure first, fix second.
Focus on user‑visible metrics. A CPU spike matters only if users feel it; latency and error rates are what they feel.
Accept failures. Systems break, but an error budget lets you survive and deploy fast.
Automate prevention. Repeating firefighting is for suckers.
