Incident Response and Blameless Postmortems: Writing Better Runbooks and SLO/SLI Definitions

Published: 3 months ago (February 4, 2026, 02:02 AM GMT+9)

7 min read

Source: Dev.to

What We Learned

How to define SLOs that actually matter
How to write runbooks that get used
How to run incidents without chaos
How to run blameless postmortems that prevent recurrence

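SLOs and SLIs

Most teams either have no SLOs or have "fake" SLOs: numbers pulled out of thin air that connect to neither user experience nor engineering decisions. A good SLO changes how work gets prioritized:

Healthy error budget → ship features
Burning error budget → focus on reliability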
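
To make the error budget idea concrete, here is a minimal Python sketch of how the remaining budget could be derived from request counts. The function names and the 50% "healthy" cut-off are assumptions made for illustration; they are not taken from the article.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is a ratio such as 0.999; good and total are request counts
    observed over the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic means nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def priority(remaining: float, healthy_above: float = 0.5) -> str:
    """Map the remaining budget to a priority (the cut-off is hypothetical)."""
    return "ship features" if remaining > healthy_above else "focus on reliability"


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, good=999_800, total=1_000_000)
    print(f"{remaining:.0%} of the budget left -> {priority(remaining)}")
```

A 99.9% target over one million requests allows roughly 1,000 failures, so 200 failures leave about 80% of the budget unspent.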

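Service level indicators (SLIs) should reflect user experience, not just server health.

Bad SLIs

CPU utilization
Memory usage
Number of running pods

Good SLIs

Request success rate (non-5xx / total)
p99 request latency
Data freshness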

Example Prometheus Queries

# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

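# Latency SLI (share of requests served within 300 ms)
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# p99 latency, for comparison against a latency target
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))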
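
A latency SLI can be stated either as the share of requests under a threshold or as a percentile, and the two are easy to confuse. The Python sketch below computes both from cumulative histogram buckets. It follows the same linear-interpolation idea that Prometheus uses for `histogram_quantile`, but it is a simplified illustration, and the bucket values are made up for the example.

```python
import math


def fraction_within(threshold: float, buckets: list[tuple[float, float]]) -> float:
    """Share of requests at or below `threshold`, which must be a bucket bound."""
    total = buckets[-1][1]
    for bound, count in buckets:
        if bound == threshold:
            return count / total
    raise ValueError("threshold is not a bucket boundary")


def quantile_from_buckets(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # fall back to the highest finite bound
            # Interpolate linearly inside the bucket that contains the rank.
            span = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * span
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts: (upper bound in seconds, count).
BUCKETS = [(0.1, 900.0), (0.3, 980.0), (1.0, 995.0), (math.inf, 1000.0)]

if __name__ == "__main__":
    print(f"within 300 ms: {fraction_within(0.3, BUCKETS):.1%}")
    print(f"estimated p99: {quantile_from_buckets(0.99, BUCKETS):.3f} s")
```

With these sample buckets, 98% of requests finish within 300 ms while the estimated p99 is well above that threshold, which is why the two formulations need separate targets.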

Tip: Start with a relaxed target and tighten it over time. Users care about end-to-end journeys, not individual services.

Journey-Based SLO Definition (YAML)

journeys:
  - name: checkout
    slo:
      availability: 99.95%
      latency_p99: 3s
    components:
      - api-gateway
      - auth-service
      - cart-service
      - inventory-service
      - payments-service
      - order-service
    measurement:
      endpoint: /api/v1/checkout/health
      rum_event: checkout_completed
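
One consequence of journey-level SLOs is that a journey crossing several services can be no more available than the product of its parts. The Python sketch below illustrates this under two simplifying assumptions that are not from the article: every component sits in series on the request path, and component failures are independent.

```python
import math


def journey_availability(component_availabilities: list[float]) -> float:
    """Availability of a journey whose components are all in series."""
    return math.prod(component_availabilities)


def required_component_availability(journey_target: float, n_components: int) -> float:
    """Per-component availability needed if all components share the load equally."""
    return journey_target ** (1.0 / n_components)


if __name__ == "__main__":
    n = 6  # the checkout journey above lists six components
    print(f"six components at 99.95% each: {journey_availability([0.9995] * n):.4%}")
    print(f"needed per component for 99.95%: {required_component_availability(0.9995, n):.4%}")
```

Under these assumptions, six components that each meet 99.95% yield a journey at roughly 99.70%, so the per-component targets have to be stricter than the journey target.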

Error Budget Threshold Actions

thresholds:
  - budget_remaining: 50%
    actions:
      - notify: slack
  - budget_remaining: 25%
    actions:
      - freeze: non_critical_deployments
  - budget_remaining: 10%
    actions:
      - freeze: all_deployments
      - meeting: reliability_review
  - budget_remaining: 0%
    actions:
      - focus: reliability_only
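
The threshold policy above could be evaluated with a small lookup like the following Python sketch. It assumes a threshold is crossed when the remaining budget is at or below its value, and that only the most severe crossed threshold applies; the article does not specify either rule.

```python
# Thresholds from the policy above as (budget_remaining_pct, actions).
THRESHOLDS = [
    (50.0, ["notify: slack"]),
    (25.0, ["freeze: non_critical_deployments"]),
    (10.0, ["freeze: all_deployments", "meeting: reliability_review"]),
    (0.0, ["focus: reliability_only"]),
]


def actions_for(budget_remaining_pct: float) -> list[str]:
    """Return the actions of the most severe threshold that has been crossed."""
    # Check the lowest limits first so the strictest matching rule wins.
    for limit, actions in sorted(THRESHOLDS, key=lambda t: t[0]):
        if budget_remaining_pct <= limit:
            return list(actions)
    return []  # budget is healthy, nothing to do


if __name__ == "__main__":
    for pct in (60, 40, 12, 5, 0):
        print(f"{pct:>3}% remaining -> {actions_for(pct)}")
```

If the actions should accumulate instead (for example, keep notifying Slack while deployments are frozen), collect the actions of every crossed threshold rather than returning on the first match.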

Example SLO Configuration

slos:
  - name: requests-availability
    objective: 99.95
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
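
Dividing the error query by the total query gives an error ratio, and comparing that ratio with the budget implied by the objective gives a burn rate. The Python sketch below shows the arithmetic. The 30-day window is an assumption for illustration, since the configuration leaves the window as a template variable.

```python
def burn_rate(error_ratio: float, objective_pct: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - objective_pct / 100.0
    if budget <= 0.0:
        raise ValueError("objective must be below 100")
    return error_ratio / budget


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent at a constant burn rate."""
    if rate <= 0.0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.001, objective_pct=99.95)
    print(f"burn rate {rate:.1f}x, budget gone in {days_until_exhausted(rate):.0f} days")
```

With a 99.95% objective the budget is 0.05% of requests, so a sustained 0.1% error ratio burns the budget twice as fast as allowed.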

Writing Effective Runbooks

A good runbook is:

Scannable – quick to skim even under pressure
Actionable – clear next steps
Tested – validated in staging or in drills

Example Runbook: Payments API High Error Rate

Detection

Alert: PaymentsAPIHighErrorRate

Step 1: Check provider status

curl -s https://status.stripe.com/api/v2/summary.json

Step 2: Review recent deployments

kubectl rollout history deployment/payments-api

Step 3: Roll back if needed

kubectl rollout undo deployment/payments-api

Step 4: Inspect the DB connection pool

curl -s http://payments-api/debug/metrics | grep db_pool
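
The pool check in the last step can also be scripted. The Python sketch below parses metrics text and flags a nearly exhausted pool. It assumes the endpoint returns simple `name value` lines without timestamps, and `db_pool_in_use` is a hypothetical metric name; `db_pool_available` and the threshold of 5 are borrowed from the PHP test in this article.

```python
def parse_db_pool_metrics(text: str) -> dict[str, float]:
    """Extract db_pool* metrics from plain 'name value' metrics text."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if name.startswith("db_pool"):
            metrics[name] = float(value)
    return metrics


def pool_nearly_exhausted(metrics: dict[str, float], min_available: float = 5) -> bool:
    """True when fewer than `min_available` connections are free."""
    return metrics.get("db_pool_available", 0.0) < min_available


# Sample payload made up for the example.
SAMPLE = """\
# HELP db_pool_available Free connections in the pool
db_pool_available 3
db_pool_in_use 47
http_requests_total 1200
"""

if __name__ == "__main__":
    pool = parse_db_pool_metrics(SAMPLE)
    print(pool, "exhausted" if pool_nearly_exhausted(pool) else "ok")
```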

Automated Test (PHP)

public function testDatabaseConnectionExhaustionRunbook(): void
{
    // Simulate connection exhaustion
    $this->simulateDbPoolExhaustion();

    // Verify alert condition
    $metrics = $this->fetchMetrics('/debug/metrics');
    $this->assertLessThan(5, $metrics['db_pool_available']);

    // Apply mitigation
    $this->scaleServiceReplicas(10);

    // Verify recovery
    $this->assertTrue($this->serviceRecovered());
}

Incident Roles

Role	Responsibility
Incident Commander	Coordinates the response
Tech Lead	Leads the debugging effort
Comms Lead	Handles communication with stakeholders

Severity Levels

SEV1 – Complete outage or data loss
SEV2 – Major degradation
SEV3 – Minor impact

Real Incident Example

🔴 INCIDENT: Payment errors
Severity: SEV2
Impact: 82% success rate

Role	Owner
IC	@alice
Tech	@bob
Comms	@carol

Timeline

14:32 – Alert fired
14:40 – Stripe returns 503 responses
14:45 – Circuit breaker trips
15:15 – Resolved
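
Timeline entries like these are the raw input for duration metrics such as time to resolution. The Python sketch below computes the elapsed minutes between entries, assuming all timestamps fall on the same day; the English event labels are paraphrased for illustration.

```python
from datetime import datetime

# Timeline from the incident above as (HH:MM, event).
TIMELINE = [
    ("14:32", "Alert fired"),
    ("14:40", "Stripe returns 503 responses"),
    ("14:45", "Circuit breaker trips"),
    ("15:15", "Resolved"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_since_alert(timeline: list[tuple[str, str]]) -> list[tuple[int, str]]:
    """Minutes from the first entry to each entry in the timeline."""
    first = timeline[0][0]
    return [(minutes_between(first, ts), event) for ts, event in timeline]


if __name__ == "__main__":
    for minutes, event in elapsed_since_alert(TIMELINE):
        print(f"+{minutes:>2} min  {event}")
```

The gap between the first and last entries is 43 minutes, which matches the degradation duration reported in the postmortem.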

Incident Bot (PHP)

class IncidentBot
{
    public function declareIncident(array $data): Incident
    {
        $incident = Incident::create([
            'title'    => $data['title'],
            'severity' => $data['severity'],
            'status'   => 'investigating',
        ]);

        $this->createSlackChannel($incident);
        $this->notifyPagerDuty($incident);

        return $incident;
    }

    public function resolveIncident(Incident $incident): void
    {
        $incident->update(['status' => 'resolved']);
        $this->schedulePostmortem($incident);
    }
}

Postmortem Summary

Summary – Checkout was degraded for 43 minutes.

Root cause – The circuit breaker threshold was set too high.

Action Items

Action	Owner	Due date
Lower the threshold	@bob	Jan 22
Add alerting	@alice	Jan 23

Action Item Tracker (PHP)

class ActionItemTracker
{
    public function weeklyDigest(): void
    {
        $overdue = ActionItem::overdue()->get()->groupBy('owner');

        foreach ($overdue as $owner => $items) {
            $this->notifyOwner($owner, $items);
        }
    }
}
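
The grouping logic behind such a digest is independent of any framework. The Python sketch below shows the same idea with plain data classes. The `ActionItem` fields are hypothetical, and the year 2026 is an assumption, since the action item table gives only month and day.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    title: str
    owner: str
    due: date
    done: bool = False


def overdue_by_owner(items: list[ActionItem], today: date) -> dict[str, list[ActionItem]]:
    """Group open items whose due date has passed by their owner."""
    grouped: dict[str, list[ActionItem]] = defaultdict(list)
    for item in items:
        if not item.done and item.due < today:
            grouped[item.owner].append(item)
    return dict(grouped)


# Items from the postmortem table; the year is assumed for illustration.
ITEMS = [
    ActionItem("Lower the threshold", "@bob", date(2026, 1, 22)),
    ActionItem("Add alerting", "@alice", date(2026, 1, 23)),
]

if __name__ == "__main__":
    for owner, items in overdue_by_owner(ITEMS, today=date(2026, 1, 23)).items():
        print(owner, [item.title for item in items])
```

An item counts as overdue only after its due date has passed, so an item due today is not yet included.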

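Before and After Metrics

Metric	Before	After
MTTR	4 hours	35 minutes
Repeat incidents	4 per quarter	1 per quarter
Error budget remaining	12%	58%

The Reliability Discipline

SLOs tell you when something is wrong
Runbooks help you fix it
Incident roles prevent chaos
Postmortems prevent recurrence

Key Takeaways

Define SLOs around user journeys, not individual services.
Use error budgets to guide prioritization decisions.
Write runbooks that are scannable, actionable, and regularly tested.
Keep incidents structured, with clear roles and communication channels.
Run blameless postmortems and track action items consistently.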