The Incident Commander Role: Running Incidents Without Chaos

Published: 5 hours ago (April 21, 2026, 4:33 PM GMT+9)

5 min read

Original source: Dev.to

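Everyone Debugging, Nobody Leading

Five engineers are in the incident channel, each debugging independently. There is no coordination at all. Three of them are checking the same dashboard, and two are attempting fixes that conflict with each other. Customers are waiting.

This is what an incident looks like without an Incident Commander (IC). The IC does not debug; the IC coordinates.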

Incident Commander (IC) Responsibilities

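Declare the incident severity
Assign roles (debugger, communicator, scribe)
Coordinate the investigation flow
Make decisions (rollback? escalate? wait?)
Manage communication (status page, stakeholders)
Call in help when needed
Declare the all-clear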

What the IC Does Not Do

Write code
Run queries
SSH into servers
Debug the problem

Incident Response Workflow

Acknowledge the page
Open an incident channel: #inc-YYYY-MM-DD-description

Post the severity declaration

I'm IC for this incident.
Severity: P1 - Customer-facing checkout is down
Impact: ~30% of checkout attempts failing

Roles:
- @alice: Primary debugger
- @bob: Comms (status page + Slack updates)
- @charlie: Scribe (timeline)

First actions:
- @alice: Check last deploy and error logs
- @bob: Post initial status page update
- I'll update every 10 minutes.
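
The channel name and the opening message follow a fixed pattern, so they can be generated from a few fields instead of typed under pressure. The following is a minimal sketch, not part of the original post: the function names `incident_channel_name` and `severity_declaration` and their parameters are hypothetical, and the slug rules (lowercase, hyphen-separated) are an assumption about how the `description` part of the channel name is formed.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str) -> str:
    """Build a name following the #inc-YYYY-MM-DD-description pattern."""
    # Assumption: the description is lowercased and every run of
    # non-alphanumeric characters becomes a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def severity_declaration(severity: str, summary: str, impact: str,
                         roles: dict, update_minutes: int = 10) -> str:
    """Render the IC's opening message from structured fields."""
    lines = [
        "I'm IC for this incident.",
        f"Severity: {severity} - {summary}",
        f"Impact: {impact}",
        "",
        "Roles:",
    ]
    # roles maps a chat handle to the role that person takes.
    lines += [f"- @{handle}: {role}" for handle, role in roles.items()]
    lines += ["", f"I'll update every {update_minutes} minutes."]
    return "\n".join(lines)


print(incident_channel_name(date(2026, 4, 21), "Checkout down"))
print(severity_declaration(
    "P1",
    "Customer-facing checkout is down",
    "~30% of checkout attempts failing",
    {"alice": "Primary debugger", "bob": "Comms", "charlie": "Scribe"},
))
```

Generating the message keeps the severity, impact, and role lines in the same order every time, which makes later incidents easier to scan.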

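Structured investigation loop (every 5 minutes)

“@alice, what have you found?”
Synthesize the information
Decide on the next action
Assign the next tasks
Update the channel: “Current hypothesis: [X]. Testing: [Y].”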

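def ic_decision_tree(situation):
    """Return the IC's next action for the current incident state."""
    # Known root cause: ship the fix if one exists, otherwise roll back.
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        else:
            return "Rollback to last known good"

    # Stalled investigation (duration presumably in minutes): bring in help.
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"

    # Growing customer impact: raise severity and turn on the fallback.
    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"

    # Otherwise keep investigating and report at the next checkpoint.
    return "Continue investigation, update in 5 min"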
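
To try the decision tree, it needs an object that exposes the attributes it reads. The sketch below assumes a hypothetical `Situation` dataclass (not in the original post) and assumes `duration` is measured in minutes. The function is repeated in compact form only so the snippet runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical container for the fields the decision tree inspects."""
    root_cause_known: bool = False
    fix_available: bool = False
    duration: int = 0  # assumed to be minutes since the incident started
    making_progress: bool = True
    customer_impact_growing: bool = False


def ic_decision_tree(situation):
    # Same logic as the function above, repeated for a runnable example.
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        return "Rollback to last known good"
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"
    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"
    return "Continue investigation, update in 5 min"


# Twenty minutes in, no root cause, and the team is stuck.
stalled = Situation(duration=20, making_progress=False)
print(ic_decision_tree(stalled))

# Root cause found, but no fix is ready yet.
no_fix = Situation(root_cause_known=True)
print(ic_decision_tree(no_fix))
```

Note the ordering: a stalled investigation is checked before growing customer impact, so a situation that is both stalled and worsening returns the expertise escalation first.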

Pre-written Templates

Internal update

format: |
  **Incident Update [{severity}] {time} UTC**
  Status: {investigating|identified|monitoring|resolved}
  Impact: {impact_description}
  Current action: {what_we_are_doing}
  Next update: {time_of_next_update}

Status page update

format: |
  We are {status} an issue affecting {service}.
  Some users may experience {symptom}.
  Our team is actively working on a resolution.
  Next update in {minutes} minutes.

Executive escalation

format: |
  P1 Incident: {title}
  Duration: {duration} minutes
  Customer impact: {impact}
  Revenue impact: ~${revenue}/hour
  Current status: {status}
  ETA to resolution: {eta}
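
Templates like these can be filled in with plain string formatting. The sketch below renders the internal update and is only an illustration: `render_internal_update` and `ALLOWED_STATUSES` are hypothetical names, and the `{investigating|identified|monitoring|resolved}` placeholder is replaced by a single `{status}` field that is validated against those four values, which is an assumption about how that line is meant to be used.

```python
ALLOWED_STATUSES = ("investigating", "identified", "monitoring", "resolved")

INTERNAL_UPDATE = (
    "**Incident Update [{severity}] {time} UTC**\n"
    "Status: {status}\n"
    "Impact: {impact_description}\n"
    "Current action: {what_we_are_doing}\n"
    "Next update: {time_of_next_update}"
)


def render_internal_update(**fields: str) -> str:
    """Fill the internal update template, rejecting unknown statuses."""
    status = fields.get("status")
    if status not in ALLOWED_STATUSES:
        raise ValueError(
            f"status must be one of {ALLOWED_STATUSES}, got {status!r}"
        )
    # str.format raises KeyError when a placeholder has no value, so an
    # incomplete update fails instead of being posted with gaps.
    return INTERNAL_UPDATE.format(**fields)


print(render_internal_update(
    severity="P1",
    time="07:45",
    status="identified",
    impact_description="~30% of checkout attempts failing",
    what_we_are_doing="Rolling back the last deploy",
    time_of_next_update="07:55 UTC",
))
```

Failing on a missing or invalid field is a deliberate choice here: an update with a blank impact line tends to cause more confusion than a short delay.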

Training the ICs (Game Days)

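Week 1: Shadow an experienced IC during a game day
Week 2: Act as IC for a simulated P2 incident (game day)
Week 3: Act as IC for a simulated P1 incident (game day)
Week 4: Act as IC for a real P3/P4 incident with a mentor observing
Week 5+: Join the IC rotation for all severities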

IC Rotation

ic_rotation:
  schedule: weekly
  pool_size: 6  # Minimum for sustainable rotation
  requirements:
    - Completed IC training program
    - At least 6 months on the team
    - Shadowed 3+ real incidents
  compensation:
    - Same as on‑call compensation
    - IC counts as on‑call time
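
The rotation rules above can be checked mechanically before a schedule is published. The following is a rough sketch under stated assumptions: the `Engineer` record, `is_eligible`, and `weekly_rotation` are hypothetical names, and a simple round-robin over the eligible pool stands in for whatever scheduling tool a team actually uses.

```python
from dataclasses import dataclass
from itertools import cycle, islice

MIN_POOL_SIZE = 6  # pool_size from the rotation config


@dataclass
class Engineer:
    handle: str
    completed_training: bool
    months_on_team: int
    incidents_shadowed: int


def is_eligible(engineer: Engineer) -> bool:
    """Apply the three requirements listed in the rotation config."""
    return (
        engineer.completed_training
        and engineer.months_on_team >= 6
        and engineer.incidents_shadowed >= 3
    )


def weekly_rotation(engineers: list, weeks: int) -> list:
    """Return one IC handle per week, cycling through the eligible pool."""
    pool = [e.handle for e in engineers if is_eligible(e)]
    if len(pool) < MIN_POOL_SIZE:
        raise ValueError(
            f"only {len(pool)} eligible ICs, need at least {MIN_POOL_SIZE}"
        )
    return list(islice(cycle(pool), weeks))


team = [Engineer(f"eng{i}", True, 12, 3) for i in range(6)]
team.append(Engineer("newhire", True, 2, 5))  # too new, so skipped
print(weekly_rotation(team, 8))
```

With six eligible engineers, each person is IC roughly one week in six, which is the load the minimum pool size is meant to guarantee.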

Metrics Comparison

Metric	Without IC	With IC
MTTR (P1)	67 min	28 min
Communication gaps	Frequent	Rare
Duplicate work	~40%	~5%
Stakeholder satisfaction	Low	High
Post-mortem quality	Incomplete	Thorough
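
Taking the two numeric rows of the table at face value, the relative improvement is straightforward arithmetic. The helper name `percent_reduction` is hypothetical; the inputs are the figures from the table, with the approximate duplicate-work values treated as exact for the calculation.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from before to after, as a percentage."""
    return (before - after) / before * 100


mttr_drop = percent_reduction(67, 28)      # P1 MTTR in minutes
duplicate_drop = percent_reduction(40, 5)  # share of duplicated work, in %

print(f"MTTR reduction: {mttr_drop:.1f}%")                  # 58.2%
print(f"Duplicate work reduction: {duplicate_drop:.1f}%")   # 87.5%
```

In other words, the table describes P1 incidents resolving about 2.4 times faster (67 / 28) when an IC is running the response.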

Summary

An IC does not shorten incidents by being smarter. Incidents get shorter because someone is actually managing the response.

If you want AI-assisted incident coordination that turns every engineer into an effective IC, take a look at what we are building at Nova AI Ops.