The Incident Commander Role: Handling Incidents Without Chaos

Published: April 21, 2026, 15:33 GMT+8

4 min read

Original article: Dev.to

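Everyone Is Debugging, Nobody Is Leading

Five engineers are in the incident channel, each debugging independently. There is no coordination. Three are looking at the same dashboard, and two are attempting conflicting fixes. Customers are waiting.

This is what an incident looks like without an Incident Commander (IC). The IC does not debug; the IC coordinates.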

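Incident Commander (IC) Responsibilities

Declare the incident severity
Assign roles (debugger, comms, scribe)
Coordinate the investigation workstreams
Make decisions (roll back? escalate? wait?)
Manage communication (status page, stakeholders)
Call for help when needed
Declare the all-clear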

What the IC Does Not Do

Write code
Run queries
SSH into servers
Debug the problem

Incident Response Workflow

Acknowledge the page
Open an incident channel: #inc-YYYY-MM-DD-description

Post the severity declaration

I'm IC for this incident.
Severity: P1 - Customer-facing checkout is down
Impact: ~30% of checkout attempts failing

Roles:
- @alice: Primary debugger
- @bob: Comms (status page + Slack updates)
- @charlie: Scribe (timeline)

First actions:
- @alice: Check last deploy and error logs
- @bob: Post initial status page update
- I'll update every 10 minutes.
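
The channel name and the opening declaration both follow a fixed shape, so they can be generated from a few structured fields. The sketch below is one possible way to do that; `incident_channel_name` and `severity_declaration` are hypothetical helper names, and the slug rule (lowercase, hyphens between words) is an assumption, since the article only gives the `#inc-YYYY-MM-DD-description` pattern.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str) -> str:
    """Build a channel name of the form #inc-YYYY-MM-DD-description."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def severity_declaration(severity: str, summary: str, impact: str,
                         roles: dict[str, str], first_actions: list[str],
                         update_minutes: int = 10) -> str:
    """Render the IC's opening message from structured fields."""
    lines = [
        "I'm IC for this incident.",
        f"Severity: {severity} - {summary}",
        f"Impact: {impact}",
        "",
        "Roles:",
    ]
    lines += [f"- @{handle}: {role}" for handle, role in roles.items()]
    lines += ["", "First actions:"]
    lines += [f"- {action}" for action in first_actions]
    lines.append(f"- I'll update every {update_minutes} minutes.")
    return "\n".join(lines)


channel = incident_channel_name(date(2026, 4, 21), "Checkout down")
message = severity_declaration(
    "P1",
    "Customer-facing checkout is down",
    "~30% of checkout attempts failing",
    roles={
        "alice": "Primary debugger",
        "bob": "Comms (status page + Slack updates)",
        "charlie": "Scribe (timeline)",
    },
    first_actions=[
        "@alice: Check last deploy and error logs",
        "@bob: Post initial status page update",
    ],
)
print(channel)
print(message)
```

Keeping the declaration as a function means the IC only has to fill in the facts under pressure, not remember the layout.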

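Structured investigation loop (every 5 minutes)

"@alice, what have you found?"
Synthesize the information
Decide the next action
Assign the next task
Update the channel: "Current theory: [X]. Testing: [Y]."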

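from dataclasses import dataclass


@dataclass
class Situation:
    root_cause_known: bool
    fix_available: bool
    duration: int  # minutes since declaration (unit assumed)
    making_progress: bool
    customer_impact_growing: bool


def ic_decision_tree(situation: Situation) -> str:
    """Return the IC's next action for the current incident state."""
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        return "Rollback to last known good"

    # Stalled for more than 15 minutes: bring in more people.
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"

    # Impact is spreading: raise severity and enable the fallback.
    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"

    return "Continue investigation, update in 5 min"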
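
The five-minute check-in loop itself can also be sketched in a few lines. The class below is an illustration only: `InvestigationLoop` and its fields are hypothetical names, and the example theory and test strings are made up. It formats the channel update in the "Current theory / Testing" shape, keeps a timestamped history for the scribe, and reports when the next check-in is due.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class InvestigationLoop:
    """Tracks IC check-ins; the 5-minute interval comes from the article."""
    interval: timedelta = timedelta(minutes=5)
    history: list[str] = field(default_factory=list)

    def check_in(self, now: datetime, theory: str, test: str) -> tuple[str, datetime]:
        """Record one round and return (channel update, time of next check-in)."""
        update = f"Current theory: {theory}. Testing: {test}."
        # Timestamped copy for the incident timeline.
        self.history.append(f"{now:%H:%M} {update}")
        return update, now + self.interval


loop = InvestigationLoop()
update, next_due = loop.check_in(
    datetime(2026, 4, 21, 15, 40),
    theory="bad deploy broke the payment client",  # made-up example
    test="rolling back on one canary host",        # made-up example
)
print(update)
print("Next check-in:", next_due.strftime("%H:%M"))
```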

Pre-Written Templates

Internal update

format: |
  **Incident Update [{severity}] {time} UTC**
  Status: {investigating|identified|monitoring|resolved}
  Impact: {impact_description}
  Current action: {what_we_are_doing}
  Next update: {time_of_next_update}
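
One way to use this template in practice is to fill it programmatically and reject any status outside the four listed values. The sketch below assumes the `{investigating|identified|monitoring|resolved}` placeholder means "pick one of these"; `render_internal_update` is a hypothetical helper, not part of any tool named in the article.

```python
from datetime import datetime, timedelta, timezone

# The four states listed on the template's Status line.
VALID_STATUSES = ("investigating", "identified", "monitoring", "resolved")

INTERNAL_UPDATE = (
    "**Incident Update [{severity}] {time} UTC**\n"
    "Status: {status}\n"
    "Impact: {impact_description}\n"
    "Current action: {what_we_are_doing}\n"
    "Next update: {time_of_next_update}"
)


def render_internal_update(severity, status, impact, action, now, next_update):
    """Fill the internal-update template; raise on an unknown status."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}, got {status!r}")
    return INTERNAL_UPDATE.format(
        severity=severity,
        time=now.strftime("%H:%M"),
        status=status,
        impact_description=impact,
        what_we_are_doing=action,
        time_of_next_update=next_update.strftime("%H:%M UTC"),
    )


now = datetime(2026, 4, 21, 7, 40, tzinfo=timezone.utc)
update_text = render_internal_update(
    "P1",
    "investigating",
    "~30% of checkout attempts failing",
    "Checking last deploy and error logs",
    now,
    now + timedelta(minutes=10),
)
print(update_text)
```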

Status page update

format: |
  We are {status} an issue affecting {service}.
  Some users may experience {symptom}.
  Our team is actively working on a resolution.
  Next update in {minutes} minutes.

Executive escalation

format: |
  P1 Incident: {title}
  Duration: {duration} minutes
  Customer impact: {impact}
  Revenue impact: ~${revenue}/hour
  Current status: {status}
  ETA to resolution: {eta}
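
The executive template has two computed fields: elapsed duration and hourly revenue impact. A minimal sketch of filling them is shown below, assuming the duration is measured in whole minutes from the declaration time. `render_exec_escalation` is a hypothetical helper, and the revenue figure, title, and ETA in the example are invented purely for illustration.

```python
from datetime import datetime

EXEC_ESCALATION = (
    "P1 Incident: {title}\n"
    "Duration: {duration} minutes\n"
    "Customer impact: {impact}\n"
    "Revenue impact: ~${revenue}/hour\n"
    "Current status: {status}\n"
    "ETA to resolution: {eta}"
)


def render_exec_escalation(title, started_at, now, impact,
                           revenue_per_hour, status, eta):
    """Fill the executive template, deriving duration from the start time."""
    # Whole minutes elapsed since the incident was declared.
    duration = int((now - started_at).total_seconds() // 60)
    return EXEC_ESCALATION.format(
        title=title,
        duration=duration,
        impact=impact,
        revenue=f"{revenue_per_hour:,}",  # thousands separator for readability
        status=status,
        eta=eta,
    )


escalation = render_exec_escalation(
    title="Checkout down",
    started_at=datetime(2026, 4, 21, 15, 33),
    now=datetime(2026, 4, 21, 15, 55),
    impact="~30% of checkout attempts failing",
    revenue_per_hour=12000,  # invented figure for the example
    status="identified",
    eta="20 minutes",        # invented figure for the example
)
print(escalation)
```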

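Training Incident Commanders (Game Days)

Week 1: Shadow an experienced IC during a game day
Week 2: Lead a simulated P2 incident (game day)
Week 3: Lead a simulated P1 incident (game day)
Week 4: Lead a real P3/P4 incident with a mentor observing
Week 5+: Join the IC rotation for all severities

IC Rotation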

ic_rotation:
  schedule: weekly
  pool_size: 6  # Minimum for sustainable rotation
  requirements:
    - Completed IC training program
    - At least 6 months on the team
    - Shadowed 3+ real incidents
  compensation:
    - Same as on-call compensation
    - IC counts as on-call time
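
The rotation config can be turned into an eligibility check and a weekly schedule. The sketch below is one possible reading: `Engineer`, `is_eligible`, and `weekly_schedule` are hypothetical names, plain round-robin assignment is an assumption (the config only says `schedule: weekly`), and the people in the example are invented.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Engineer:
    name: str
    trained: bool            # completed the IC training program
    months_on_team: int
    incidents_shadowed: int  # real incidents shadowed


def is_eligible(e: Engineer) -> bool:
    """Apply the three requirements from the rotation config."""
    return e.trained and e.months_on_team >= 6 and e.incidents_shadowed >= 3


def weekly_schedule(pool: list[Engineer], start: date,
                    weeks: int) -> list[tuple[date, str]]:
    """Assign eligible engineers round-robin, one per week (assumed policy)."""
    eligible = [e for e in pool if is_eligible(e)]
    if not eligible:
        raise ValueError("no eligible ICs in the pool")
    return [
        (start + timedelta(weeks=i), eligible[i % len(eligible)].name)
        for i in range(weeks)
    ]


pool = [
    Engineer("alice", trained=True, months_on_team=18, incidents_shadowed=5),
    Engineer("bob", trained=True, months_on_team=9, incidents_shadowed=3),
    Engineer("charlie", trained=True, months_on_team=2, incidents_shadowed=4),
]
schedule = weekly_schedule(pool, date(2026, 4, 20), weeks=3)
for week_start, name in schedule:
    print(week_start.isoformat(), name)
```

In this example charlie is skipped because of the six-month requirement. A real pool would need at least six eligible people to match the `pool_size` in the config.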

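Metrics Comparison

Metric	Without IC	With IC
MTTR (P1)	67 min	28 min
Communication gaps	Frequent	Rare
Duplicated work	~40%	~5%
Stakeholder satisfaction	Low	High
Postmortem quality	Incomplete	Thorough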
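
To put the MTTR row in relative terms, the short sketch below computes the percentage reduction from the two figures in the table. It also shows how MTTR could be derived from per-incident durations. The helper names are hypothetical, and the three sample durations are invented to illustrate the calculation, not taken from the article.

```python
from statistics import mean


def mttr_minutes(durations_minutes: list[float]) -> float:
    """Mean time to resolve: the average of per-incident resolution times."""
    return mean(durations_minutes)


def percent_reduction(before: float, after: float) -> float:
    """Relative improvement from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Figures from the table: 67 minutes without an IC, 28 minutes with one.
reduction = round(percent_reduction(67, 28), 1)
print(f"P1 MTTR reduction: {reduction}%")

# Invented sample durations, only to show how an MTTR value is derived.
sample_mttr = mttr_minutes([20, 30, 34])
print(f"Sample MTTR: {sample_mttr} minutes")
```

By this calculation, the table's figures correspond to a P1 MTTR reduction of about 58%.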

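Key Takeaway

An IC does not end incidents faster by being smarter. Incidents end faster because someone is actually managing the response.

If you want AI-assisted incident coordination that helps every engineer become an effective IC, check out what we're building at Nova AI Ops.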