The Incident Commander Role: Running Incidents Without Chaos
Source: Dev.to
Everyone’s Debugging, Nobody’s Leading
Five engineers in an incident channel, all debugging independently. No coordination. Three people checking the same dashboard, two trying conflicting fixes. Customers are waiting.
This is what incidents look like without an Incident Commander (IC). The IC doesn’t debug; they coordinate.
Incident Commander (IC) Responsibilities
- Declare incident severity
- Assign roles (debugger, communicator, scribe)
- Coordinate investigation streams
- Make decisions (rollback? escalate? wait?)
- Manage communication (status page, stakeholders)
- Call for help when needed
- Declare all‑clear
What the IC Does NOT Do
- Write code
- Run queries
- SSH into servers
- Debug the issue
Incident Response Workflow
- Acknowledge the page
- Open incident channel: #inc-YYYY-MM-DD-description
- Post severity declaration
I'm IC for this incident.
Severity: P1 - Customer-facing checkout is down
Impact: ~30% of checkout attempts failing
Roles:
- @alice: Primary debugger
- @bob: Comms (status page + Slack updates)
- @charlie: Scribe (timeline)
First actions:
- @alice: Check last deploy and error logs
- @bob: Post initial status page update
- I'll update every 10 minutes.
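The channel name and opening post follow a fixed shape, so they can be generated rather than typed under pressure. The sketch below is one way to do that in Python; the function names `incident_channel` and `kickoff_message`, their parameters, and the example date are illustrative assumptions, not part of any chat tool's API.

```python
import re
from datetime import date


def incident_channel(day: date, description: str) -> str:
    """Build a channel name in the #inc-YYYY-MM-DD-description pattern."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def kickoff_message(severity, summary, impact, roles, first_actions,
                    update_minutes=10):
    """Render the IC's opening post from structured fields (hypothetical helper)."""
    lines = [
        "I'm IC for this incident.",
        f"Severity: {severity} - {summary}",
        f"Impact: {impact}",
        "Roles:",
    ]
    lines += [f"- {handle}: {role}" for handle, role in roles.items()]
    lines.append("First actions:")
    lines += [f"- {handle}: {action}" for handle, action in first_actions.items()]
    lines.append(f"- I'll update every {update_minutes} minutes.")
    return "\n".join(lines)


# Example date chosen arbitrarily for illustration.
print(incident_channel(date(2024, 1, 15), "Checkout down"))
print(kickoff_message(
    "P1",
    "Customer-facing checkout is down",
    "~30% of checkout attempts failing",
    roles={"@alice": "Primary debugger", "@bob": "Comms", "@charlie": "Scribe"},
    first_actions={"@alice": "Check last deploy and error logs"},
))
```

Keeping roles and first actions as separate mappings makes it harder to post a declaration with an unassigned role.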
Structured Investigation Loop (Every 5 minutes)
- “@alice, what have you found?”
- Synthesize information
- Decide next action
- Assign next task
- Update channel: “Current theory: [X]. Testing: [Y].”
def ic_decision_tree(situation):
    """Return the IC's next action. situation.duration is assumed to be in minutes."""
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        else:
            return "Rollback to last known good"
    # Past 15 minutes with no progress: bring in more people.
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"
    # Impact is spreading: raise severity and enable a fallback.
    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"
    return "Continue investigation, update in 5 min"
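The decision tree reads five attributes from `situation` but never defines that object. A minimal sketch of what it could look like follows; the `Situation` class, its defaults, and the choice to derive `duration` from a start timestamp are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Situation:
    """Hypothetical input for ic_decision_tree; fields mirror the attributes it reads."""
    started_at: datetime
    root_cause_known: bool = False
    fix_available: bool = False
    making_progress: bool = True
    customer_impact_growing: bool = False

    @property
    def duration(self) -> float:
        """Minutes elapsed since the incident started."""
        elapsed = datetime.now(timezone.utc) - self.started_at
        return elapsed.total_seconds() / 60


# An incident that started 20 minutes ago and has stalled.
stalled = Situation(
    started_at=datetime.now(timezone.utc) - timedelta(minutes=20),
    making_progress=False,
)
print(f"{stalled.duration:.0f} minutes elapsed")
```

Passing `stalled` to `ic_decision_tree` would take the "Escalate: bring in additional expertise" branch, since the root cause is unknown, more than 15 minutes have passed, and no progress is being made.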
Pre‑written Templates
Internal Update
format: |
  **Incident Update [{severity}] {time} UTC**
  Status: {investigating|identified|monitoring|resolved}
  Impact: {impact_description}
  Current action: {what_we_are_doing}
  Next update: {time_of_next_update}
Status Page Update
format: |
  We are {status} an issue affecting {service}.
  Some users may experience {symptom}.
  Our team is actively working on a resolution.
  Next update in {minutes} minutes.
Executive Escalation
format: |
  P1 Incident: {title}
  Duration: {duration} minutes
  Customer impact: {impact}
  Revenue impact: ~${revenue}/hour
  Current status: {status}
  ETA to resolution: {eta}
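Templates like these can be filled with plain string formatting. The sketch below renders the internal update with Python's `str.format`; it assumes the `{investigating|identified|monitoring|resolved}` placeholder means "one of these four values" and replaces it with a single `{status}` field. The helper name `render_internal_update` and the example times are hypothetical.

```python
INTERNAL_UPDATE = (
    "**Incident Update [{severity}] {time} UTC**\n"
    "Status: {status}\n"
    "Impact: {impact_description}\n"
    "Current action: {what_we_are_doing}\n"
    "Next update: {time_of_next_update}"
)

# The four states listed in the template's Status line.
ALLOWED_STATUSES = {"investigating", "identified", "monitoring", "resolved"}


def render_internal_update(**fields):
    """Fill the internal update template, rejecting unknown status values."""
    status = fields.get("status")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUSES)}, got {status!r}")
    # str.format raises KeyError if any placeholder is left unfilled.
    return INTERNAL_UPDATE.format(**fields)


print(render_internal_update(
    severity="P1",
    time="14:05",
    status="investigating",
    impact_description="~30% of checkout attempts failing",
    what_we_are_doing="Checking last deploy and error logs",
    time_of_next_update="14:15",
))
```

Failing loudly on a missing field or an unknown status is deliberate: a half-filled update in the channel is worse than a short delay.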
Training the ICs (Game Days)
- Week 1: Shadow an experienced IC during a game day
- Week 2: IC a simulated P2 incident (game day)
- Week 3: IC a simulated P1 incident (game day)
- Week 4: IC a real P3/P4 incident with a mentor observing
- Week 5+: IC rotation for all severities
IC Rotation
ic_rotation:
  schedule: weekly
  pool_size: 6  # Minimum for sustainable rotation
  requirements:
    - Completed IC training program
    - At least 6 months on the team
    - Shadowed 3+ real incidents
  compensation:
    - Same as on‑call compensation
    - IC counts as on‑call time
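A weekly rotation over a fixed pool can be computed directly from the date, with no stored state. The sketch below is one possible approach, not a description of any scheduling tool; `ic_for_week`, the `MIN_POOL_SIZE` constant, and the last three handles in the pool are hypothetical.

```python
from datetime import date

MIN_POOL_SIZE = 6  # mirrors pool_size in the rotation config


def ic_for_week(pool, day):
    """Return the IC for the Monday-to-Sunday week containing `day`."""
    if len(pool) < MIN_POOL_SIZE:
        raise ValueError(f"rotation needs at least {MIN_POOL_SIZE} trained ICs")
    # Ordinal 1 (0001-01-01) is a Monday, so this index increments every Monday
    # and keeps counting across year boundaries.
    week_index = (day.toordinal() - 1) // 7
    return pool[week_index % len(pool)]


pool = ["@alice", "@bob", "@charlie", "@dana", "@erin", "@frank"]
print(ic_for_week(pool, date.today()))
```

Counting weeks from a fixed origin, rather than using the ISO week number, avoids a skipped or repeated turn when the week number resets in January.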
Metrics Comparison
| Metric | Without IC | With IC |
|---|---|---|
| MTTR (P1) | 67 min | 28 min |
| Communication gaps | Frequent | Rare |
| Duplicate work | ~40% | ~5% |
| Stakeholder satisfaction | Low | High |
| Post‑mortem quality | Incomplete | Thorough |
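To put the MTTR row in relative terms, the short calculation below uses only the two figures from the table; the helper name `relative_reduction` is an illustrative choice.

```python
def relative_reduction(before, after):
    """Fraction by which a metric dropped, relative to its starting value."""
    return (before - after) / before


mttr_without_ic = 67  # minutes, from the table
mttr_with_ic = 28     # minutes, from the table

print(f"MTTR reduction: {relative_reduction(mttr_without_ic, mttr_with_ic):.0%}")
# prints "MTTR reduction: 58%"
```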
Takeaway
The IC doesn’t make incidents shorter because they’re smarter; they make incidents shorter because someone is actually managing the response.