The Incident Commander Role: Running Incidents Without Chaos

Published: April 21, 2026 at 03:33 AM EDT
3 min read
Source: Dev.to

Everyone’s Debugging, Nobody’s Leading

Five engineers in an incident channel, all debugging independently. No coordination. Three people checking the same dashboard, two trying conflicting fixes. Customers are waiting.

This is what incidents look like without an Incident Commander (IC). The IC doesn’t debug; they coordinate.

Incident Commander (IC) Responsibilities

  • Declare incident severity
  • Assign roles (debugger, communicator, scribe)
  • Coordinate investigation streams
  • Make decisions (rollback? escalate? wait?)
  • Manage communication (status page, stakeholders)
  • Call for help when needed
  • Declare all‑clear

What the IC Does NOT Do

  • Write code
  • Run queries
  • SSH into servers
  • Debug the issue

Incident Response Workflow

  1. Acknowledge the page

  2. Open incident channel: #inc-YYYY-MM-DD-description

  3. Post severity declaration

    I'm IC for this incident.
    Severity: P1 - Customer-facing checkout is down
    Impact: ~30% of checkout attempts failing
    
    Roles:
    - @alice: Primary debugger
    - @bob: Comms (status page + Slack updates)
    - @charlie: Scribe (timeline)
    
    First actions:
    - @alice: Check last deploy and error logs
    - @bob: Post initial status page update
    - I'll update every 10 minutes.
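The channel naming pattern and the opening post above can be generated rather than typed under pressure. The following is a minimal sketch, not a prescribed tool: the function names `incident_channel_name` and `severity_declaration` are hypothetical, and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) is an assumption.

```python
import re
from datetime import date


def incident_channel_name(day: date, description: str) -> str:
    """Build a channel name in the #inc-YYYY-MM-DD-description pattern."""
    # Assumed slug rule: lowercase, runs of non-alphanumerics become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#inc-{day.isoformat()}-{slug}"


def severity_declaration(severity: str, summary: str, impact: str,
                         roles: dict[str, str]) -> str:
    """Render the IC's opening post. `roles` maps a handle to a role label."""
    lines = [
        "I'm IC for this incident.",
        f"Severity: {severity} - {summary}",
        f"Impact: {impact}",
        "",
        "Roles:",
    ]
    lines += [f"- {handle}: {role}" for handle, role in roles.items()]
    return "\n".join(lines)


print(incident_channel_name(date(2026, 4, 21), "Checkout down"))
print(severity_declaration(
    "P1",
    "Customer-facing checkout is down",
    "~30% of checkout attempts failing",
    {"@alice": "Primary debugger", "@bob": "Comms", "@charlie": "Scribe"},
))
```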

Structured Investigation Loop (Every 5 minutes)

  1. “@alice, what have you found?”
  2. Synthesize information
  3. Decide next action
  4. Assign next task
  5. Update channel: “Current theory: [X]. Testing: [Y].”
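
The decision in step 3 can be sketched as a small decision tree (the Situation fields are illustrative):

from dataclasses import dataclass


@dataclass
class Situation:
    root_cause_known: bool
    fix_available: bool
    duration: int  # minutes since the incident was declared
    making_progress: bool
    customer_impact_growing: bool


def ic_decision_tree(situation: Situation) -> str:
    # A known root cause comes first: fix forward if possible, else roll back.
    if situation.root_cause_known:
        if situation.fix_available:
            return "Deploy fix with canary"
        return "Rollback to last known good"

    # Stalled for more than 15 minutes: bring in more people.
    if situation.duration > 15 and not situation.making_progress:
        return "Escalate: bring in additional expertise"

    # Growing customer impact: raise severity and turn on the fallback path.
    if situation.customer_impact_growing:
        return "Escalate severity + enable fallback"

    return "Continue investigation, update in 5 min"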
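
The 5-minute cadence and the "Current theory / Testing" update are easy to lose track of mid-incident. The sketch below shows one way a scribe or bot might track them. `LoopTracker` and its methods are hypothetical names, and the choice to count check-ins from the incident start time is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

CHECK_IN_INTERVAL = timedelta(minutes=5)  # cadence taken from the loop heading


@dataclass
class LoopTracker:
    started_at: datetime
    timeline: list[str] = field(default_factory=list)

    def record(self, now: datetime, theory: str, testing: str) -> str:
        """Store one loop update in the timeline and return the channel message."""
        message = f"Current theory: {theory}. Testing: {testing}."
        self.timeline.append(f"{now:%H:%M} UTC {message}")
        return message

    def next_check_in(self, now: datetime) -> datetime:
        """Return the next 5-minute mark strictly after `now`, counted from the start."""
        completed = (now - self.started_at) // CHECK_IN_INTERVAL  # whole intervals so far
        return self.started_at + (completed + 1) * CHECK_IN_INTERVAL


start = datetime(2026, 4, 21, 7, 0, tzinfo=timezone.utc)
tracker = LoopTracker(started_at=start)
now = start + timedelta(minutes=7)
print(tracker.record(now, "bad deploy", "rollback on canary"))
print(tracker.next_check_in(now))
```

The timeline list doubles as raw material for the post-mortem, which is the scribe's job in the role assignment above.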

Pre‑written Templates

Internal Update

format: |
  **Incident Update [{severity}] {time} UTC**
  Status: {investigating|identified|monitoring|resolved}
  Impact: {impact_description}
  Current action: {what_we_are_doing}
  Next update: {time_of_next_update}
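
The Status line in this template is a choice among four values, not a free-text field. A minimal sketch of rendering it with that constraint enforced follows. It assumes the choice placeholder is replaced by a single `{status}` field, and `render_internal_update` is a hypothetical helper name.

```python
INTERNAL_UPDATE = (
    "**Incident Update [{severity}] {time} UTC**\n"
    "Status: {status}\n"
    "Impact: {impact_description}\n"
    "Current action: {what_we_are_doing}\n"
    "Next update: {time_of_next_update}"
)

# The four values listed on the template's Status line.
ALLOWED_STATUSES = ("investigating", "identified", "monitoring", "resolved")


def render_internal_update(status: str, **fields: str) -> str:
    """Fill the internal update template, rejecting any status outside the allowed set."""
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"status must be one of {ALLOWED_STATUSES}, got {status!r}")
    return INTERNAL_UPDATE.format(status=status, **fields)


print(render_internal_update(
    "identified",
    severity="P1",
    time="07:40",
    impact_description="~30% of checkout attempts failing",
    what_we_are_doing="Rolling back the last deploy",
    time_of_next_update="07:50 UTC",
))
```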

Status Page Update

format: |
  We are {status} an issue affecting {service}.
  Some users may experience {symptom}.
  Our team is actively working on a resolution.
  Next update in {minutes} minutes.

Executive Escalation

format: |
  P1 Incident: {title}
  Duration: {duration} minutes
  Customer impact: {impact}
  Revenue impact: ~${revenue}/hour
  Current status: {status}
  ETA to resolution: {eta}
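
A pre-written template only helps if every field is filled before it goes out; an executive update with a blank ETA is worse than a late one. The sketch below checks for missing placeholders using the standard library's `string.Formatter`. The helper names `missing_fields` and `render` are hypothetical, and raising on a missing field is one possible policy.

```python
from string import Formatter

EXEC_ESCALATION = (
    "P1 Incident: {title}\n"
    "Duration: {duration} minutes\n"
    "Customer impact: {impact}\n"
    "Revenue impact: ~${revenue}/hour\n"
    "Current status: {status}\n"
    "ETA to resolution: {eta}"
)


def missing_fields(template: str, values: dict) -> list[str]:
    """List placeholders in `template` that `values` does not supply, in template order."""
    names = [name for _, name, _, _ in Formatter().parse(template) if name]
    return [name for name in names if name not in values]


def render(template: str, values: dict) -> str:
    """Fill the template, refusing to render if any placeholder is missing."""
    missing = missing_fields(template, values)
    if missing:
        raise KeyError(f"missing template fields: {missing}")
    return template.format(**values)


# A half-filled draft: the helper reports what still needs an answer.
print(missing_fields(EXEC_ESCALATION, {"title": "Checkout down", "duration": 20}))
```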

Training the ICs (Game Days)

  • Week 1: Shadow an experienced IC during a game day
  • Week 2: IC a simulated P2 incident (game day)
  • Week 3: IC a simulated P1 incident (game day)
  • Week 4: IC a real P3/P4 incident with a mentor observing
  • Week 5+: IC rotation for all severities

IC Rotation

ic_rotation:
  schedule: weekly
  pool_size: 6  # Minimum for sustainable rotation
  requirements:
    - Completed IC training program
    - At least 6 months on the team
    - Shadowed 3+ real incidents
  compensation:
    - Same as on‑call compensation
    - IC counts as on‑call time
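
The rotation config above can be turned into a concrete schedule. This is a sketch under stated assumptions: the `Engineer` fields, the `ROTATION_ANCHOR` date (a Monday chosen arbitrarily), and the handles are all hypothetical. Counting weeks from a fixed anchor, rather than using the ISO week number, avoids a skipped or repeated turn at the year boundary.

```python
from dataclasses import dataclass
from datetime import date

MIN_POOL_SIZE = 6                   # pool_size from the config above
ROTATION_ANCHOR = date(2026, 1, 5)  # hypothetical start of week 0 (a Monday)


@dataclass
class Engineer:
    handle: str
    completed_training: bool
    months_on_team: int
    incidents_shadowed: int


def is_eligible(engineer: Engineer) -> bool:
    """Apply the three requirements listed in the rotation config."""
    return (engineer.completed_training
            and engineer.months_on_team >= 6
            and engineer.incidents_shadowed >= 3)


def ic_for_week(pool: list[str], day: date) -> str:
    """Pick the IC for the week containing `day`, cycling through the pool weekly."""
    if len(pool) < MIN_POOL_SIZE:
        raise ValueError(f"pool has {len(pool)} ICs, need at least {MIN_POOL_SIZE}")
    weeks_since_anchor = (day - ROTATION_ANCHOR).days // 7
    return pool[weeks_since_anchor % len(pool)]


pool = ["@alice", "@bob", "@charlie", "@dana", "@eli", "@fay"]
print(ic_for_week(pool, date(2026, 4, 21)))
```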

Metrics Comparison

Metric                   | Without IC | With IC
-------------------------|------------|---------
MTTR (P1)                | 67 min     | 28 min
Communication gaps       | Frequent   | Rare
Duplicate work           | ~40%       | ~5%
Stakeholder satisfaction | Low        | High
Post‑mortem quality      | Incomplete | Thorough
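
Taking the table's figures at face value, the relative improvement is simple arithmetic. The sketch below shows how MTTR and the reduction might be computed from your own incident records; the function names are hypothetical.

```python
def mttr_minutes(durations: list[float]) -> float:
    """Mean time to resolve: the average of per-incident durations in minutes."""
    return sum(durations) / len(durations)


def percent_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100


# MTTR (P1) row: 67 min without an IC, 28 min with one.
print(round(percent_reduction(67, 28), 1))  # 58.2
# Duplicate work row: ~40% without an IC, ~5% with one.
print(round(percent_reduction(40, 5), 1))   # 87.5
```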

Takeaway

The IC doesn’t make incidents shorter because they’re smarter; they make incidents shorter because someone is actually managing the response.

If you want AI‑assisted incident coordination that makes every engineer an effective IC, check out what we’re building at Nova AI Ops.
