Incident response & blameless post-mortems: writing better runbooks and SLO/SLI definitions

Published: February 3, 2026 at 12:02 PM EST
3 min read
Source: Dev.to

What We Learned

  • How to define SLOs that actually matter
  • How to write runbooks that get used
  • How to run incidents without chaos
  • How to conduct blameless post‑mortems that prevent recurrence

SLOs vs. SLIs

Most teams either have no SLOs or have “fake” ones: numbers plucked out of thin air that connect to neither user experience nor engineering decisions. Good SLOs change how you prioritize work:

  • Healthy error budget → ship features
  • Burning error budget → focus on reliability

A Service Level Indicator (SLI) should reflect user experience, not just server health.

Bad SLIs

  • CPU utilization
  • Memory usage
  • Number of pods running

Good SLIs

  • Request success rate (non‑5xx / total)
  • Request latency at p99
  • Data freshness

Example Prometheus Queries

# Availability SLI
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# Latency SLI: fraction of requests served within 300 ms
# (a threshold-based stand-in for the p99 target)
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

Tip: Start with looser targets and tighten them later. Users care about end‑to‑end journeys, not individual services.
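
The arithmetic behind an error budget is simple enough to sanity-check by hand. A minimal sketch, with illustrative numbers:

```php
<?php
// Allowed downtime ("error budget") implied by an availability target
// over a rolling window. Purely illustrative arithmetic.
function errorBudgetMinutes(float $targetPct, int $windowDays): float
{
    $totalMinutes = $windowDays * 24 * 60;
    return $totalMinutes * (1 - $targetPct / 100);
}

echo errorBudgetMinutes(99.95, 30); // 30 days at 99.95% leaves 21.6 minutes
echo "\n";
echo errorBudgetMinutes(99.9, 30);  // loosening to 99.9% doubles it to 43.2
```

This is why looser initial targets are forgiving: each extra "nine" cuts the budget by an order of magnitude.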

Journey‑Centric SLO Definition (YAML)

journeys:
  - name: checkout
    slo:
      availability: 99.95%
      latency_p99: 3s
    components:
      - api-gateway
      - auth-service
      - cart-service
      - inventory-service
      - payments-service
      - order-service
    measurement:
      endpoint: /api/v1/checkout/health
      rum_event: checkout_completed
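
One consequence of defining the SLO at the journey level: the six components above sit in series, so each component's target must be tighter than the journey target. A quick sketch (assuming independent failures, which is optimistic):

```php
<?php
// Six serial components, each individually hitting 99.99%,
// still miss a 99.95% journey target (assuming independent failures).
$components = array_fill(0, 6, 0.9999);
$journey = array_product($components);

printf("%.5f\n", $journey); // 0.99940 — below the 0.9995 objective
```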

Error‑Budget Threshold Actions

thresholds:
  - budget_remaining: 50%
    actions:
      - notify: slack
  - budget_remaining: 25%
    actions:
      - freeze: non_critical_deployments
  - budget_remaining: 10%
    actions:
      - freeze: all_deployments
      - meeting: reliability_review
  - budget_remaining: 0%
    actions:
      - focus: reliability_only
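
The policy above is easy to encode. A minimal sketch of how a deploy gate might consume it (function name and action strings are hypothetical, not from the article):

```php
<?php
// Map remaining error budget (percent) to the policy's actions.
// Thresholds mirror the YAML above; names are hypothetical.
function actionsForBudget(float $remainingPct): array
{
    if ($remainingPct <= 0)  return ['focus: reliability_only'];
    if ($remainingPct <= 10) return ['freeze: all_deployments', 'meeting: reliability_review'];
    if ($remainingPct <= 25) return ['freeze: non_critical_deployments'];
    if ($remainingPct <= 50) return ['notify: slack'];
    return []; // healthy budget: no restrictions
}

print_r(actionsForBudget(42)); // crossed the 50% line: notify Slack
print_r(actionsForBudget(8));  // freeze everything and convene a review
```

Checking thresholds from most to least severe keeps only the strictest matching rung in effect.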

SLO Configuration Example

slos:
  - name: requests-availability
    objective: 99.95
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))

Writing Effective Runbooks

Good runbooks are:

  • Scannable – easy to skim under pressure
  • Actionable – clear next steps
  • Tested – verified in staging or drills

Sample Runbook: Payments API High Error Rate

Detection

Alert: PaymentsAPIHighErrorRate

Step 1: Check provider status

curl -s https://status.stripe.com/api/v2/summary.json

Step 2: Review recent deploys

kubectl rollout history deployment/payments-api

Step 3: Rollback if needed

kubectl rollout undo deployment/payments-api

Step 4: Inspect DB connection pool

curl http://payments-api/debug/metrics | grep db_pool

Automated Test (PHP)

// PHPUnit-style drill: the simulate/scale helpers are project-specific stubs
public function testDatabaseConnectionExhaustionRunbook(): void
{
    // Simulate connection exhaustion
    $this->simulateDbPoolExhaustion();

    // Verify alert condition
    $metrics = $this->fetchMetrics('/debug/metrics');
    $this->assertLessThan(5, $metrics['db_pool_available']);

    // Apply mitigation
    $this->scaleServiceReplicas(10);

    // Verify recovery
    $this->assertTrue($this->serviceRecovered());
}

Incident Roles

| Role | Responsibility |
| --- | --- |
| Incident Commander | Coordinates the response |
| Tech Lead | Leads debugging effort |
| Comms Lead | Handles stakeholder communication |

Severity levels

  • SEV1 – Complete outage or data loss
  • SEV2 – Major degradation
  • SEV3 – Minor impact

Real Incident Example

🔴 INCIDENT: Checkout Errors
Severity: SEV2
Impact: Success rate at 82%

| Role | Owner |
| --- | --- |
| IC | @alice |
| Tech | @bob |
| Comms | @carol |

Timeline

  • 14:32 – Alert fired
  • 14:40 – Stripe returning 503s
  • 14:45 – Circuit breaker engaged
  • 15:15 – Resolved

Incident Bot (PHP)

class IncidentBot
{
    public function declareIncident(array $data): Incident
    {
        $incident = Incident::create([
            'title'    => $data['title'],
            'severity' => $data['severity'],
            'status'   => 'investigating',
        ]);

        $this->createSlackChannel($incident);
        $this->notifyPagerDuty($incident);

        return $incident;
    }

    public function resolveIncident(Incident $incident): void
    {
        $incident->update(['status' => 'resolved']);
        $this->schedulePostmortem($incident);
    }
}

Post‑mortem Summary

Summary – Checkout degraded for 43 minutes.

Root Cause – Circuit‑breaker threshold set too high.

Action Items

| Action | Owner | Deadline |
| --- | --- | --- |
| Lower threshold | @bob | Jan 22 |
| Add alert | @alice | Jan 23 |

Action‑Item Tracker (PHP)

class ActionItemTracker
{
    public function weeklyDigest(): void
    {
        $overdue = ActionItem::overdue()->get()->groupBy('owner');

        foreach ($overdue as $owner => $items) {
            $this->notifyOwner($owner, $items);
        }
    }
}

Before / After Metrics

| Metric | Before | After |
| --- | --- | --- |
| MTTR | 4 hours | 35 min |
| Repeat incidents | 4/quarter | 1/quarter |
| Error-budget remaining | 12% | 58% |

Reliability as a Discipline

  • SLOs tell you when things are wrong
  • Runbooks help you fix them
  • Incident roles prevent chaos
  • Post‑mortems prevent recurrence

Key Takeaways

  1. Define SLOs around user journeys, not individual services.
  2. Use error budgets to guide prioritization decisions.
  3. Write runbooks that are scannable, actionable, and regularly tested.
  4. Keep incidents structured with clear roles and communication channels.
  5. Conduct blameless post‑mortems and track action items relentlessly.