Incident response & blameless post-mortems: writing better runbooks and SLO/SLI definitions
Source: Dev.to
What We Learned
- How to define SLOs that actually matter
- How to write runbooks that get used
- How to run incidents without chaos
- How to conduct blameless post‑mortems that prevent recurrence
SLOs vs. SLIs
Most teams either have no SLOs or have “fake” ones: numbers plucked from thin air that connect to neither user experience nor engineering decisions. Good SLOs change how you prioritize work:
- Healthy error budget → ship features
- Burning error budget → focus on reliability
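The error budget itself is simple arithmetic: the allowed failure fraction times the measurement window. A quick sketch in Python (illustrative only, the function name is mine, not from any SLO tool):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given availability SLO (%)."""
    return (1 - slo / 100) * window_days * 24 * 60

# A 99.95% SLO over 30 days allows roughly 21.6 minutes of downtime;
# loosening to 99.9% doubles that to 43.2 minutes.
print(round(error_budget_minutes(99.95), 1))  # 21.6
print(round(error_budget_minutes(99.9), 1))   # 43.2
```

This is why "start with looser targets" is cheap advice to follow: each extra nine cuts your budget by 10x.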
A Service Level Indicator (SLI) should reflect user experience, not just server health.
Bad SLIs
- CPU utilization
- Memory usage
- Number of pods running
Good SLIs
- Request success rate (non‑5xx / total)
- Request latency at p99
- Data freshness
Example Prometheus Queries
```promql
# Availability SLI: fraction of non-5xx requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: fraction of requests served under 300 ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# Actual p99 latency, useful for dashboards
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
Tip: Start with looser targets and tighten them later. Users care about end‑to‑end journeys, not individual services.
Journey‑Centric SLO Definition (YAML)
```yaml
journeys:
  - name: checkout
    slo:
      availability: 99.95%
      latency_p99: 3s
    components:
      - api-gateway
      - auth-service
      - cart-service
      - inventory-service
      - payments-service
      - order-service
    measurement:
      endpoint: /api/v1/checkout/health
      rum_event: checkout_completed
```
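One reason to define the SLO at the journey level: serial dependencies compound. If each of the six components above independently hits 99.95%, the end-to-end checkout journey can only reach about 99.70% in the best case. A quick sanity check (illustrative Python, assuming independent components in series):

```python
def journey_availability(component_slos: list[float]) -> float:
    """Best-case availability (%) of a serial chain of components."""
    avail = 1.0
    for slo in component_slos:
        avail *= slo / 100  # each component must succeed
    return avail * 100

# Six components at 99.95% each: the journey tops out near 99.70%.
components = [99.95] * 6  # api-gateway through order-service
print(round(journey_availability(components), 2))  # 99.7
```

So a 99.95% journey SLO quietly demands tighter targets from every component in the chain.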
Error‑Budget Threshold Actions
```yaml
thresholds:
  - budget_remaining: 50%
    actions:
      - notify: slack
  - budget_remaining: 25%
    actions:
      - freeze: non_critical_deployments
  - budget_remaining: 10%
    actions:
      - freeze: all_deployments
      - meeting: reliability_review
  - budget_remaining: 0%
    actions:
      - focus: reliability_only
```
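A policy like the one above is easy to enforce in code rather than by memory. A minimal sketch (the threshold values mirror the YAML; the function and action strings are illustrative):

```python
def actions_for_budget(budget_remaining: float) -> list[str]:
    """Return the escalation actions for the current error-budget level (%)."""
    if budget_remaining <= 0:
        return ["focus: reliability_only"]
    if budget_remaining <= 10:
        return ["freeze: all_deployments", "meeting: reliability_review"]
    if budget_remaining <= 25:
        return ["freeze: non_critical_deployments"]
    if budget_remaining <= 50:
        return ["notify: slack"]
    return []  # healthy budget: keep shipping features

print(actions_for_budget(40))  # ['notify: slack']
print(actions_for_budget(8))   # ['freeze: all_deployments', 'meeting: reliability_review']
```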
SLO Configuration Example
```yaml
slos:
  - name: requests-availability
    objective: 99.95
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
```
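From the error and total queries above you can derive a burn rate: the observed error ratio divided by the budgeted error ratio. A burn rate of 1.0 consumes the budget exactly at the end of the window; sustained higher rates exhaust it proportionally faster. A hedged sketch of the arithmetic (illustrative numbers):

```python
def burn_rate(errors: float, total: float, objective: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    error_ratio = errors / total
    budget_ratio = 1 - objective / 100
    return error_ratio / budget_ratio

# 0.5% errors against a 99.95% objective burns budget 10x too fast,
# exhausting a 30-day budget in about 3 days.
print(round(burn_rate(errors=50, total=10_000, objective=99.95), 2))  # 10.0
```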
Writing Effective Runbooks
Good runbooks are:
- Scannable – easy to skim under pressure
- Actionable – clear next steps
- Tested – verified in staging or drills
Sample Runbook: Payments API High Error Rate
Detection

Alert: `PaymentsAPIHighErrorRate`

Step 1: Check provider status

```bash
curl -s https://status.stripe.com/api/v2/summary.json
```

Step 2: Review recent deploys

```bash
kubectl rollout history deployment/payments-api
```

Step 3: Roll back if needed

```bash
kubectl rollout undo deployment/payments-api
```

Step 4: Inspect DB connection pool

```bash
curl http://payments-api/debug/metrics | grep db_pool
```
Automated Test (PHP)
```php
public function testDatabaseConnectionExhaustionRunbook(): void
{
    // Simulate connection exhaustion
    $this->simulateDbPoolExhaustion();

    // Verify alert condition
    $metrics = $this->fetchMetrics('/debug/metrics');
    $this->assertLessThan(5, $metrics['db_pool_available']);

    // Apply mitigation
    $this->scaleServiceReplicas(10);

    // Verify recovery
    $this->assertTrue($this->serviceRecovered());
}
```
Incident Roles
| Role | Responsibility |
|---|---|
| Incident Commander | Coordinates the response |
| Tech Lead | Leads debugging effort |
| Comms Lead | Handles stakeholder communication |
Severity Levels
- SEV1 – Complete outage or data loss
- SEV2 – Major degradation
- SEV3 – Minor impact
Real Incident Example
🔴 INCIDENT: Checkout Errors
Severity: SEV2
Impact: Checkout success rate down to 82%
| Role | Owner |
|---|---|
| IC | @alice |
| Tech | @bob |
| Comms | @carol |
Timeline
- 14:32 – Alert fired
- 14:40 – Stripe returning 503s
- 14:45 – Circuit breaker engaged
- 15:15 – Resolved
Incident Bot (PHP)
```php
class IncidentBot
{
    public function declareIncident(array $data): Incident
    {
        $incident = Incident::create([
            'title'    => $data['title'],
            'severity' => $data['severity'],
            'status'   => 'investigating',
        ]);

        $this->createSlackChannel($incident);
        $this->notifyPagerDuty($incident);

        return $incident;
    }

    public function resolveIncident(Incident $incident): void
    {
        $incident->update(['status' => 'resolved']);
        $this->schedulePostmortem($incident);
    }
}
```
Post‑mortem Summary
Summary – Checkout degraded for 43 minutes.
Root Cause – Circuit‑breaker threshold set too high.
Action Items
| Action | Owner | Deadline |
|---|---|---|
| Lower threshold | @bob | Jan 22 |
| Add alert | @alice | Jan 23 |
Action‑Item Tracker (PHP)
```php
class ActionItemTracker
{
    public function weeklyDigest(): void
    {
        $overdue = ActionItem::overdue()->get()->groupBy('owner');

        foreach ($overdue as $owner => $items) {
            $this->notifyOwner($owner, $items);
        }
    }
}
```
Before / After Metrics
| Metric | Before | After |
|---|---|---|
| MTTR | 4 hours | 35 min |
| Repeat incidents | 4/quarter | 1/quarter |
| Error‑budget remaining | 12% | 58% |
Reliability as a Discipline
- SLOs tell you when things are wrong
- Runbooks help you fix them
- Incident roles prevent chaos
- Post‑mortems prevent recurrence
Key Takeaways
- Define SLOs around user journeys, not individual services.
- Use error budgets to guide prioritization decisions.
- Write runbooks that are scannable, actionable, and regularly tested.
- Keep incidents structured with clear roles and communication channels.
- Conduct blameless post‑mortems and track action items relentlessly.