Incident response & blameless post-mortems: writing better runbooks and SLO/SLI definitions
Source: Dev.to
What We Learned
- How to define SLOs that actually matter
- How to write runbooks that get used
- How to run incidents without chaos
- How to conduct blameless post‑mortems that prevent recurrence
SLOs vs. SLIs
Most teams either have no SLOs or have “fake” ones: numbers plucked from thin air that connect to neither user experience nor engineering decisions. Good SLOs change how you prioritize work:
- Healthy error budget → ship features
- Burning error budget → focus on reliability
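The error budget itself is simple arithmetic: the allowed failure fraction times the measurement window. A quick sketch in Python (illustrative only, the function name is mine, not from any SLO tool):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given availability SLO (%)."""
    return (1 - slo / 100) * window_days * 24 * 60

# A 99.95% SLO over 30 days allows roughly 21.6 minutes of downtime;
# loosening to 99.9% doubles that to 43.2 minutes.
print(round(error_budget_minutes(99.95), 1))  # 21.6
print(round(error_budget_minutes(99.9), 1))   # 43.2
```

This is why "start with looser targets" is cheap advice to follow: each extra nine cuts your budget by 10x.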
A Service Level Indicator (SLI) should reflect user experience, not just server health.
Bad SLIs
- CPU utilization
- Memory usage
- Number of pods running
Good SLIs
- Request success rate (non‑5xx / total)
- Request latency at p99
- Data freshness
Example Prometheus Queries
```promql
# Availability SLI: fraction of non-5xx requests
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: fraction of requests served under 300 ms
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))

# Actual p99 latency, useful for dashboards
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
Tip: Start with looser targets and tighten them later. Users care about end‑to‑end journeys, not individual services.
Journey‑Centric SLO Definition (YAML)
```yaml
journeys:
  - name: checkout
    slo:
      availability: 99.95%
      latency_p99: 3s
    components:
      - api-gateway
      - auth-service
      - cart-service
      - inventory-service
      - payments-service
      - order-service
    measurement:
      endpoint: /api/v1/checkout/health
      rum_event: checkout_completed
```
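One reason to define the SLO at the journey level: serial dependencies compound. If each of the six components above independently hits 99.95%, the end-to-end checkout journey can only reach about 99.70% in the best case. A quick sanity check (illustrative Python, assuming independent components in series):

```python
def journey_availability(component_slos: list[float]) -> float:
    """Best-case availability (%) of a serial chain of components."""
    avail = 1.0
    for slo in component_slos:
        avail *= slo / 100  # each component must succeed
    return avail * 100

# Six components at 99.95% each: the journey tops out near 99.70%.
components = [99.95] * 6  # api-gateway through order-service
print(round(journey_availability(components), 2))  # 99.7
```

So a 99.95% journey SLO quietly demands tighter targets from every component in the chain.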
Error‑Budget Threshold Actions
```yaml
thresholds:
  - budget_remaining: 50%
    actions:
      - notify: slack
  - budget_remaining: 25%
    actions:
      - freeze: non_critical_deployments
  - budget_remaining: 10%
    actions:
      - freeze: all_deployments
      - meeting: reliability_review
  - budget_remaining: 0%
    actions:
      - focus: reliability_only
```
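A policy like the one above is easy to enforce in code rather than by memory. A minimal sketch (the threshold values mirror the YAML; the function and action strings are illustrative):

```python
def actions_for_budget(budget_remaining: float) -> list[str]:
    """Return the escalation actions for the current error-budget level (%)."""
    if budget_remaining <= 0:
        return ["focus: reliability_only"]
    if budget_remaining <= 10:
        return ["freeze: all_deployments", "meeting: reliability_review"]
    if budget_remaining <= 25:
        return ["freeze: non_critical_deployments"]
    if budget_remaining <= 50:
        return ["notify: slack"]
    return []  # healthy budget: keep shipping features

print(actions_for_budget(40))  # ['notify: slack']
print(actions_for_budget(8))   # ['freeze: all_deployments', 'meeting: reliability_review']
```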
SLO Configuration Example
```yaml
slos:
  - name: requests-availability
    objective: 99.95
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
```
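From the error and total queries above you can derive a burn rate: the observed error ratio divided by the budgeted error ratio. A burn rate of 1.0 consumes the budget exactly at the end of the window; sustained higher rates exhaust it proportionally faster. A hedged sketch of the arithmetic (illustrative numbers):

```python
def burn_rate(errors: float, total: float, objective: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    error_ratio = errors / total
    budget_ratio = 1 - objective / 100
    return error_ratio / budget_ratio

# 0.5% errors against a 99.95% objective burns budget 10x too fast,
# exhausting a 30-day budget in about 3 days.
print(round(burn_rate(errors=50, total=10_000, objective=99.95), 2))  # 10.0
```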
Writing Effective Runbooks
Good runbooks are:
- Scannable – easy to skim under pressure
- Actionable – clear next steps
- Tested – verified in staging or drills
Sample Runbook: Payments API High Error Rate
Detection

Alert: `PaymentsAPIHighErrorRate`

Step 1: Check provider status

```bash
curl -s https://status.stripe.com/api/v2/summary.json
```

Step 2: Review recent deploys

```bash
kubectl rollout history deployment/payments-api
```

Step 3: Roll back if needed

```bash
kubectl rollout undo deployment/payments-api
```

Step 4: Inspect DB connection pool

```bash
curl http://payments-api/debug/metrics | grep db_pool
```
Automated Test (PHP)
```php
public function testDatabaseConnectionExhaustionRunbook(): void
{
    // Simulate connection exhaustion
    $this->simulateDbPoolExhaustion();

    // Verify alert condition
    $metrics = $this->fetchMetrics('/debug/metrics');
    $this->assertLessThan(5, $metrics['db_pool_available']);

    // Apply mitigation
    $this->scaleServiceReplicas(10);

    // Verify recovery
    $this->assertTrue($this->serviceRecovered());
}
```
Incident Roles
| Role | Responsibility |
|---|---|
| Incident Commander | Coordinates the response |
| Tech Lead | Leads debugging effort |
| Comms Lead | Handles stakeholder communication |
Severity Levels
- SEV1 – Complete outage or data loss
- SEV2 – Major degradation
- SEV3 – Minor impact
Real Incident Example
🔴 INCIDENT: Checkout Errors
Severity: SEV2
Impact: Checkout success rate down to 82%
| Role | Owner |
|---|---|
| IC | @alice |
| Tech | @bob |
| Comms | @carol |
Timeline
- 14:32 – Alert fired
- 14:40 – Stripe returning 503s
- 14:45 – Circuit breaker engaged
- 15:15 – Resolved
Incident Bot (PHP)
```php
class IncidentBot
{
    public function declareIncident(array $data): Incident
    {
        $incident = Incident::create([
            'title'    => $data['title'],
            'severity' => $data['severity'],
            'status'   => 'investigating',
        ]);

        $this->createSlackChannel($incident);
        $this->notifyPagerDuty($incident);

        return $incident;
    }

    public function resolveIncident(Incident $incident): void
    {
        $incident->update(['status' => 'resolved']);
        $this->schedulePostmortem($incident);
    }
}
```
Post‑mortem Summary
Summary – Checkout degraded for 43 minutes.
Root Cause – Circuit‑breaker threshold set too high.
Action Items
| Action | Owner | Deadline |
|---|---|---|
| Lower threshold | @bob | Jan 22 |
| Add alert | @alice | Jan 23 |
Action‑Item Tracker (PHP)
```php
class ActionItemTracker
{
    public function weeklyDigest(): void
    {
        $overdue = ActionItem::overdue()->get()->groupBy('owner');

        foreach ($overdue as $owner => $items) {
            $this->notifyOwner($owner, $items);
        }
    }
}
```
Before / After Metrics
| Metric | Before | After |
|---|---|---|
| MTTR | 4 hours | 35 min |
| Repeat incidents | 4/quarter | 1/quarter |
| Error‑budget remaining | 12% | 58% |
Reliability as a Discipline
- SLOs tell you when things are wrong
- Runbooks help you fix them
- Incident roles prevent chaos
- Post‑mortems prevent recurrence
Key Takeaways
- Define SLOs around user journeys, not individual services.
- Use error budgets to guide prioritization decisions.
- Write runbooks that are scannable, actionable, and regularly tested.
- Keep incidents structured with clear roles and communication channels.
- Conduct blameless post‑mortems and track action items relentlessly.