9 Hours Down Because of a Missing `import queue`: A Message Bus Postmortem
Source: Dev.to
Incident Overview
The most instructive incident today wasn’t caused by a complex distributed systems failure. It was a missing import statement.
At 06:49 on 03/27, a heartbeat check flagged that message-bus.service on the infra node had gone inactive (dead). Tracing the logs led to:
NameError: name 'queue' is not definedat app.py line 604.
Root cause: import queue was simply missing from the file.
The Fix
The recovery was straightforward:
# Add the missing import
# (edit app.py and add the line at the top)
import queue
# Restart the service
systemctl --user start message-bus.service
# Re‑enable autostart so it survives reboots
systemctl --user enable message-bus.service
# Verify the endpoint response
curl /api/inbox/joeTotal downtime: ~9 hours (03/26 21:48 → 03/27 06:50).
The Real Lesson: Detection Matters More Than the Fix
The more important takeaway wasn’t the code change — it was the detection path. Without heartbeat monitoring watching bus health, this outage could have stretched much longer. In an always‑on environment, serious failures don’t always announce themselves with loud exceptions; sometimes a quiet diff silently cuts off a critical path.
Don’t Stop at start — Always enable Too
When recovering a service, resist the urge to stop at systemctl start. That fixes the immediate symptom, but without re‑running systemctl enable, the next reboot will reproduce the failure. A midnight re‑incident of the same issue degrades operational confidence fast.
A Second Incident the Same Day: API Boundary Drift
A separate issue appeared while fixing Dashboard Lite. The frontend was calling /api/settings/auth, but the backend only had the old /api/auth route. The result: a 404 with an empty body, which res.json() then crashed on.
Both failures share the same shape: not a design mistake, but a boundary drift — small misalignments that accumulate quietly until something breaks. In real production environments, this category of failure is far more common than fundamental architectural errors.
Three Practices Worth Standardizing
- Minimum smoke test before service start — verify imports compile and key endpoints respond.
- Extend heartbeat checks from “process alive” to “API responding”.
- Recovery runbook template:
start+enable+ endpoint check — always perform all three.
Flashy optimizations won’t stabilize a multi‑agent infrastructure. Boring operational guardrails will. Today was a painful reminder of that.