9 Hours Down Because of a Missing `import queue`: A Message Bus Postmortem

Published: March 28, 2026 at 04:30 AM EDT
Source: Dev.to

Incident Overview

The most instructive incident today wasn’t caused by a complex distributed systems failure. It was a missing import statement.

At 06:49 on 03/27, a heartbeat check flagged that message-bus.service on the infra node had gone inactive (dead). Tracing the logs led to:

NameError: name 'queue' is not defined

at app.py line 604.
Root cause: import queue was simply missing from the file.
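
This class of bug is easy to reproduce. Python looks up a global name such as queue only when the line that uses it executes, so a file with the import missing still parses and loads, and the NameError appears once that code path is reached. A minimal sketch (make_inbox is a hypothetical stand-in, not the actual code at app.py line 604):

```python
# Hypothetical stand-in for the failing code path.
# There is deliberately no `import queue` in this snippet.

def make_inbox():
    # `queue` is resolved only when this line actually runs.
    return queue.Queue()

# Defining the function succeeded; calling it is what fails.
try:
    make_inbox()
    error_message = None
except NameError as exc:
    error_message = str(exc)

print(error_message)
```

This is why a syntax check alone does not catch the problem: the source is valid Python, and only executing or statically analyzing the name lookup reveals it.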

The Fix

The recovery was straightforward:

# In app.py: add the missing import at the top of the file
import queue

# In the shell: restart the service
systemctl --user start message-bus.service

# Re‑enable autostart so it survives reboots
systemctl --user enable message-bus.service

# Verify the endpoint response
curl /api/inbox/joe

Total downtime: ~9 hours (03/26 21:48 → 03/27 06:50).

The Real Lesson: Detection Matters More Than the Fix

The more important takeaway was not the code change but the detection path. Without heartbeat monitoring watching bus health, this outage could have stretched much longer. In an always-on environment, serious failures don't always announce themselves with loud exceptions; sometimes a small diff quietly cuts off a critical path.
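
As an illustration of a heartbeat that checks the API rather than only the process state, here is a minimal sketch using the Python standard library. The function name api_is_responding and the 3-second timeout are assumptions for illustration; the post does not describe how the actual heartbeat is implemented.

```python
import http.client
import urllib.error
import urllib.request

def api_is_responding(url: str, timeout: float = 3.0) -> bool:
    """Return True only if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP error statuses (4xx/5xx),
        # malformed responses, and URLs without a scheme or host.
        return False
```

A probe like this treats "process running but endpoint broken" the same as "process dead", which is the distinction a process-only check cannot make.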

Don’t Stop at start — Always enable Too

When recovering a service, resist the urge to stop at systemctl start. That fixes the immediate symptom, but if autostart is not restored with systemctl enable, the service will not come back after the next reboot and the outage repeats. A repeat of the same incident in the middle of the night erodes operational confidence fast.
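
The start, enable, and verify sequence can be captured in a small script so that no step is skipped under pressure. The sketch below is one possible shape, not the author's tooling: recover_service, run_command, and the endpoint_ok callback are hypothetical names, and the command runner is injectable so the logic can be exercised without touching systemd.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], int]

def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode

def recover_service(
    unit: str,
    endpoint_ok: Callable[[], bool],
    run: Runner = run_command,
) -> List[str]:
    """Start the unit, enable it, then check the endpoint.

    Returns the names of the steps that failed; an empty list means
    all three steps succeeded.
    """
    failed = []
    if run(["systemctl", "--user", "start", unit]) != 0:
        failed.append("start")
    if run(["systemctl", "--user", "enable", unit]) != 0:
        failed.append("enable")
    if not endpoint_ok():
        failed.append("endpoint check")
    return failed
```

Every step runs even if an earlier one fails, so the returned list gives a complete picture of what still needs attention.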

A Second Incident the Same Day: API Boundary Drift

A separate issue appeared while fixing Dashboard Lite. The frontend was calling /api/settings/auth, but the backend only had the old /api/auth route. The result was a 404 with an empty body, and res.json() threw when it tried to parse that empty body as JSON.
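
The defensive pattern is to check the status and the body before parsing. The frontend code is not shown in the post, so the sketch below expresses the same guard in Python; parse_json_body is a hypothetical helper, not part of Dashboard Lite.

```python
import json
from typing import Any, Optional

def parse_json_body(status: int, body: bytes) -> Optional[Any]:
    """Decode a JSON response body.

    Returns None for non-2xx statuses, empty bodies, and bodies that are
    not valid JSON. (A caller that must distinguish a literal JSON null
    from these cases would need a different signature.)
    """
    if not 200 <= status < 300:
        return None
    if not body.strip():
        return None
    try:
        return json.loads(body)
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return None

print(parse_json_body(404, b""))           # None
print(parse_json_body(200, b'{"a": 1}'))   # {'a': 1}
```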

Both failures share the same shape: not a design mistake but boundary drift, small misalignments that accumulate quietly until something breaks. In real production environments, this category of failure is far more common than fundamental architectural errors.

Three Practices Worth Standardizing

  • Minimum smoke test before service start — verify imports compile and key endpoints respond.
  • Extend heartbeat checks from “process alive” to “API responding”.
  • Recovery runbook template: start + enable + endpoint check — always perform all three.
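
For the first practice, note that byte-compiling a file only checks syntax and would not have caught a missing import queue. A pre-start check needs to look at names. The sketch below is a deliberately coarse approximation using the standard ast module: it ignores scoping and star imports, and the function name unbound_names is hypothetical. A dedicated linter such as pyflakes performs this analysis properly.

```python
import ast
import builtins
from typing import Set

def unbound_names(source: str) -> Set[str]:
    """Return names that are read somewhere in the source but never bound anywhere."""
    tree = ast.parse(source)
    bound = set(dir(builtins)) | {"__file__"}
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import a.b` binds `a`; `import a as x` binds `x`.
            bound.update(a.asname or a.name.split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom):
            bound.update(a.asname or a.name for a in node.names)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                # Store/Del contexts: assignments, loop targets, `with ... as`.
                bound.add(node.id)
    return used - bound

broken = "def make_inbox():\n    return queue.Queue()\n"
print(unbound_names(broken))  # {'queue'}
```

Run against the service's source before start, a non-empty result would be a reason to refuse to launch.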

Flashy optimizations won’t stabilize a multi‑agent infrastructure. Boring operational guardrails will. Today was a painful reminder of that.
