We Replaced Message Buses with Telemetry for AI Agent Coordination
Source: Dev.to
Challenges of Traditional Message Buses
- Coordination overhead – explicit message passing requires careful protocol design.
- Debugging nightmares – failures must be pieced together from messages scattered across agents.
- Scaling issues – more agents → rapidly growing routing logic (point‑to‑point routes grow roughly quadratically with the number of agents).
- State management – keeping agents synchronized demands complex state machines.
These problems became a bottleneck in our AI‑powered development workflows.
Telemetry as Coordination Mechanism
Instead of sending explicit messages, each agent emits structured telemetry to a shared observability backend (e.g., SigNoz/ClickHouse). Agents then query that telemetry to discover what other agents have done and decide their next actions.
This shifts coordination from a push model to a stigmergic model: agents react to traces left in the environment rather than to direct messages.
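To make the stigmergic idea concrete, here is a minimal sketch in which a reviewer agent decides its next action only from traces another agent left behind. `SharedTraceStore`, `reviewer_next_action`, and the `review-agent-1` and `code_review` names are hypothetical stand-ins for illustration; an in-memory list replaces the real observability backend.

```python
from dataclasses import dataclass, field


@dataclass
class SharedTraceStore:
    """Hypothetical in-memory stand-in for the shared observability backend."""
    spans: list = field(default_factory=list)

    def emit(self, agent_id: str, task_type: str, status: str) -> None:
        # Record a trace in the shared environment; no recipient is addressed.
        self.spans.append(
            {"agent.id": agent_id, "task.type": task_type, "task.status": status}
        )

    def find(self, task_type: str, status: str) -> list:
        # Return every recorded span matching the task type and status.
        return [
            s for s in self.spans
            if s["task.type"] == task_type and s["task.status"] == status
        ]


def reviewer_next_action(store: SharedTraceStore) -> str:
    """Decide what to do purely from traces left by other agents."""
    if store.find("code_review", "complete"):
        return "idle"      # the review has already been done
    if store.find("code_generation", "complete"):
        return "review"    # generated code exists, so a review is due
    return "wait"          # nothing to react to yet


store = SharedTraceStore()
first = reviewer_next_action(store)       # nothing emitted yet
store.emit("cursor-agent-1", "code_generation", "complete")
second = reviewer_next_action(store)      # the generator's trace is now visible
store.emit("review-agent-1", "code_review", "complete")
third = reviewer_next_action(store)       # the review trace is now visible
```

The generator never addresses the reviewer. The reviewer's behaviour changes only because the shared state changed.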
otel‑ops‑pack Core Loop
1. Every Agent Operation Becomes a Telemetry Span
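from opentelemetry import trace

tracer = trace.get_tracer("otel-ops-pack")  # tracer name is illustrative

with tracer.start_as_current_span("agent_task") as span:
    span.set_attribute("agent.id", "cursor-agent-1")
    span.set_attribute("task.type", "code_generation")
    # Do work...
    # Record the outcome only after the work has finished.
    span.set_attribute("task.status", "complete")
    span.set_attribute("quality.score", 0.95)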
2. Agents Query the Shared Telemetry
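def check_prerequisites(task_id):
    """Check if prerequisite tasks are complete by querying telemetry."""
    # telemetry_client is assumed to be an already-configured client for the span store.
    # task_id is interpolated directly, so it must come from a trusted source.
    query = f"""
        SELECT status, quality_score
        FROM spans
        WHERE task_parent_id = '{task_id}'
          AND status = 'complete'
    """
    results = telemetry_client.query(query)
    return len(results) > 0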
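For readers who want to run the prerequisite check locally, the following is a rough sketch that swaps the telemetry backend for an in-memory SQLite table and binds `task_id` as a query parameter instead of formatting it into the SQL string. The table layout, the extra `task_id` column, and the sample rows are assumptions made for illustration; they are not the otel‑ops‑pack schema.

```python
import sqlite3


def make_demo_backend() -> sqlite3.Connection:
    """Build a throwaway span table (assumed schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spans ("
        "task_id TEXT, task_parent_id TEXT, status TEXT, quality_score REAL)"
    )
    conn.executemany(
        "INSERT INTO spans VALUES (?, ?, ?, ?)",
        [
            ("lint-1", "build-42", "complete", 0.95),
            ("test-1", "build-42", "complete", 0.90),
            ("lint-2", "build-43", "running", None),
        ],
    )
    return conn


def check_prerequisites(conn: sqlite3.Connection, task_id: str) -> bool:
    """Return True if at least one child span of task_id is complete."""
    rows = conn.execute(
        "SELECT status, quality_score FROM spans "
        "WHERE task_parent_id = ? AND status = 'complete'",
        (task_id,),  # bound parameter, never interpolated into the SQL text
    ).fetchall()
    return len(rows) > 0


conn = make_demo_backend()
ready = check_prerequisites(conn, "build-42")
not_ready = check_prerequisites(conn, "build-43")
injected = check_prerequisites(conn, "x' OR '1'='1")
```

Because the identifier is bound rather than interpolated, a crafted value such as the last one is treated as a literal string and matches nothing.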
3. Emergent Coordination
Agents naturally coordinate because they share a single source of truth. No explicit messages are needed, and an audit trail is built in automatically.
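As a small illustration of the built‑in audit trail, the sketch below orders recorded spans by start time to reconstruct who did what, and when. The `start_ns` field, the agent names other than `cursor-agent-1`, and the sample spans are hypothetical.

```python
from operator import itemgetter

# Hypothetical spans as they might be read back from the shared backend,
# deliberately out of order.
spans = [
    {"start_ns": 300, "agent.id": "review-agent-1",
     "task.type": "code_review", "task.status": "complete"},
    {"start_ns": 100, "agent.id": "cursor-agent-1",
     "task.type": "code_generation", "task.status": "complete"},
    {"start_ns": 200, "agent.id": "scan-agent-1",
     "task.type": "security_scan", "task.status": "complete"},
]


def audit_trail(recorded_spans: list) -> list:
    """Return one human-readable line per span, oldest first."""
    ordered = sorted(recorded_spans, key=itemgetter("start_ns"))
    return [
        f'{s["agent.id"]}: {s["task.type"]} -> {s["task.status"]}'
        for s in ordered
    ]


trail = audit_trail(spans)
for line in trail:
    print(line)
```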
Evidence‑Based Governance
On top of telemetry‑based coordination we built BossCat, an evidence‑first governance framework. Gates act as checkpoints that require concrete telemetry evidence before allowing progress.
Evidence Rule – an agent cannot simply claim “Security Check Passed.” It must provide the span ID where the security tool wrote its output, preventing hallucinated compliance.
gate_requirements:
  - name: "security_scan"
    evidence_type: "telemetry_span"
    span_name: "security_scan_complete"
    required_attributes:
      - "vulnerabilities.critical: 0"
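One way such a gate could be evaluated is sketched below: the agent submits a span ID, and the gate passes only if that span exists, has the expected name, and carries every required attribute value. This is not BossCat's actual implementation. The `gate_passes` function, the span IDs, and the in‑memory `SPANS` lookup are hypothetical, and the `"key: value"` requirement strings are parsed by a simple split on the first colon.

```python
# Gate definition mirroring the configuration above, as a plain dict.
GATE = {
    "name": "security_scan",
    "evidence_type": "telemetry_span",
    "span_name": "security_scan_complete",
    "required_attributes": ["vulnerabilities.critical: 0"],
}

# Hypothetical spans keyed by span ID, standing in for a backend lookup.
SPANS = {
    "a1b2": {"name": "security_scan_complete",
             "attributes": {"vulnerabilities.critical": 0}},
    "c3d4": {"name": "security_scan_complete",
             "attributes": {"vulnerabilities.critical": 2}},
}


def gate_passes(gate: dict, span_id: str, spans: dict) -> bool:
    """Accept the gate only if the cited span exists and meets every requirement."""
    span = spans.get(span_id)
    if span is None or span["name"] != gate["span_name"]:
        return False  # no such evidence, or evidence from the wrong operation
    for requirement in gate["required_attributes"]:
        key, _, expected = requirement.partition(":")
        actual = span["attributes"].get(key.strip())
        if actual is None or str(actual) != expected.strip():
            return False  # attribute missing or value does not match
    return True


clean_scan = gate_passes(GATE, "a1b2", SPANS)
failed_scan = gate_passes(GATE, "c3d4", SPANS)
made_up_id = gate_passes(GATE, "does-not-exist", SPANS)
```

A claim without a real span behind it fails at the lookup step, which is the behaviour the Evidence Rule describes.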
Results
- 96% of gates pass on the first attempt.
- Debug time for complex workflows dropped from hours to seconds.
- Coordination logic reduced by ≈85%.
Why This Matters
As autonomous swarms become the norm, fragile message buses cannot sustain the required reliability. Telemetry‑driven architectures are self‑documenting, self‑auditing, and self‑correcting by design, providing a robust foundation for future AI infrastructure.
Open Source
We are open‑sourcing otel‑ops‑pack to help others adopt telemetry‑first coordination for their agent systems.