We Replaced Message Buses with Telemetry for AI Agent Coordination

Published: December 22, 2025 at 10:02 AM EST

Source: Dev.to

Challenges of Traditional Message Buses

Coordination overhead – explicit message passing requires careful protocol design.
Debugging nightmares – failures are pieced together from scattered messages across agents.
Scaling issues – routing logic grows much faster than the agent count (point‑to‑point links alone grow quadratically).
State management – keeping agents synchronized demands complex state machines.

These problems became a bottleneck in our AI‑powered development workflows.

Telemetry as Coordination Mechanism

Instead of sending explicit messages, each agent emits structured telemetry to a shared observability backend (e.g., SigNoz/ClickHouse). Agents then query that telemetry to discover what other agents have done and decide their next actions.

This shifts coordination from a push model to a stigmergic model: agents react to traces left in the environment rather than to direct messages.
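
The pattern can be sketched with an in-memory stand-in for the backend. Everything in this sketch (`SpanStore`, `emit`, `find`, the agent names, the underscore attribute keys) is hypothetical and only illustrates the idea; a real deployment would write to and query the observability backend instead.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SpanStore:
    """Hypothetical in-memory stand-in for the shared telemetry backend."""
    spans: list[dict[str, Any]] = field(default_factory=list)

    def emit(self, name: str, **attributes: Any) -> None:
        # Record a finished operation as a flat dict of attributes.
        self.spans.append({"name": name, **attributes})

    def find(self, **match: Any) -> list[dict[str, Any]]:
        # Return every span whose attributes equal all of the given values.
        return [s for s in self.spans
                if all(s.get(k) == v for k, v in match.items())]


def codegen_agent(store: SpanStore, task_id: str) -> None:
    # Does its work, then leaves a trace; it never addresses the reviewer.
    store.emit("agent_task", agent_id="codegen-1", task_id=task_id,
               task_type="code_generation", status="complete")


def review_agent(store: SpanStore, task_id: str) -> bool:
    # Acts only if the environment shows that code generation has finished.
    done = store.find(task_id=task_id, task_type="code_generation",
                      status="complete")
    if not done:
        return False
    store.emit("agent_task", agent_id="review-1", task_id=task_id,
               task_type="code_review", status="complete")
    return True


store = SpanStore()
first_try = review_agent(store, "T-1")   # nothing to react to yet
codegen_agent(store, "T-1")
second_try = review_agent(store, "T-1")  # reacts to the trace left behind
```

Neither agent holds a reference to the other; the only coupling is the shape of the attributes they agree to write and read.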

otel‑ops‑pack Core Loop

1. Every Agent Operation Becomes a Telemetry Span

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("agent_task") as span:
    span.set_attribute("agent.id", "cursor-agent-1")
    span.set_attribute("task.type", "code_generation")
    # Do work...
    # Record the outcome only after the work has finished.
    span.set_attribute("task.status", "complete")
    span.set_attribute("quality.score", 0.95)

2. Agents Query Their Own Telemetry

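def check_prerequisites(task_id):
    """Return True if at least one prerequisite of task_id is recorded as complete."""
    # task_id is interpolated into the SQL text, so it must come from a trusted
    # source; prefer bound parameters if the telemetry client supports them.
    query = f"""
    SELECT status, quality_score
    FROM spans
    WHERE task_parent_id = '{task_id}'
      AND status = 'complete'
    """
    # telemetry_client is the backend client, configured elsewhere.
    results = telemetry_client.query(query)
    return len(results) > 0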
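
Interpolating `task_id` into the SQL string is fragile if the value ever comes from agent output. Below is a minimal sketch of the same check with bound parameters, using Python's built-in `sqlite3` as a stand-in for the real backend. The `spans` table layout and the sample rows are assumptions inferred from the query above, not the otel‑ops‑pack schema.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical schema: one row per span, columns inferred from the query above.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE spans ("
        "task_id TEXT, task_parent_id TEXT, status TEXT, quality_score REAL)"
    )
    conn.executemany(
        "INSERT INTO spans VALUES (?, ?, ?, ?)",
        [
            ("lint", "build-42", "complete", 0.95),
            ("tests", "build-42", "running", None),
        ],
    )
    return conn


def check_prerequisites(conn: sqlite3.Connection, task_id: str) -> bool:
    """Return True if at least one child span of task_id is complete."""
    rows = conn.execute(
        "SELECT status, quality_score FROM spans "
        "WHERE task_parent_id = ? AND status = 'complete'",
        (task_id,),  # bound parameter: the value is never parsed as SQL
    ).fetchall()
    return len(rows) > 0


conn = make_demo_db()
ready = check_prerequisites(conn, "build-42")
missing = check_prerequisites(conn, "build-99")
injected = check_prerequisites(conn, "x' OR '1'='1")
```

With binding, the injection-style string is compared as a literal task ID and matches nothing, whereas the interpolated version would have returned every completed span.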

3. Emergent Coordination

Agents naturally coordinate because they share a single source of truth. No explicit messages are needed, and an audit trail is built in automatically.

Evidence‑Based Governance

On top of telemetry‑based coordination we built BossCat, an evidence‑first governance framework. Gates act as checkpoints that require concrete telemetry evidence before allowing progress.

Evidence Rule – an agent cannot simply claim “Security Check Passed.” It must provide the span ID where the security tool wrote its output, preventing hallucinated compliance.

gate_requirements:
  - name: "security_scan"
    evidence_type: "telemetry_span"
    span_name: "security_scan_complete"
    required_attributes:
      - "vulnerabilities.critical: 0"
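
How a gate like this might be evaluated can be sketched in a few lines. The function names, the span dictionaries, the span IDs, and the parsing of `key: value` requirement strings are all assumptions for illustration; BossCat's actual evaluator is not shown in this post.

```python
from typing import Any

# The gate from the config above, as it might look once loaded.
GATE = {
    "name": "security_scan",
    "evidence_type": "telemetry_span",
    "span_name": "security_scan_complete",
    "required_attributes": ["vulnerabilities.critical: 0"],
}


def parse_requirement(req: str) -> tuple[str, str]:
    # Split "vulnerabilities.critical: 0" into ("vulnerabilities.critical", "0").
    key, _, expected = req.partition(":")
    return key.strip(), expected.strip()


def evaluate_gate(gate: dict[str, Any], span_id: str,
                  spans: dict[str, dict[str, Any]]) -> tuple[bool, str]:
    """Check the span an agent cites as evidence against the gate's rules."""
    span = spans.get(span_id)
    if span is None:
        return False, f"no span with id {span_id!r}"
    if span["name"] != gate["span_name"]:
        return False, f"span is {span['name']!r}, expected {gate['span_name']!r}"
    for req in gate["required_attributes"]:
        key, expected = parse_requirement(req)
        actual = span["attributes"].get(key)
        if str(actual) != expected:
            return False, f"{key} is {actual!r}, expected {expected}"
    return True, "evidence accepted"


# Hypothetical spans as they might come back from the telemetry backend.
spans = {
    "a1f3": {"name": "security_scan_complete",
             "attributes": {"vulnerabilities.critical": 0}},
    "b2e4": {"name": "security_scan_complete",
             "attributes": {"vulnerabilities.critical": 2}},
}

passed = evaluate_gate(GATE, "a1f3", spans)
failed = evaluate_gate(GATE, "b2e4", spans)
hallucinated = evaluate_gate(GATE, "ffff", spans)  # agent cites a span that does not exist
```

The third call is the case the Evidence Rule targets: a claim that cites a span ID with no matching record is rejected before any attribute is inspected.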

Results

96 % of gates pass on the first attempt.
Debug time for complex workflows dropped from hours to seconds.
Coordination logic reduced by ≈ 85 %.

Why This Matters

As autonomous swarms become the norm, fragile message buses cannot sustain the required reliability. Telemetry‑driven architectures are self‑documenting, self‑auditing, and self‑correcting by design, providing a robust foundation for future AI infrastructure.

Open Source

We are open‑sourcing otel‑ops‑pack to help others adopt telemetry‑first coordination for their agent systems.
