We Replaced Message Buses with Telemetry for AI Agent Coordination

Published: December 22, 2025 at 10:02 AM EST
2 min read
Source: Dev.to

Challenges of Traditional Message Buses

  • Coordination overhead – explicit message passing requires careful protocol design.
  • Debugging nightmares – failures are pieced together from scattered messages across agents.
  • Scaling issues – routing logic grows much faster than the agent count, since each new agent can add a route to every existing one.
  • State management – keeping agents synchronized demands complex state machines.

These problems became a bottleneck in our AI‑powered development workflows.

Telemetry as a Coordination Mechanism

Instead of sending explicit messages, each agent emits structured telemetry to a shared observability backend (e.g., SigNoz/ClickHouse). Agents then query that telemetry to discover what other agents have done and decide their next actions.

This replaces the push model with a stigmergic one: agents react to traces left in the environment rather than to direct messages.
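To make the stigmergic idea concrete, here is a minimal in-memory sketch. `SpanStore`, `emit`, `find`, and `reviewer_should_start` are hypothetical names used only for illustration; they are not part of otel-ops-pack or OpenTelemetry. A real setup would export spans to the observability backend and query them there.

```python
from dataclasses import dataclass, field


@dataclass
class SpanStore:
    """Hypothetical stand-in for the shared telemetry backend."""
    spans: list = field(default_factory=list)

    def emit(self, name, **attributes):
        # An agent leaves a trace in the shared environment.
        self.spans.append({"name": name, **attributes})

    def find(self, **criteria):
        # Any agent can read the traces left by others.
        return [s for s in self.spans
                if all(s.get(k) == v for k, v in criteria.items())]


def reviewer_should_start(store, task_id):
    """The reviewer acts only once it observes a completed generation span."""
    done = store.find(task_id=task_id, task_type="code_generation",
                      task_status="complete")
    return len(done) > 0


store = SpanStore()
before = reviewer_should_start(store, "T-1")   # nothing emitted yet
store.emit("agent_task", agent_id="cursor-agent-1", task_id="T-1",
           task_type="code_generation", task_status="complete")
after = reviewer_should_start(store, "T-1")    # the trace is now visible
print(before, after)
```

Neither agent addresses the other directly: the generator only writes a span, and the reviewer only reads the store.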

otel‑ops‑pack Core Loop

1. Every Agent Operation Becomes a Telemetry Span

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("agent_task") as span:
    span.set_attribute("agent.id", "cursor-agent-1")
    span.set_attribute("task.type", "code_generation")
    # Do work...
    # Record the outcome only after the work has finished
    span.set_attribute("task.status", "complete")
    span.set_attribute("quality.score", 0.95)

2. Agents Query the Shared Telemetry

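def check_prerequisites(task_id):
    """Return True if at least one completed span exists under the given parent task."""
    # telemetry_client is assumed to be an already-configured query client.
    # task_id is interpolated directly into the SQL: validate it first, or use
    # the client's parameter binding if it offers one, for untrusted input.
    query = f"""
    SELECT status, quality_score
    FROM spans
    WHERE task_parent_id = '{task_id}'
      AND status = 'complete'
    """
    results = telemetry_client.query(query)
    return len(results) > 0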
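The check above succeeds as soon as one matching span is complete. If every prerequisite must finish first, a stricter variant could look like the sketch below. It uses an in-memory SQLite database as a stand-in for the real telemetry store, which is an assumption made so the example runs on its own; the table layout simply mirrors the column names in the query above, and `all_prerequisites_complete` is a hypothetical helper name. The task ID is passed as a bound parameter rather than formatted into the SQL string.

```python
import sqlite3


def all_prerequisites_complete(conn, parent_task_id):
    """Return True only if the parent has child spans and none are incomplete."""
    total, incomplete = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(CASE WHEN status != 'complete' THEN 1 ELSE 0 END), 0)
        FROM spans
        WHERE task_parent_id = ?
        """,
        (parent_task_id,),  # bound parameter, never string-formatted
    ).fetchone()
    return total > 0 and incomplete == 0


# In-memory SQLite stands in for the real telemetry store (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spans (task_parent_id TEXT, status TEXT, quality_score REAL)"
)
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?)",
    [("T-1", "complete", 0.95), ("T-1", "running", None),
     ("T-2", "complete", 0.90)],
)

print(all_prerequisites_complete(conn, "T-1"))  # one child is still running
print(all_prerequisites_complete(conn, "T-2"))  # its only child is complete
print(all_prerequisites_complete(conn, "T-3"))  # no spans recorded at all
```

Treating "no spans yet" as not ready is a design choice here: an agent that sees an empty result waits instead of assuming there is nothing to wait for.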

3. Emergent Coordination

Agents naturally coordinate because they share a single source of truth. No explicit messages are needed, and an audit trail is built in automatically.
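As a rough illustration of the built-in audit trail, the sketch below rebuilds a per-task timeline purely from recorded spans. The span fields (`task_id`, `agent_id`, `name`, `status`, `start_ns`), the `audit_trail` helper, and the `review-agent` name are all hypothetical; a real backend would expose its own schema.

```python
def audit_trail(spans, task_id):
    """Return one readable line per span for a task, oldest first."""
    relevant = [s for s in spans if s["task_id"] == task_id]
    relevant.sort(key=lambda s: s["start_ns"])
    return [
        f'{s["start_ns"]} {s["agent_id"]} {s["name"]} -> {s["status"]}'
        for s in relevant
    ]


# Hypothetical spans, deliberately stored out of order.
spans = [
    {"task_id": "T-1", "agent_id": "review-agent", "name": "code_review",
     "status": "complete", "start_ns": 200},
    {"task_id": "T-1", "agent_id": "cursor-agent-1", "name": "code_generation",
     "status": "complete", "start_ns": 100},
    {"task_id": "T-2", "agent_id": "cursor-agent-1", "name": "code_generation",
     "status": "failed", "start_ns": 150},
]

trail = audit_trail(spans, "T-1")
for line in trail:
    print(line)
```

Because the same records drive coordination and debugging, no separate logging step is needed to answer "who did what, and in which order" for a task.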

Evidence‑Based Governance

On top of telemetry‑based coordination we built BossCat, an evidence‑first governance framework. Gates act as checkpoints that require concrete telemetry evidence before allowing progress.

Evidence Rule – an agent cannot simply claim “Security Check Passed.” It must provide the span ID where the security tool wrote its output, preventing hallucinated compliance.

gate_requirements:
  - name: "security_scan"
    evidence_type: "telemetry_span"
    span_name: "security_scan_complete"
    required_attributes:
      - "vulnerabilities.critical: 0"

Results

  • 96% of gates pass on the first attempt.
  • Debug time for complex workflows dropped from hours to seconds.
  • Coordination logic reduced by ≈ 85%.

Why This Matters

As autonomous swarms become the norm, fragile message buses cannot sustain the required reliability. Telemetry‑driven architectures are self‑documenting, self‑auditing, and self‑correcting by design, providing a robust foundation for future AI infrastructure.

Open Source

We are open‑sourcing otel‑ops‑pack to help others adopt telemetry‑first coordination for their agent systems.
