How to Turn Slow Queries into Actionable Reliability Metrics with OpenTelemetry

Published: February 4, 2026 at 05:11 AM EST
5 min read
Source: Dev.to

Introduction

Slow SQL queries degrade user experience, cause cascading failures, and turn simple operations into production incidents. The traditional fix? Collect more telemetry. But more telemetry means more things to look at, not necessarily more understanding.

Instead of treating traces as a data stream we might analyze someday, we should be opinionated about what matters at the moment of decision. As we argued in The Signal in the Storm, raw telemetry only becomes useful when we extract meaningful patterns.

In this guide you’ll build a repeatable workflow that turns OpenTelemetry database spans into span‑derived metrics you can dashboard and alert on—so you can identify what’s slow, what matters most, and what just regressed.

Use Cases

We’ll make this concrete with slow SQL queries, serving two use cases:

  • Optimization – Which queries yield the most value if made faster, weighted by traffic?
  • Incident response – Which queries are behaving abnormally right now?

Lab Overview

We’ll build a lab where your app emits OpenTelemetry traces, and we distill those into actionable metrics, starting with simple slow query detection, then adding traffic‑weighted impact, and finally anomaly detection.

Skip the theory? Jump to the Lab section below.

Why “Slow” Is a Symptom, Not a Problem

A 50 ms query might be fine for a reporting dashboard but catastrophic for checkout. As High Performance MySQL emphasizes, understanding why a query is slow determines how to fix it. The most common causes are:

  • Missing or unusable indexes – full table scans; e.g., SELECT * FROM orders WHERE customer_id = $1 grows from 20 ms at 10 K rows to minutes at 10 M rows.
  • Bad join/aggregation plans – the planner misjudges cardinality and picks the wrong join strategy.
  • Resource contention – lock contention, connection‑pool exhaustion, CPU/I/O saturation, memory pressure.
  • Plan regressions – parameter‑sensitive plans, stale statistics after bulk loads.
  • N+1 problem – many fast queries (e.g., 100 × 2 ms) add up to large latency.
  • Lack of context – database tools show what is slow, not why it matters to the user‑facing service.
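
To make the first cause concrete: the usual fix for a full table scan is an index on the filter column. A sketch in PostgreSQL, reusing the hypothetical orders table from the example above:

```sql
-- EXPLAIN shows a sequential scan when no usable index exists
EXPLAIN SELECT * FROM orders WHERE customer_id = $1;

-- An index on the filter column lets the planner switch to an index scan
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
```

CONCURRENTLY avoids blocking writes while the index builds, which matters if the table is already hot in production.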

The Value of Context‑Rich Traces

Distributed traces embed each database span in a request context (service, endpoint, user). Instead of correlating logs and traces after the fact, we can analyze slow queries directly from traces with full application context.

Building Blocks

  • OpenTelemetry Collector paired with docker-otel-lgtm (Grafana stack: Loki, Grafana, Tempo, Mimir).
  • Sample app: Go‑based “Album API” serving music album data from PostgreSQL, instrumented with otelsql.
  • Three dashboards:
    1. Queries by duration
    2. Queries weighted by traffic (impact)
    3. Anomaly detection

Lab Setup

git clone https://github.com/causely-oss/slow-query-lab
cd slow-query-lab
docker-compose up -d

Once running, open Grafana at http://localhost:3001 to explore the dashboards.

Dashboard 1 – Slow SQL by Duration

TraceQL query

{ span.db.system != "" } | select(span.db.query.text, span.db.statement)
  • Groups by root operation (API endpoint) and SQL statement.
  • Aggregates duration into mean, max, and count.
  • Sorts by average duration (slowest first).
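
If you only care about spans above a threshold, the same selector can be narrowed in TraceQL before any grouping. A sketch, where the 100 ms cutoff is an arbitrary assumption you should tune per endpoint:

```
{ span.db.system != "" && duration > 100ms } | select(span.db.query.text, span.db.statement)
```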

What You Get

  • Table of slowest queries with full application context.
  • Count of occurrences.
  • Click‑through to individual traces for debugging.

Limitation

Sorting by average duration ignores traffic volume. A query that runs 2 s five times looks “worse” than a 150 ms query that runs 10 K times, even though the latter impacts more users.

Dashboard 2 – Traffic‑Weighted Impact

Impact formula

Impact = Avg Duration × Count

The same TraceQL query is used, but we add a calculated Impact field and sort by it.
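
Outside of Grafana, the ranking itself is trivial to reproduce. A minimal sketch in Go (the query names and numbers below are made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// QueryStat mirrors one row of the dashboard table: average duration and call count.
type QueryStat struct {
	Query string
	AvgMs float64
	Count float64
}

// Impact implements the formula above: average duration multiplied by traffic.
func (q QueryStat) Impact() float64 { return q.AvgMs * q.Count }

func main() {
	stats := []QueryStat{
		{"SELECT * FROM reports", 2000, 5},                 // rare but slow
		{"SELECT * FROM albums WHERE id = $1", 150, 10000}, // fast but hot
	}
	// Sort descending by impact so the most user-visible query comes first.
	sort.Slice(stats, func(i, j int) bool { return stats[i].Impact() > stats[j].Impact() })
	for _, s := range stats {
		fmt.Printf("%-40s impact=%.0f ms\n", s.Query, s.Impact())
	}
}
```

With these numbers the hot 150 ms query (impact 1,500,000 ms) ranks above the rare 2 s one (impact 10,000 ms), which is exactly the reordering Dashboard 2 performs.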

Added Insights

  • Service breakdown – which service triggered each query.
  • Latency distribution – visualize duration over time.
  • Top queries by impact – prioritize optimization work.

Why It Matters

High‑volume, moderately slow queries surface above rare‑but‑slow ones, giving a defensible answer to “which slow queries should we optimize first?”

Dashboard 3 – Anomaly Detection

To answer “what has changed?” we distill metrics from spans using the spanmetrics connector.

Collector Configuration (excerpt)

connectors:
  spanmetrics:
    dimensions:
      - name: db.system
        default: "unknown"
      - name: db.query.text
      - name: db.statement
      - name: db.name
        default: "unknown"
    exemplars:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform, batch]
      exporters: [spanmetrics, otlphttp/lgtm]

    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [otlphttp/lgtm]

The connector generates histogram metrics for query latency, labeled by:

  • service_name – originating service
  • db_system – e.g., postgresql
  • db_query_text / db_statement – the SQL query (template)
  • db_name – database name

These metrics are stored in Mimir (Prometheus‑compatible) where we apply PromQL‑based anomaly detection.
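
Once the histograms land in Mimir, per‑query latency is a standard histogram_quantile query. A sketch; the metric name depends on your Collector version and namespace settings, so treat it as an assumption:

```promql
# p95 latency per SQL template over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (le, db_query_text) (
    rate(traces_span_metrics_duration_milliseconds_bucket[5m])
  )
)
```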

Prometheus Recording Rules (from Grafana’s Anomaly Detection framework)

  • Baseline – smoothed average of historical values.
  • Upper/Lower bands – baseline ± N standard deviations.
  • Anomaly – current value exceeds the bands.
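
The three rules above can be sketched as Prometheus recording rules. The rule and metric names here are assumptions for illustration, not the exact rules from Grafana’s framework:

```yaml
groups:
  - name: slow_query_anomaly
    rules:
      # Average query latency derived from the spanmetrics histogram
      - record: query:latency_ms:avg
        expr: >
          rate(traces_span_metrics_duration_milliseconds_sum[5m])
          / rate(traces_span_metrics_duration_milliseconds_count[5m])
      # Baseline: smoothed average over the last day
      - record: query:latency_ms:baseline
        expr: avg_over_time(query:latency_ms:avg[1d])
      # Upper band: baseline plus three standard deviations
      - record: query:latency_ms:upper_band
        expr: >
          query:latency_ms:baseline
          + 3 * stddev_over_time(query:latency_ms:avg[1d])
```

An anomaly is then simply any series where query:latency_ms:avg crosses query:latency_ms:upper_band.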

Anomaly Detection Dashboard

  • Plots current latency against adaptive baseline bands.
  • Highlights anomalies per query.
  • Shows per‑query breakdown for quick identification.

How It Answers “What Has Changed?”

  • A query that is always slow (e.g., baseline = 450 ms) does not trigger an anomaly when it stays at 450 ms.
  • A normally fast query (baseline = 50 ms) triggers an anomaly if it spikes to 200 ms.

Metrics Hygiene

  • Cardinality explosion – raw SQL in metric labels creates a series per literal value. Mitigate by:
    • Using prepared statements (captures templates, not literals).
    • Normalizing query text.
    • Setting aggregation_cardinality_limit in the spanmetrics connector.
  • Sensitive data – redact or drop sensitive attributes in the Collector before exporting.
  • Baseline warm‑up – adaptive rules need 24–48 h of data; start with wider bands and tighten over time.
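
For query‑text normalization, the Collector’s transform processor can mask literals before the spanmetrics connector sees them. A sketch; the regex is deliberately crude, and the attribute name must match what your instrumentation actually emits:

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Replace bare numeric literals so each SQL template yields one series
          - replace_pattern(attributes["db.statement"], "\\b\\d+\\b", "?")
```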

From Symptoms to Root Causes

Even with anomaly detection you’re seeing symptoms. Real‑world incidents often involve multiple symptoms that need correlation:

  • Missing index after a schema migration
  • Query‑plan regression due to stale statistics
  • Lock contention from a concurrent batch job
  • Resource pressure from a noisy neighbor
  • Upstream service degradation causing retry storms

Manual triage is possible but time‑consuming and doesn’t scale.

Enter Causely

Causely automates the pattern we built:

  1. Distill – extracts slow queries and other symptoms.
  2. Causal model – connects symptoms to system dependencies (endpoints, users, upstream services).
  3. Root‑cause identification – traces causation chains (e.g., “search queries are slow because the index was dropped”).
  4. Actionable recommendations – via AskCausely you get concrete next steps (add index, revert deployment, address upstream pressure).

The distill‑detect‑surface pipeline is the foundation; Causely extends it to full root‑cause analysis at scale.

Conclusion

By turning OpenTelemetry database spans into metrics, applying traffic‑weighted impact scoring, and layering anomaly detection, you gain:

  • Immediate visibility into slow queries with full application context.
  • Prioritization based on real user impact.
  • Automated detection of abnormal behavior.

Combine this with Causely’s causal reasoning to move from symptoms to root causes, enabling faster, data‑driven performance engineering and incident response.


Try it yourself: ask Causely about slow queries and see how it connects them to their root causes.
