How to Turn Slow Queries into Actionable Reliability Metrics with OpenTelemetry
Source: Dev.to
Introduction
Slow SQL queries degrade user experience, cause cascading failures, and turn simple operations into production incidents. The traditional fix? Collect more telemetry. But more telemetry means more things to look at, not necessarily more understanding.
Instead of treating traces as a data stream we might analyze someday, we should be opinionated about what matters at the moment of decision. As we argued in The Signal in the Storm, raw telemetry only becomes useful when we extract meaningful patterns.
In this guide you’ll build a repeatable workflow that turns OpenTelemetry database spans into span‑derived metrics you can dashboard and alert on—so you can identify what’s slow, what matters most, and what just regressed.
Use Cases
We’ll make this concrete with slow SQL queries, serving two use cases:
- Optimization – Which queries yield the most value if made faster, weighted by traffic?
- Incident response – Which queries are behaving abnormally right now?
Lab Overview
We’ll build a lab where your app emits OpenTelemetry traces and we distill those into actionable metrics: first simple slow-query detection, then traffic‑weighted impact, and finally anomaly detection.
Skip the theory? Jump to the Lab section below.
Why “Slow” Is a Symptom, Not a Problem
A 50 ms query might be fine for a reporting dashboard but catastrophic for checkout. As High Performance MySQL emphasizes, understanding why a query is slow determines how to fix it. The most common causes are:
| Problem | Description |
|---|---|
| Missing or unusable indexes | Full table scans; e.g., SELECT * FROM orders WHERE customer_id = $1 grows from 20 ms at 10 K rows to minutes at 10 M rows. |
| Bad join/aggregation plans | Planner misjudges cardinality and picks the wrong join strategy. |
| Resource contention | Lock contention, connection‑pool exhaustion, CPU/I/O saturation, memory pressure. |
| Plan regressions | Parameter‑sensitive plans, stale statistics after bulk loads. |
| N+1 problem | Many fast queries (e.g., 100 × 2 ms) add up to large latency. |
| Lack of context | Database tools show what is slow, not why it matters to the user‑facing service. |
The Value of Context‑Rich Traces
Distributed traces embed each database span in a request context (service, endpoint, user). Instead of correlating logs and traces after the fact, we can analyze slow queries directly from traces with full application context.
Building Blocks
- OpenTelemetry Collector paired with `docker-otel-lgtm` (Grafana stack: Loki, Grafana, Tempo, Mimir).
- Sample app: a Go‑based “Album API” serving music album data from PostgreSQL, instrumented with `otelsql`.
- Three dashboards:
  - Queries by duration
  - Queries weighted by traffic (impact)
  - Anomaly detection
Lab Setup
```bash
git clone https://github.com/causely-oss/slow-query-lab
cd slow-query-lab
docker-compose up -d
```
Once running, open Grafana at http://localhost:3001 to explore the dashboards.
Dashboard 1 – Slow SQL by Duration
TraceQL query
```
{ span.db.system != "" } | select(span.db.query.text, span.db.statement)
```
- Groups by root operation (API endpoint) and SQL statement.
- Aggregates duration into mean, max, and count.
- Sorts by average duration (slowest first).
What You Get
- Table of slowest queries with full application context.
- Count of occurrences.
- Click‑through to individual traces for debugging.
Limitation
Sorting by average duration ignores traffic volume. A query that runs 2 s five times looks “worse” than a 150 ms query that runs 10 K times, even though the latter impacts more users.
Dashboard 2 – Traffic‑Weighted Impact
Impact formula
Impact = Avg Duration × Count
The same TraceQL query is used, but we add a calculated Impact field and sort by it.
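To see why this reordering matters, here is a small Go sketch (the query stats are made up, echoing the 2 s × 5 vs. 150 ms × 10 K example from the previous section; nothing here is taken from the lab’s code):

```go
package main

import (
	"fmt"
	"sort"
)

// QueryStat holds aggregates distilled from database spans.
type QueryStat struct {
	Statement string
	AvgMs     float64 // mean span duration
	Count     int     // executions in the window
}

// Impact weights average duration by traffic volume.
func (q QueryStat) Impact() float64 { return q.AvgMs * float64(q.Count) }

func main() {
	stats := []QueryStat{
		{"SELECT * FROM reports ...", 2000, 5},       // rare but slow
		{"SELECT * FROM albums WHERE id = $1", 150, 10000}, // fast but hot
	}

	// Sorting by average duration puts the rare report query first...
	sort.Slice(stats, func(i, j int) bool { return stats[i].AvgMs > stats[j].AvgMs })
	fmt.Println("by avg:   ", stats[0].Statement)

	// ...but sorting by impact surfaces the high-traffic query instead.
	sort.Slice(stats, func(i, j int) bool { return stats[i].Impact() > stats[j].Impact() })
	fmt.Printf("by impact: %s (%.0f ms·count)\n", stats[0].Statement, stats[0].Impact())
}
```

Sorting by impact ranks the 150 ms query first, because 150 × 10 000 = 1 500 000 dwarfs 2000 × 5 = 10 000.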
Added Insights
- Service breakdown – which service triggered each query.
- Latency distribution – visualize duration over time.
- Top queries by impact – prioritize optimization work.
Why It Matters
High‑volume, moderately‑slow queries surface above rare‑but‑slow ones, giving a defensible answer to “which slow queries should we optimize first?”.
Dashboard 3 – Anomaly Detection
To answer “what has changed?” we distill metrics from spans using the spanmetrics connector.
Collector Configuration (excerpt)
```yaml
connectors:
  spanmetrics:
    dimensions:
      - name: db.system
        default: "unknown"
      - name: db.query.text
      - name: db.statement
      - name: db.name
        default: "unknown"
    exemplars:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [transform, batch]
      exporters: [spanmetrics, otlphttp/lgtm]
    metrics:
      receivers: [spanmetrics]
      processors: [batch]
      exporters: [otlphttp/lgtm]
```
The connector generates histogram metrics for query latency, labeled by:
- `service_name` – originating service
- `db_system` – e.g., `postgresql`
- `db_query_text` / `db_statement` – the SQL query (template)
- `db_name` – database name
These metrics are stored in Mimir (Prometheus‑compatible) where we apply PromQL‑based anomaly detection.
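For example, a p95-per-query view can be built directly from the latency histogram. Note that the exact metric name depends on the connector version and its `namespace` setting; `traces_span_metrics_duration_milliseconds_bucket` below is one common form, so adjust it to whatever your Mimir instance actually stores:

```
histogram_quantile(
  0.95,
  sum by (le, db_query_text) (
    rate(traces_span_metrics_duration_milliseconds_bucket[5m])
  )
)
```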
Prometheus Recording Rules (from Grafana’s Anomaly Detection framework)
- Baseline – smoothed average of historical values.
- Upper/Lower bands – baseline ± N standard deviations.
- Anomaly – current value exceeds the bands.
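A sketch of such rules, assuming a pre-existing recording rule `query_latency:p95` derived from the span metrics (the rule names, the 3σ factor, and the 1 h window are illustrative, not the framework’s exact definitions):

```yaml
groups:
  - name: slow-query-anomaly
    rules:
      # Baseline: smoothed average of query latency over the last hour
      - record: query_latency:baseline
        expr: avg_over_time(query_latency:p95[1h])
      # Band width: 3 standard deviations over the same window
      - record: query_latency:band
        expr: 3 * stddev_over_time(query_latency:p95[1h])
      # Anomaly: keeps the current value only while it is outside the band
      - record: query_latency:anomaly
        expr: |
          (query_latency:p95 > query_latency:baseline + query_latency:band)
          or
          (query_latency:p95 < query_latency:baseline - query_latency:band)
```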
Anomaly Detection Dashboard
- Plots current latency against adaptive baseline bands.
- Highlights anomalies per query.
- Shows per‑query breakdown for quick identification.
How It Answers “What Has Changed?”
- A query that is always slow (e.g., baseline = 450 ms) does not trigger an anomaly when it stays at 450 ms.
- A normally fast query (baseline = 50 ms) triggers an anomaly if it spikes to 200 ms.
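The band check itself reduces to a few lines; this Go sketch uses illustrative standard deviations to reproduce both cases:

```go
package main

import (
	"fmt"
	"math"
)

// anomalous reports whether the current value falls outside
// baseline ± k standard deviations.
func anomalous(current, baseline, stddev, k float64) bool {
	return math.Abs(current-baseline) > k*stddev
}

func main() {
	// A query that is always slow: baseline 450 ms, stddev 20 ms.
	fmt.Println(anomalous(450, 450, 20, 3)) // false: slow, but normal for this query
	// A normally fast query spiking: baseline 50 ms, stddev 10 ms.
	fmt.Println(anomalous(200, 50, 10, 3)) // true: 150 ms above baseline, band is only 30 ms
}
```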
Metrics Hygiene
- Cardinality explosion – raw SQL in metric labels creates a series per literal value. Mitigate by:
  - Using prepared statements (captures templates, not literals).
  - Normalizing query text.
  - Setting `aggregation_cardinality_limit` in the spanmetrics connector.
- Sensitive data – redact or drop sensitive attributes in the Collector before exporting.
- Baseline warm‑up – adaptive rules need 24–48 h of data; start with wider bands and tighten over time.
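Prepared statements (as captured by `otelsql`) are the cleanest fix; if you have to normalize raw query text yourself, here is a minimal regex-based sketch in Go (the patterns are simplistic and will not handle every SQL dialect):

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	strLit = regexp.MustCompile(`'[^']*'`)       // single-quoted string literals
	numLit = regexp.MustCompile(`\b\d+(\.\d+)?\b`) // integer and decimal literals
)

// normalize replaces literal values with placeholders so that queries
// differing only in parameters collapse into one metric label value.
func normalize(q string) string {
	q = strLit.ReplaceAllString(q, "?")
	return numLit.ReplaceAllString(q, "?")
}

func main() {
	fmt.Println(normalize("SELECT * FROM albums WHERE id = 42"))
	// → SELECT * FROM albums WHERE id = ?
	fmt.Println(normalize("SELECT * FROM albums WHERE title = 'Blue Train'"))
	// → SELECT * FROM albums WHERE title = ?
}
```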
From Symptoms to Root Causes
Even with anomaly detection you’re seeing symptoms. Real‑world incidents often involve multiple symptoms that need correlation:
- Missing index after a schema migration
- Query‑plan regression due to stale statistics
- Lock contention from a concurrent batch job
- Resource pressure from a noisy neighbor
- Upstream service degradation causing retry storms
Manual triage is possible but time‑consuming and doesn’t scale.
Enter Causely
Causely automates the pattern we built:
- Distill – extracts slow queries and other symptoms.
- Causal model – connects symptoms to system dependencies (endpoints, users, upstream services).
- Root‑cause identification – traces causation chains (e.g., “search queries are slow because the index was dropped”).
- Actionable recommendations – via AskCausely you get concrete next steps (add index, revert deployment, address upstream pressure).
The distill‑detect‑surface pipeline is the foundation; Causely extends it to full root‑cause analysis at scale.
Conclusion
By turning OpenTelemetry database spans into metrics, applying traffic‑weighted impact scoring, and layering anomaly detection, you gain:
- Immediate visibility into slow queries with full application context.
- Prioritization based on real user impact.
- Automated detection of abnormal behavior.
Combine this with Causely’s causal reasoning to move from symptoms to root causes, enabling faster, data‑driven performance engineering and incident response.
Try it yourself: ask Causely about slow queries and see how it connects them to their root causes.