AWS re:Invent 2025 - Troubleshooting database performance issues at scale (COP331)

Published: December 5, 2025 at 01:47 AM EST
3 min read
Source: Dev.to

Introduction

In this session, Joe Alioto, a Senior Specialist Solutions Architect, demonstrates how to troubleshoot database performance issues at scale. Using the story of Alex, an SRE lead at Ocktank, the presentation covers fleet‑wide observability across PostgreSQL, MySQL, and SQL Server, unified monitoring, query analysis, lock investigation, and application‑level signals for end‑to‑end transaction analysis.

Key takeaways

  • Visualizing database load with honeycomb charts.
  • Identifying problematic queries, users, and hosts.
  • Examining lock trees to uncover blocking scenarios.
  • Leveraging Database Insights and application performance monitoring (APM) to pinpoint whether issues originate upstream or within the database.

Typical troubleshooting workflow

  1. Identify affected instances from the fleet view (see the fleet‑inventory sketch after this list).
  2. Analyze queries and locks.
  3. Review relevant metrics (CPU, memory, I/O, etc.).
  4. Use APM to determine whether the root cause is in the application layer.
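
As a rough illustration of step 1, here is a minimal sketch that builds a fleet inventory with boto3 by paging through `describe_db_instances` and grouping instances by engine. The region list and instance names it prints are assumptions for the example, not anything shown in the session, which uses the console's fleet view rather than the API.

```python
"""Minimal sketch of step 1: enumerate the RDS fleet across regions."""
from collections import defaultdict

import boto3

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # assumed regions


def list_fleet(regions):
    """Return {(region, engine): [instance identifiers]} for all RDS instances."""
    fleet = defaultdict(list)
    for region in regions:
        rds = boto3.client("rds", region_name=region)
        paginator = rds.get_paginator("describe_db_instances")
        for page in paginator.paginate():
            for db in page["DBInstances"]:
                # Engine is e.g. "postgres", "mysql", or "sqlserver-se"
                fleet[(region, db["Engine"])].append(db["DBInstanceIdentifier"])
    return fleet


if __name__ == "__main__":
    for (region, engine), instances in sorted(list_fleet(REGIONS).items()):
        print(f"{region:12} {engine:15} {len(instances)} instance(s)")
```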

Meet Alex

Alex is an SRE lead at Ocktank responsible for supporting a diverse set of databases—PostgreSQL, MySQL, and Microsoft SQL Server—across multiple teams and regions. Common challenges he faces include:

  • Lack of application context when teams report “the retail service is slow.”
  • Managing a growing number of specialized tools for each database engine.
  • Quickly isolating the source of performance problems without extensive back‑and‑forth investigation.

Fleet‑Wide Monitoring

Alex uses a unified monitoring solution that provides:

  • Cross‑account and cross‑region visibility (currently three regions at a time).
  • Honeycomb visualizations showing average and maximum load per database instance. Colors range from white (low load) to red (high load).
  • Customizable fleet views, e.g., “Retail Prod Application,” that surface the top‑10 instances by load.
  • Event streams displaying restarts, failures, and severity levels.
  • Application integration metrics, such as failure rates for specific services (e.g., a 45% failure rate for the retail service).

Top‑10 metric tables can be defined by CPU, memory, disk, network I/O, or any other relevant dimension, allowing Alex to quickly spot outliers and drill down into individual instances.
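
The same kind of top‑10 ranking can be approximated outside the console. The sketch below is a minimal example that ranks a hand‑picked set of instances by average CPUUtilization over the last hour using CloudWatch `get_metric_data`; the instance identifiers are placeholders, and any other AWS/RDS metric (FreeableMemory, ReadIOPS, NetworkReceiveThroughput, ...) could be swapped in to build a different table.

```python
"""Minimal sketch: rank RDS instances by average CPU over the last hour."""
from datetime import datetime, timedelta, timezone

import boto3

INSTANCES = ["retail-prod-1", "retail-prod-2", "orders-prod-1"]  # assumed names


def top_by_cpu(instance_ids, hours=1, top_n=10):
    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    queries = [
        {
            "Id": f"cpu{i}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [
                        {"Name": "DBInstanceIdentifier", "Value": db_id}
                    ],
                },
                "Period": 300,  # 5-minute datapoints
                "Stat": "Average",
            },
        }
        for i, db_id in enumerate(instance_ids)
    ]
    resp = cloudwatch.get_metric_data(
        MetricDataQueries=queries,
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    # Map results back to instances by query Id, then average the datapoints.
    id_to_db = {f"cpu{i}": db_id for i, db_id in enumerate(instance_ids)}
    averages = {}
    for result in resp["MetricDataResults"]:
        values = result["Values"]
        averages[id_to_db[result["Id"]]] = sum(values) / len(values) if values else 0.0
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    for db_id, cpu in top_by_cpu(INSTANCES):
        print(f"{db_id:20} {cpu:5.1f}% avg CPU")
```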

Identifying High‑Load Queries

When Alex selects an instance with elevated load, the UI presents a time‑series view of query activity:

  • Granular per‑second data at the top of the screen.
  • Aggregated view for the selected time span (e.g., one hour) at the bottom.

By sorting queries, Alex discovers a recurring `SELECT * FROM orders` statement. Further inspection reveals:

  • User: dba_intern
  • Source hosts: two distinct machines
  • Application: the internal db-operations tool

Armed with the who/what/when/where information, Alex can contact the responsible team to verify whether the query is intentional or needs remediation.
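
Database Insights surfaces the same per-query load data that the Performance Insights API exposes, so the who/what attribution can also be pulled programmatically. The following is a minimal sketch, assuming the instance's Performance Insights resource identifier (the `db-...` DbiResourceId, not the instance name) and a one-hour window; it groups database load by SQL text and then by user, which is how a query run by `dba_intern` would surface.

```python
"""Minimal sketch: attribute database load to SQL statements and users."""
from datetime import datetime, timedelta, timezone

import boto3

RESOURCE_ID = "db-EXAMPLERESOURCEID"  # placeholder DbiResourceId


def top_dimension_keys(group, limit=10, hours=1):
    """Return the top contributors to db.load.avg for one dimension group."""
    pi = boto3.client("pi")
    end = datetime.now(timezone.utc)
    resp = pi.describe_dimension_keys(
        ServiceType="RDS",
        Identifier=RESOURCE_ID,
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Metric="db.load.avg",
        GroupBy={"Group": group, "Limit": limit},
    )
    return resp["Keys"]


if __name__ == "__main__":
    # "What": the statements contributing the most load.
    for key in top_dimension_keys("db.sql"):
        print(f'{key["Total"]:6.2f}  {key["Dimensions"].get("db.sql.statement", "")[:80]}')
    # "Who": the users contributing the most load. Other groups such as
    # db.host and db.application work the same way where the engine supports them.
    for key in top_dimension_keys("db.user"):
        print(f'{key["Total"]:6.2f}  {key["Dimensions"].get("db.user.name", "")}')
```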

Investigating Locks

In a separate scenario, the sales team reports that report generation is stalled. Alex locates the affected instance and opens Database Insights to examine lock activity:

  • Lock tables and lock trees display blocking relationships.
  • The view highlights which sessions hold locks and which are waiting, enabling rapid identification of deadlocks or long‑running transactions.

By correlating lock data with query and user information, Alex can determine whether the bottleneck stems from a specific application component or from database contention.
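
Database Insights renders this lock tree in the console, but the underlying relationships can be reproduced by hand for a PostgreSQL instance. The sketch below is one way to do that with psycopg2 and `pg_blocking_pids()`; the connection string is a placeholder, and MySQL or SQL Server would need their own catalog views (e.g. `sys.innodb_lock_waits`, `sys.dm_tran_locks`) instead.

```python
"""Minimal sketch: list blocked/blocking session pairs on a PostgreSQL instance."""
import psycopg2

# pg_blocking_pids() (PostgreSQL 9.6+) returns the PIDs blocking a given backend;
# joining back to pg_stat_activity yields a flat view of the lock tree.
BLOCKING_SQL = """
SELECT
    blocked.pid      AS blocked_pid,
    blocked.usename  AS blocked_user,
    blocked.query    AS blocked_query,
    blocking.pid     AS blocking_pid,
    blocking.usename AS blocking_user,
    blocking.state   AS blocking_state,
    blocking.query   AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid
ORDER BY blocking.pid;
"""


def show_lock_waits(dsn):
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(BLOCKING_SQL)
        for row in cur.fetchall():
            blocked_pid, blocked_user, blocked_q, blocking_pid, blocking_user, state, blocking_q = row
            print(f"{blocking_pid} ({blocking_user}, {state}) blocks {blocked_pid} ({blocked_user})")
            print(f"  blocking: {(blocking_q or '')[:70]}")
            print(f"  blocked:  {(blocked_q or '')[:70]}")


if __name__ == "__main__":
    # Placeholder DSN; point it at the affected instance's endpoint.
    show_lock_waits("host=example-endpoint dbname=sales user=readonly password=example")
```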

Troubleshooting Workflow Summary

  1. Fleet View: Scan honeycomb charts and top‑10 metric tables to locate anomalous instances.
  2. Instance Drill‑Down: Review time‑series query data to spot high‑frequency or long‑running statements.
  3. User & Host Attribution: Identify the users, hosts, and applications generating the load.
  4. Lock Analysis: Use Database Insights to visualize lock trees and resolve contention.
  5. Application Performance Monitoring: Cross‑reference database metrics with APM traces to confirm whether the issue originates upstream (see the sketch below).
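
The cross-reference in step 5 depends on which APM backend is in use. Assuming AWS X-Ray as the trace source, the sketch below pulls recent error and fault traces for a service so their timestamps can be lined up against the database-load spike; the service name `retail-service` and the time window are assumptions for the example.

```python
"""Minimal sketch: fetch recent error/fault traces for one service from X-Ray."""
from datetime import datetime, timedelta, timezone

import boto3


def failing_traces(service_name, minutes=60):
    """Return trace summaries for requests through `service_name` that errored or faulted."""
    xray = boto3.client("xray")
    end = datetime.now(timezone.utc)
    resp = xray.get_trace_summaries(
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        FilterExpression=f'service("{service_name}") AND (error = true OR fault = true)',
    )
    return resp["TraceSummaries"]


if __name__ == "__main__":
    # Line these trace IDs and response times up against the database-load timeline
    # to decide whether the latency originates upstream of the database.
    for trace in failing_traces("retail-service"):
        print(trace["Id"], f'{trace.get("ResponseTime", 0):.3f}s')
```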

This systematic approach reduces mean‑time‑to‑resolution (MTTR) and minimizes the need for manual, tool‑specific investigations.

Conclusion

By unifying observability across multiple database engines and integrating application‑level signals, teams like Alex’s can quickly pinpoint performance bottlenecks—whether they are caused by inefficient queries, lock contention, or upstream application issues. The workflow demonstrated in this session enables scalable, end‑to‑end troubleshooting for large fleets of databases.
