Using AI for Troubleshooting: OpenAI vs DeepSeek
Source: Dev.to
AI‑Powered Root‑Cause Analysis
AI is now a go‑to tool for everything from writing to coding. Modern LLMs are so powerful that, with the right prompt and a few adjustments, they can handle tasks almost effortlessly.
At Coroot (an open‑source project to simplify observability) we’ve developed AI (LLMs + ML) to diagnose root‑cause analysis. Finding the exact root cause is tricky because there are many possible scenarios. Many of our users aren’t experts in databases, network performance, or specific runtimes, so we need to provide clear explanations—and ideally, guidance on how to fix the problem.
In the following blog I decided to test the ability of different LLMs to diagnose a chaos experiment. Since laziness often drives progress, I skipped converting my telemetry data into a prompt and instead just uploaded a screenshot as input for the LLMs.
The result shocked me.
OpenAI GPT‑4o
Prompt (no other context provided):
Answer
From the provided PostgreSQL monitoring dashboard, it appears that an ALTER TABLE statement is causing performance issues, leading to:
- Increased Query Latency – the average query latency has spiked, especially for
pg‑cluster‑2. - I/O Load from
ALTER TABLE– the “ALTER TABLE” statement (alter table articles alter column body set not null) is consuming significant I/O (see Queries by I/O time graph). - Locked Connections and Queries
- The Postgres connections graph shows a sharp rise in locked connections.
- The Locked queries graph confirms that queries are blocked due to the
ALTER TABLEoperation. - The Blocking queries graph indicates that the
ALTER TABLEquery is holding a lock and blocking multiple other queries.
- Replication Lag – a spike, likely caused by the backlog of writes generated by the
ALTER TABLEoperation.
Why is this happening?
ALTER TABLE … SET NOT NULL is a table‑wide operation that requires a full table scan to verify that no existing row violates the new constraint. It locks the entire table and prevents reads/writes from proceeding until the operation completes. This is especially problematic in a high‑load environment where frequent reads/writes occur.
How to fix this (immediate actions)
-- Cancel the ALTER TABLE if it is causing downtime
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query LIKE '%ALTER TABLE articles alter column body set not null%';
After cancelling, check for lingering locks:
SELECT *
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;
Long‑term fixes
Because ALTER TABLE is expensive, consider the following best practices:
-
Convert to an online‑friendly approach
-- 1️⃣ Add a non‑validating check constraint (non‑blocking) ALTER TABLE articles ADD CONSTRAINT articles_body_not_null CHECK (body IS NOT NULL) NOT VALID; -- 2️⃣ Backfill missing values (if any) -- 3️⃣ Validate the constraint (still non‑blocking) ALTER TABLE articles VALIDATE CONSTRAINT articles_body_not_null; -- 4️⃣ If no violations exist, set the column NOT NULL ALTER TABLE articles ALTER COLUMN body SET NOT NULL; -
Use
pg_repackfor zero‑downtime migrationspg_repack -t articles --dbname=mydatabase -
Perform schema changes during low‑traffic periods – schedule maintenance windows for production environments.
-
Partition large tables – isolate operations on smaller segments to reduce impact.
Conclusion
The ALTER TABLE operation is locking the articles table, causing significant query delays and replication lag. The recommended approach is to:
- Cancel the operation if it is causing downtime.
- Use an online‑safe alternative (check‑constraint + validation).
- Schedule schema changes carefully and consider tools like
pg_repackfor minimal impact.
DeepSeek‑R1
Prompt: (same screenshot)
The answer
The data indicates a significant performance issue in pg‑cluster‑2, primarily caused by a long‑running ALTER TABLE command on the articles table. This operation is:
- Holding an exclusive lock on the
articlestable. - Blocking other critical queries (
SELECT,INSERT). - Causing high query latency (10‑15 s), a surge in active/locked connections, throughput drops, and minor replication lag in other clusters.
Immediate fixes
-- Find the PID of the ALTER TABLE query
SELECT pid, query
FROM pg_stat_activity
WHERE query ILIKE '%ALTER TABLE articles alter column body set not%';
-- Terminate it (if safe)
SELECT pg_terminate_backend(pid);
After termination, verify that locks have cleared:
SELECT *
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;
Longer‑term recommendations
- Follow the same online‑friendly approach described for GPT‑4o.
- Use
pg_repackor similar tools for zero‑downtime schema changes. - Schedule heavy DDL operations during maintenance windows.
- Consider table partitioning for very large, frequently‑altered tables.
Final Thoughts
Both models produced impressively accurate analyses from a single screenshot. The ability of an LLM to “understand” our domain and generate deep, actionable summaries is exciting. However, relying on a single vendor isn’t a robust solution.
DeepSeek’s R1 model proved to be a strong alternative to OpenAI’s offering, giving us confidence that multi‑vendor strategies can provide reliable, vendor‑agnostic AI‑driven observability.
DeepSeek‑R1 delivers an answer that is quite close to OpenAI’s. I’d say it is accurate enough to be a viable alternative to OpenAI’s models.
Additionally, I tested other open‑source models in my lab, but I was only able to run LLMs with up to 70 billion parameters. In this range, I didn’t find anything that came close to OpenAI or the larger DeepSeek models for our needs.
Blocking ALTER TABLE Issue
-- Find the blocking query
SELECT pid, query
FROM pg_stat_activity
WHERE query LIKE '%alter column body%';
-- Terminate it
SELECT pg_terminate_backend(pid);
Caution: Ensure this won’t corrupt data (e.g., if the ALTER is part of a migration).
Reschedule the DDL Operation
Run schema changes during low‑traffic periods or use online DDL tools such as:
pg_repack– rebuild tables without exclusive locks.CREATE INDEX CONCURRENTLY– perform index‑related changes without blocking writes.
Optimize the ALTER TABLE
If the column is being set to NOT NULL, make sure it already contains no NULL values. Pre‑run a cleanup like:
UPDATE articles
SET body = ''
WHERE body IS NULL; -- Example cleanup
Long‑Term Improvements
Use Lock Timeouts
SET LOCAL lock_timeout = '5s';
ALTER TABLE ...;
Monitor and Tune Queries
- Optimize the query
SELECT id FROM articles WHERE created ...
Is AIOps finally here?
Yes, but only if you have enough data describing your system’s behaviour.
This wouldn’t have been possible with just CPU, memory, and disk usage alone. The real magic of AI in this case is its ability to understand the bigger picture, apply domain knowledge, and suggest how to fix the issue.
Coroot’s AI‑Powered Root Cause Analysis
Coroot combines eBPF + Machine Learning to capture context that LLMs alone miss. As part of our commitment to make observability simpler and accessible for everyone, you can set it up for free with our open‑source version.
We hope it can help save you hours of digging through telemetry and make root‑cause analysis easier for your team.
