Using AI for Troubleshooting: OpenAI vs DeepSeek

Published: (January 7, 2026 at 01:52 PM EST)
6 min read
Source: Dev.to

Source: Dev.to

AI‑Powered Root‑Cause Analysis

AI is now a go‑to tool for everything from writing to coding. Modern LLMs are so powerful that, with the right prompt and a few adjustments, they can handle tasks almost effortlessly.

At Coroot (an open‑source project to simplify observability) we’ve developed AI (LLMs + ML) to diagnose root‑cause analysis. Finding the exact root cause is tricky because there are many possible scenarios. Many of our users aren’t experts in databases, network performance, or specific runtimes, so we need to provide clear explanations—and ideally, guidance on how to fix the problem.

In the following blog I decided to test the ability of different LLMs to diagnose a chaos experiment. Since laziness often drives progress, I skipped converting my telemetry data into a prompt and instead just uploaded a screenshot as input for the LLMs.

The result shocked me.

OpenAI GPT‑4o

Prompt (no other context provided):

PostgreSQL monitoring dashboard

Answer

From the provided PostgreSQL monitoring dashboard, it appears that an ALTER TABLE statement is causing performance issues, leading to:

  • Increased Query Latency – the average query latency has spiked, especially for pg‑cluster‑2.
  • I/O Load from ALTER TABLE – the “ALTER TABLE” statement (alter table articles alter column body set not null) is consuming significant I/O (see Queries by I/O time graph).
  • Locked Connections and Queries
    • The Postgres connections graph shows a sharp rise in locked connections.
    • The Locked queries graph confirms that queries are blocked due to the ALTER TABLE operation.
    • The Blocking queries graph indicates that the ALTER TABLE query is holding a lock and blocking multiple other queries.
  • Replication Lag – a spike, likely caused by the backlog of writes generated by the ALTER TABLE operation.

Why is this happening?

ALTER TABLE … SET NOT NULL is a table‑wide operation that requires a full table scan to verify that no existing row violates the new constraint. It locks the entire table and prevents reads/writes from proceeding until the operation completes. This is especially problematic in a high‑load environment where frequent reads/writes occur.

How to fix this (immediate actions)

-- Cancel the ALTER TABLE if it is causing downtime
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query LIKE '%ALTER TABLE articles alter column body set not null%';

After cancelling, check for lingering locks:

SELECT *
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;

Long‑term fixes

Because ALTER TABLE is expensive, consider the following best practices:

  1. Convert to an online‑friendly approach

    -- 1️⃣ Add a non‑validating check constraint (non‑blocking)
    ALTER TABLE articles
      ADD CONSTRAINT articles_body_not_null CHECK (body IS NOT NULL) NOT VALID;
    
    -- 2️⃣ Backfill missing values (if any)
    
    -- 3️⃣ Validate the constraint (still non‑blocking)
    ALTER TABLE articles
      VALIDATE CONSTRAINT articles_body_not_null;
    
    -- 4️⃣ If no violations exist, set the column NOT NULL
    ALTER TABLE articles
      ALTER COLUMN body SET NOT NULL;
  2. Use pg_repack for zero‑downtime migrations

    pg_repack -t articles --dbname=mydatabase
  3. Perform schema changes during low‑traffic periods – schedule maintenance windows for production environments.

  4. Partition large tables – isolate operations on smaller segments to reduce impact.

Conclusion

The ALTER TABLE operation is locking the articles table, causing significant query delays and replication lag. The recommended approach is to:

  • Cancel the operation if it is causing downtime.
  • Use an online‑safe alternative (check‑constraint + validation).
  • Schedule schema changes carefully and consider tools like pg_repack for minimal impact.

DeepSeek‑R1

Prompt: (same screenshot)

The answer

The data indicates a significant performance issue in pg‑cluster‑2, primarily caused by a long‑running ALTER TABLE command on the articles table. This operation is:

  • Holding an exclusive lock on the articles table.
  • Blocking other critical queries (SELECT, INSERT).
  • Causing high query latency (10‑15 s), a surge in active/locked connections, throughput drops, and minor replication lag in other clusters.

Immediate fixes

-- Find the PID of the ALTER TABLE query
SELECT pid, query
FROM pg_stat_activity
WHERE query ILIKE '%ALTER TABLE articles alter column body set not%';

-- Terminate it (if safe)
SELECT pg_terminate_backend(pid);

After termination, verify that locks have cleared:

SELECT *
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;

Longer‑term recommendations

  • Follow the same online‑friendly approach described for GPT‑4o.
  • Use pg_repack or similar tools for zero‑downtime schema changes.
  • Schedule heavy DDL operations during maintenance windows.
  • Consider table partitioning for very large, frequently‑altered tables.

Final Thoughts

Both models produced impressively accurate analyses from a single screenshot. The ability of an LLM to “understand” our domain and generate deep, actionable summaries is exciting. However, relying on a single vendor isn’t a robust solution.

DeepSeek’s R1 model proved to be a strong alternative to OpenAI’s offering, giving us confidence that multi‑vendor strategies can provide reliable, vendor‑agnostic AI‑driven observability.

DeepSeek‑R1 delivers an answer that is quite close to OpenAI’s. I’d say it is accurate enough to be a viable alternative to OpenAI’s models.
Additionally, I tested other open‑source models in my lab, but I was only able to run LLMs with up to 70 billion parameters. In this range, I didn’t find anything that came close to OpenAI or the larger DeepSeek models for our needs.

Blocking ALTER TABLE Issue

-- Find the blocking query
SELECT pid, query
FROM pg_stat_activity
WHERE query LIKE '%alter column body%';

-- Terminate it
SELECT pg_terminate_backend(pid);

Caution: Ensure this won’t corrupt data (e.g., if the ALTER is part of a migration).

Reschedule the DDL Operation

Run schema changes during low‑traffic periods or use online DDL tools such as:

  • pg_repack – rebuild tables without exclusive locks.
  • CREATE INDEX CONCURRENTLY – perform index‑related changes without blocking writes.

Optimize the ALTER TABLE

If the column is being set to NOT NULL, make sure it already contains no NULL values. Pre‑run a cleanup like:

UPDATE articles
SET body = ''
WHERE body IS NULL;   -- Example cleanup

Long‑Term Improvements

Use Lock Timeouts

SET LOCAL lock_timeout = '5s';
ALTER TABLE ...;

Monitor and Tune Queries

  • Optimize the query SELECT id FROM articles WHERE created ...

Is AIOps finally here?

Yes, but only if you have enough data describing your system’s behaviour.

This wouldn’t have been possible with just CPU, memory, and disk usage alone. The real magic of AI in this case is its ability to understand the bigger picture, apply domain knowledge, and suggest how to fix the issue.

Coroot’s AI‑Powered Root Cause Analysis

Coroot combines eBPF + Machine Learning to capture context that LLMs alone miss. As part of our commitment to make observability simpler and accessible for everyone, you can set it up for free with our open‑source version.

Details here.

We hope it can help save you hours of digging through telemetry and make root‑cause analysis easier for your team.

Back to Blog

Related posts

Read more »