I got tired of monitoring blind spots, so I built something to find them

Published: (March 15, 2026 at 07:04 PM EDT)
3 min read
Source: Dev.to

Problem statement

We have automated checks for code quality, security, and test coverage. For monitoring, we just hope it’s fine.

Last year I was on call when a critical service went down. It took us a while to even get paged because the alert that should have caught it had been disabled three months earlier during a maintenance window. Nobody re‑enabled it and no one noticed.

After the post‑mortem we dug through PagerDuty and Datadog configurations and discovered:

  • Escalation policies pointing to people who had left the company.
  • Services with zero alert rules (metrics were being collected but never alerted on).
  • Notification channels that weren’t referenced by any policy.
  • Dashboard panels stuck in a permanent “no data” state that everyone had learned to ignore.

We had dashboards, we had monitoring, we had alerts… but we had no way to know what we were missing.

Solution overview

I built a tool that connects to the monitoring stack (PagerDuty, Datadog, Grafana, Sentry, New Relic, etc.) and runs a gap analysis. Instead of asking “are your services up?”, it asks:

  • Do your services actually have alerts configured?
  • Are those alerts routed to useful destinations?

The system pulls configurations through each tool’s API and checks for:

  • Services with no escalation policy.
  • Alert rules without a notification channel.
  • Monitors that haven’t received data in 30+ days.
  • Scheduled searches with alerting disabled.
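
The four checks above can be sketched as a pass over a configuration snapshot. This is a minimal illustration, not the tool's actual code: the snapshot layout, the field names (`escalation_policy`, `notification_channels`, `last_data_at`, `alerting_enabled`), and the severity assigned to each check are all assumptions made for the example. A real version would first normalize each vendor's API response into a shape like this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)  # the "30+ days" threshold from the list above


@dataclass
class Finding:
    severity: str  # "critical", "warning", or "info"
    resource: str
    message: str


def find_gaps(snapshot: dict, now: datetime) -> list[Finding]:
    """Run the four gap checks over a hypothetical, already-normalized snapshot."""
    findings = []
    for svc in snapshot.get("services", []):
        if not svc.get("escalation_policy"):
            findings.append(Finding("critical", svc["name"], "no escalation policy"))
    for rule in snapshot.get("alert_rules", []):
        if not rule.get("notification_channels"):
            findings.append(Finding("critical", rule["name"], "no notification channel"))
    for mon in snapshot.get("monitors", []):
        last = mon.get("last_data_at")
        # A monitor that never reported is treated the same as a stale one.
        if last is None or now - last > STALE_AFTER:
            findings.append(Finding("warning", mon["name"], "no data in 30+ days"))
    for search in snapshot.get("scheduled_searches", []):
        if not search.get("alerting_enabled", False):
            findings.append(Finding("info", search["name"], "alerting disabled"))
    return findings
```

Keeping the checks as plain functions over normalized data means the vendor-specific API clients stay separate from the gap logic.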

Each issue is assigned a severity (critical / warning / info) and a concrete, AI-generated fix suggestion. The tool also scores your setup across several dimensions (alert coverage, notification routing, dashboard health) so you can see the biggest gaps in one place.
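
To make the scoring idea concrete, here is one simple way a per-dimension score could be computed: the percentage of checked resources that have no finding. The formula and the function name `dimension_scores` are assumptions for illustration; the post does not say how the real scores are calculated.

```python
from typing import Optional


def dimension_scores(checked: dict[str, int], failed: dict[str, int]) -> dict[str, Optional[int]]:
    """Score each dimension 0-100 as the share of checked resources without a finding.

    A dimension with nothing to check scores None rather than 100, because
    "nothing configured" is itself a gap, not a clean bill of health.
    """
    scores: dict[str, Optional[int]] = {}
    for dimension, total in checked.items():
        if total == 0:
            scores[dimension] = None
            continue
        bad = min(failed.get(dimension, 0), total)  # never count more failures than resources
        scores[dimension] = round(100 * (total - bad) / total)
    return scores


scores = dimension_scores(
    checked={"alert_coverage": 10, "notification_routing": 4, "dashboard_health": 0},
    failed={"alert_coverage": 3, "notification_routing": 4},
)
```

In this example, `alert_coverage` scores 70, `notification_routing` scores 0, and `dashboard_health` is `None` because there was nothing to evaluate.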

AI‑driven recommendations

The AI layer generates prioritized remediation steps. It also powers an Incident Autopilot: when you describe a symptom (e.g., “checkout is slow”), it maps the blast radius across services, identifies who’s on call, and builds an investigation playbook.
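
One plausible way to map a blast radius, sketched here under assumptions, is a breadth-first walk over a service dependency graph: start at the affected service and follow "who depends on this?" edges outward. The `depends_on` map and the service names are hypothetical, and the post does not describe how the Incident Autopilot actually does this.

```python
from collections import deque


def blast_radius(depends_on: dict[str, set[str]], failing: str) -> set[str]:
    """Return the failing service plus every service that depends on it, directly or not."""
    # Invert the edges so we can ask "who depends on X?" in one lookup.
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = {failing}
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for upstream in dependents.get(current, ()):
            if upstream not in affected:  # guards against cycles
                affected.add(upstream)
                queue.append(upstream)
    return affected


# Hypothetical topology: each service maps to the services it calls.
topology = {
    "web": {"checkout"},
    "checkout": {"payments", "db"},
    "payments": {"db"},
    "search": {"db"},
}
```

With this topology, a problem in `checkout` reaches `web` as well, while a problem in `db` reaches every service.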

PR/MR scanner

A recent addition integrates with GitHub/GitLab webhooks. When a PR adds a new API endpoint or database connection, the scanner flags it and suggests the monitors you should add before merging.
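
A rough sketch of the scanning step, assuming the webhook handler has already fetched the PR as unified diff text. The regular expressions are illustrative assumptions: they only catch decorator-style routes (as in Flask or FastAPI) and a few connection-string schemes, and a production scanner would need per-language rules.

```python
import re

# Assumed patterns: decorator-style routes and common database URL schemes.
ENDPOINT_RE = re.compile(r"""@\w+\.(?:route|get|post|put|delete|patch)\(\s*["']([^"']+)["']""")
DB_RE = re.compile(r"\b(?:postgres(?:ql)?|mysql|mongodb|redis)://", re.IGNORECASE)


def scan_diff(diff_text: str) -> list[str]:
    """Return monitor suggestions for endpoints and DB connections added in a unified diff."""
    suggestions = []
    for line in diff_text.splitlines():
        # Only lines added by the PR; "+++" is the file header, not content.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        added = line[1:]
        match = ENDPOINT_RE.search(added)
        if match:
            suggestions.append(
                f"new endpoint {match.group(1)}: consider latency and error-rate monitors"
            )
        if DB_RE.search(added):
            suggestions.append(
                "new database connection: consider connection-error and saturation monitors"
            )
    return suggestions


example_diff = """\
+++ b/app.py
+@app.get("/orders")
+def orders(): ...
-@app.get("/old")
+DB_URL = "postgresql://db.internal/orders"
"""
```

Calling `scan_diff(example_diff)` yields one suggestion for the `/orders` endpoint and one for the new database connection; the removed `/old` route is ignored.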

Open questions

  • Is the problem painful enough? Most teams I’ve spoken to know their monitoring has gaps, but would they connect a third‑party tool to audit it, or just accept the risk?
  • Script vs. platform: Some say “I could write a script for this.” A single script for PagerDuty escalation policies is doable, but writing and maintaining separate ones for Datadog monitors, Grafana alert rules, Sentry project configs, and so on quickly becomes a nightmare.
  • Security concerns: The system requires read-only API tokens for your monitoring tools. Tokens are encrypted at rest and never stored in plaintext, but the trust barrier is real.

If you want to try it, the live demo (no account required) is at . Click Enter Demo and explore with synthetic data.

I’d love to hear from anyone who has dealt with monitoring coverage gaps. How do you handle it today? Is it just tribal knowledge and hope?
