How I Built a Distributed Uptime Monitoring System with FastAPI

Published: March 5, 2026 at 04:34 AM EST
4 min read
Source: Dev.to

The Real Problem With Uptime Monitoring

Most uptime monitoring tools work like this:

  • A single server sends a request to your endpoint every few minutes.
  • If the request fails, the system declares downtime.

Simple. Also very wrong.

A single monitor cannot reliably determine whether an application is actually down. Network routing issues, DNS delays, or temporary congestion can produce false downtime alerts even when the service is functioning normally. In production environments, false positives create a serious problem:

  • Engineers lose trust in the monitoring system.
  • Alerts stop being useful.

When I started building TrustMonitor, the first design constraint was simple:

  • The monitoring system itself must be reliable enough to be trusted.

Architecture Overview

Instead of relying on a single monitor, the system uses a distributed verification approach. The monitoring flow looks like this:

Scheduler → Primary Monitor → Secondary Verification → Incident Recording → Signed Incident Report

Each stage reduces the probability of false alerts and ensures that incidents cannot be modified after they are recorded.

Monitor Scheduling

The system uses a scheduler responsible for dispatching monitoring jobs at defined intervals.

Each job contains:

  • `endpoint` — the URL to check
  • `expected_status` — the HTTP status code that counts as healthy
  • `timeout` — the request timeout in seconds

Example structure:

{
  "endpoint": "https://api.example.com/health",
  "expected_status": 200,
  "timeout": 5
}

The scheduler pushes these jobs into a queue where worker nodes perform the actual checks. Separating scheduling from execution prevents monitoring delays if a worker becomes slow or temporarily unavailable.
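This split can be sketched in-process with an `asyncio.Queue` standing in for whatever broker the workers actually consume from (the post doesn't name one); the job definitions below are illustrative, matching the structure shown above:

```python
import asyncio

# Hypothetical monitor configuration; field names mirror the job structure above.
MONITORS = [
    {"endpoint": "https://api.example.com/health", "expected_status": 200, "timeout": 5},
    {"endpoint": "https://api.example.com/status", "expected_status": 200, "timeout": 5},
]

async def dispatch_jobs(queue: asyncio.Queue) -> int:
    """Enqueue one check per configured monitor; returns the number dispatched."""
    for job in MONITORS:
        await queue.put(job)
    return len(MONITORS)

async def run_scheduler(queue: asyncio.Queue, interval: float = 60.0) -> None:
    """Dispatch all jobs every `interval` seconds, regardless of worker speed."""
    while True:
        await dispatch_jobs(queue)
        await asyncio.sleep(interval)
```

Because the scheduler only enqueues work, a slow or dead worker delays job execution, never job dispatch.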

Primary Monitor

The primary monitor sends the initial request to the target endpoint. In the current implementation, this is handled using FastAPI workers running asynchronous HTTP checks.

Example simplified check:

import httpx

async def check_endpoint(url: str, expected_status: int = 200, timeout: float = 5) -> bool:
    """Return True if the endpoint answers with the expected status in time."""
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            response = await client.get(url)
    except httpx.HTTPError:
        # Timeouts and connection errors count as a failed check.
        return False
    return response.status_code == expected_status

If the response matches the expected conditions, the monitor records a successful check. If not, the system does not immediately declare downtime—this is where most monitoring tools fail.

Secondary Verification

Before an incident is recorded, a secondary verification monitor repeats the check. This step confirms whether the failure is real or caused by temporary network conditions.

Verification logic:

Primary monitor detects a failure
    ↓
Secondary monitor repeats the check
    ↓
If the failure is confirmed → record the incident
If the re-check succeeds → discard the false positive

This simple mechanism significantly reduces false downtime alerts.
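The decision rule can be sketched independently of any HTTP library by treating each monitor as an async callable that returns `True` when the endpoint looks healthy (the names below are illustrative, not from the project):

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical shape of a monitor: takes a URL, returns True if healthy.
Check = Callable[[str], Awaitable[bool]]

async def confirm_failure(url: str, primary: Check, secondary: Check) -> bool:
    """Confirm an outage only when both monitors agree the endpoint is down."""
    if await primary(url):
        return False   # primary check passed; nothing to do
    if await secondary(url):
        return False   # secondary check passed; ignore the false positive
    return True        # both checks failed; incident confirmed
```

Only a `True` result here should reach the incident-recording stage.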

Incident Recording

Once the failure is verified, the system records an incident containing:

  • timestamp
  • endpoint
  • failure reason
  • verification results

Example incident structure:

{
  "endpoint": "api.example.com",
  "status": "DOWN",
  "timestamp": "2026-03-05T10:20:15Z",
  "verified": true
}

Recording incidents alone is not enough; monitoring systems must also guarantee data integrity.

Cryptographic Incident Signing

A key design decision in TrustMonitor is that incident records are cryptographically signed. This prevents incidents from being altered later. Each incident is hashed using a cryptographic digest.

Conceptual flow:

incident_data → SHA256 → incident_signature

The signature proves that the incident existed at a specific time and has not been modified. This is useful for:

  • post‑incident audits
  • SLA verification
  • infrastructure debugging
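A minimal sketch of the digest step, assuming incidents are serialized canonically before hashing (the function name is illustrative):

```python
import hashlib
import json

def sign_incident(incident: dict) -> str:
    """Return a SHA-256 digest over a canonical serialization of the incident.

    Sorting keys and using compact separators makes the serialization
    deterministic, so the same incident always yields the same digest.
    """
    canonical = json.dumps(incident, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any later modification to the record changes the digest, so a stored incident can always be re-verified against its signature.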

Lessons Learned

Single-location monitoring is unreliable
Network issues happen constantly. A single monitor cannot determine service health with certainty. Verification layers are essential.

Monitoring systems must be trustworthy
If alerts generate too many false positives, engineers eventually ignore them. A monitoring system that isn’t trusted is worse than having none at all.

Incident integrity matters
Monitoring data should be tamper‑resistant. Signed incidents create verifiable records of infrastructure events.

Final Thoughts

Monitoring infrastructure sounds simple on paper. In practice, reliability requires careful design around:

  • verification
  • distributed checks
  • incident integrity

TrustMonitor is still evolving, but building it has already surfaced interesting engineering challenges around monitoring accuracy and system trust.

Future improvements will focus on:

  • multi‑region verification
  • anomaly detection
  • improved alert reliability

Because in monitoring systems, trust is everything.
