How I Built a Distributed Uptime Monitoring System with FastAPI

Published: March 5, 2026 at 04:34 AM EST
4 min read
Source: Dev.to

The Real Problem With Uptime Monitoring

Most uptime monitoring tools work like this:

  • A single server sends a request to your endpoint every few minutes.
  • If the request fails, the system declares downtime.

Simple. Also very wrong.

A single monitor cannot reliably determine whether an application is actually down. Network routing issues, DNS delays, or temporary congestion can produce false downtime alerts even when the service is functioning normally. In production environments, false positives create a serious problem:

  • Engineers lose trust in the monitoring system.
  • Alerts stop being useful.

When I started building TrustMonitor, the first design constraint was simple:

  • The monitoring system itself must be reliable enough to be trusted.

Architecture Overview

Instead of relying on a single monitor, the system uses a distributed verification approach. The monitoring flow looks like this:

Scheduler → Primary Monitor → Secondary Verification → Incident Recording → Signed Incident Report

Each stage reduces the probability of false alerts and ensures that incidents cannot be modified after they are recorded.

Monitor Scheduling

The system uses a scheduler responsible for dispatching monitoring jobs at defined intervals.

Each job contains:

  • `endpoint` — the URL to check
  • `expected_status` — the HTTP status code that counts as healthy
  • `timeout` — the request timeout in seconds

Example structure:

{
  "endpoint": "https://api.example.com/health",
  "expected_status": 200,
  "timeout": 5
}

The scheduler pushes these jobs into a queue where worker nodes perform the actual checks. Separating scheduling from execution prevents monitoring delays if a worker becomes slow or temporarily unavailable.
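This split can be sketched in-process with an `asyncio.Queue` standing in for whatever broker the workers actually consume from (the post doesn't name one); the job definitions below are illustrative, matching the structure shown above:

```python
import asyncio

# Hypothetical monitor configuration; field names mirror the job structure above.
MONITORS = [
    {"endpoint": "https://api.example.com/health", "expected_status": 200, "timeout": 5},
    {"endpoint": "https://api.example.com/status", "expected_status": 200, "timeout": 5},
]

async def dispatch_jobs(queue: asyncio.Queue) -> int:
    """Enqueue one check per configured monitor; returns the number dispatched."""
    for job in MONITORS:
        await queue.put(job)
    return len(MONITORS)

async def run_scheduler(queue: asyncio.Queue, interval: float = 60.0) -> None:
    """Dispatch all jobs every `interval` seconds, regardless of worker speed."""
    while True:
        await dispatch_jobs(queue)
        await asyncio.sleep(interval)
```

Because the scheduler only enqueues work, a slow or dead worker delays job execution, never job dispatch.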

Primary Monitor

The primary monitor sends the initial request to the target endpoint. In the current implementation, this is handled using FastAPI workers running asynchronous HTTP checks.

Example simplified check:

import httpx

async def check_endpoint(url: str, expected_status: int = 200, timeout: float = 5) -> bool:
    """Return True if the endpoint answers with the expected status in time."""
    try:
        async with httpx.AsyncClient(timeout=timeout) as client:
            response = await client.get(url)
    except httpx.HTTPError:
        # Timeouts and connection errors count as a failed check.
        return False
    return response.status_code == expected_status

If the response matches the expected conditions, the monitor records a successful check. If not, the system does not immediately declare downtime—this is where most monitoring tools fail.

Secondary Verification

Before an incident is recorded, a secondary verification monitor repeats the check. This step confirms whether the failure is real or caused by temporary network conditions.

Verification logic:

Primary monitor detects a failure
    ↓
Secondary monitor repeats the check
    ↓
If the failure is confirmed → record the incident
If the re-check succeeds → discard the false positive

This simple mechanism significantly reduces false downtime alerts.
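The decision rule can be sketched independently of any HTTP library by treating each monitor as an async callable that returns `True` when the endpoint looks healthy (the names below are illustrative, not from the project):

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical shape of a monitor: takes a URL, returns True if healthy.
Check = Callable[[str], Awaitable[bool]]

async def confirm_failure(url: str, primary: Check, secondary: Check) -> bool:
    """Confirm an outage only when both monitors agree the endpoint is down."""
    if await primary(url):
        return False   # primary check passed; nothing to do
    if await secondary(url):
        return False   # secondary check passed; ignore the false positive
    return True        # both checks failed; incident confirmed
```

Only a `True` result here should reach the incident-recording stage.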

Incident Recording

Once the failure is verified, the system records an incident containing:

  • timestamp
  • endpoint
  • failure reason
  • verification results

Example incident structure:

{
  "endpoint": "api.example.com",
  "status": "DOWN",
  "timestamp": "2026-03-05T10:20:15Z",
  "verified": true
}

Recording incidents alone is not enough; monitoring systems must also guarantee data integrity.

Cryptographic Incident Signing

A key design decision in TrustMonitor is that incident records are cryptographically signed. This prevents incidents from being altered later. Each incident is hashed using a cryptographic digest.

Conceptual flow:

incident_data → SHA256 → incident_signature

The signature proves that the incident existed at a specific time and has not been modified. This is useful for:

  • post‑incident audits
  • SLA verification
  • infrastructure debugging
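A minimal sketch of the digest step, assuming incidents are serialized canonically before hashing (the function name is illustrative):

```python
import hashlib
import json

def sign_incident(incident: dict) -> str:
    """Return a SHA-256 digest over a canonical serialization of the incident.

    Sorting keys and using compact separators makes the serialization
    deterministic, so the same incident always yields the same digest.
    """
    canonical = json.dumps(incident, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any later modification to the record changes the digest, so a stored incident can always be re-verified against its signature.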

Lessons Learned

Single-location monitoring is unreliable
Network issues happen constantly. A single monitor cannot determine service health with certainty. Verification layers are essential.

Monitoring systems must be trustworthy
If alerts generate too many false positives, engineers eventually ignore them. A monitoring system that isn’t trusted is worse than having none at all.

Incident integrity matters
Monitoring data should be tamper‑resistant. Signed incidents create verifiable records of infrastructure events.

Final Thoughts

Monitoring infrastructure sounds simple on paper. In practice, reliability requires careful design around:

  • verification
  • distributed checks
  • incident integrity

TrustMonitor is still evolving, but building it has already surfaced interesting engineering challenges around monitoring accuracy and system trust.

Future improvements will focus on:

  • multi‑region verification
  • anomaly detection
  • improved alert reliability

Because in monitoring systems, trust is everything.
