Observability in Practice: Real-World Monitoring with Python and Prometheus

Published: December 4, 2025, 10:19 (GMT+8)


Original article: Dev.to


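What Is Observability?

Observability is the ability to understand a system's internal state from the data it produces. It is built on three core pillars:

1. Metrics

Numeric values that reflect the state of a system.
Examples: request latency, CPU usage, memory consumption.

2. Logs

Detailed event records generated by applications and systems.
Examples: authentication events, errors, warnings.

3. Traces

End-to-end tracking of a request as it moves across services.
Especially useful in microservices and distributed systems.

Together, they help answer:

What is happening?
Why is it happening?
Where did it fail?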

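Why Observability Matters

Detect problems earlier
Reduce downtime
Improve performance
Understand user impact
Monitor applications at scale
Make data-driven decisions

Without observability, debugging becomes slow, reactive, and inconsistent.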

Practical Example: Observability with Python + Prometheus

Install dependencies

pip install fastapi uvicorn prometheus-client

A Python API with Prometheus metrics

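import random
import time

from fastapi import FastAPI
from fastapi.responses import Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = FastAPI()

# Counter: a value that only goes up; here, the number of requests handled.
REQUEST_COUNT = Counter("api_requests_total", "Total number of API requests received")
# Histogram: sorts observed durations into buckets so quantiles can be derived later.
REQUEST_LATENCY = Histogram("api_request_latency_seconds", "API request latency")

@app.get("/")
def home():
    REQUEST_COUNT.inc()
    # time() records how long the enclosed block takes, in seconds.
    with REQUEST_LATENCY.time():
        time.sleep(random.uniform(0.1, 0.5))  # simulate 100-500 ms of work
        return {"message": "API is running successfully"}

@app.get("/metrics")
def metrics():
    # Serve all registered metrics in the Prometheus text exposition format.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)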

Exposed metrics

Metric	Description
`api_requests_total`	Counts requests handled by the instrumented `/` endpoint
`api_request_latency_seconds`	Measures request duration in seconds
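
As a rough illustration of what a counter and a histogram expand to when they are exposed, the sketch below records three observations in a throwaway registry and reads the resulting series back. The `demo_` metric names, the private `CollectorRegistry`, and the bucket boundaries are assumptions made for this example; they are not part of the API above.

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

# A private registry keeps this demo separate from the default global one.
registry = CollectorRegistry()

demo_requests = Counter(
    "demo_requests", "Demo request counter", registry=registry
)
demo_latency = Histogram(
    "demo_latency_seconds",
    "Demo request latency",
    buckets=(0.1, 0.5, 1.0),  # hypothetical boundaries, chosen for readability
    registry=registry,
)

# Simulate three requests with known durations (in seconds).
for duration in (0.2, 0.3, 0.7):
    demo_requests.inc()
    demo_latency.observe(duration)

# A counter is exposed with a "_total" suffix.
total = registry.get_sample_value("demo_requests_total")

# A histogram is exposed as cumulative "_bucket" series plus "_count" and "_sum".
le_half = registry.get_sample_value("demo_latency_seconds_bucket", {"le": "0.5"})
le_inf = registry.get_sample_value("demo_latency_seconds_bucket", {"le": "+Inf"})
count = registry.get_sample_value("demo_latency_seconds_count")
observed_sum = registry.get_sample_value("demo_latency_seconds_sum")

print(total, le_half, le_inf, count, observed_sum)
```

Because the buckets are cumulative, the `le="+Inf"` bucket always matches the `_count` series. This is what lets Prometheus derive quantiles from bucket counts alone.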

Prometheus configuration

Create prometheus.yml:

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: "python-api"
    static_configs:
      - targets: ["localhost:8000"]

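With this configuration, Prometheus scrapes the target's metrics endpoint every 5 seconds. Because no `metrics_path` is set, it uses the default `/metrics` path on `localhost:8000`.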
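
To preview what a scrape of this target might return, the following sketch fetches the endpoint and parses the text format into a dictionary. It uses only the standard library. The URL `http://localhost:8000/metrics` is an assumption (plain HTTP plus the default metrics path), and the parser is deliberately simplified: it assumes sample lines carry no trailing timestamp.

```python
from urllib.request import urlopen


def parse_exposition(text):
    """Parse Prometheus text-format lines into {series: value}.

    Simplified on purpose: assumes no trailing timestamps on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last space-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape(url="http://localhost:8000/metrics", timeout=5):
    """Fetch and parse one scrape; the URL is an assumption based on the config."""
    with urlopen(url, timeout=timeout) as response:
        return parse_exposition(response.read().decode("utf-8"))


# Offline example, so the parser can be tried without a running API.
sample = """\
# HELP api_requests_total Total number of API requests received
# TYPE api_requests_total counter
api_requests_total 3.0
api_request_latency_seconds_bucket{le="0.5"} 2.0
api_request_latency_seconds_bucket{le="+Inf"} 3.0
"""
parsed = parse_exposition(sample)
print(parsed["api_requests_total"])
```

Calling `scrape()` while the API is running should return the same kind of dictionary built from live data.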

Run Prometheus

./prometheus --config.file=prometheus.yml

Open the Prometheus UI (by default at http://localhost:9090) and query metrics, for example:

api_requests_total
rate(api_requests_total[1m])
api_request_latency_seconds_bucket
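
To make the `rate(...)` query less of a black box, here is a simplified sketch of the idea behind it: the per-second increase of a counter over a time window, with counter resets taken into account. The function name `simple_rate` and the sample data are hypothetical, and unlike Prometheus this version does not extrapolate to the window boundaries.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter over a window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs in time
    order. Unlike Prometheus, this sketch does not extrapolate to the window
    boundaries; it only handles counter resets.
    """
    if len(samples) < 2:
        return None  # not enough data points in the window
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. a process restart): count from zero.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Five scrapes, 5 seconds apart; the counter resets between t=10 and t=15.
window = [(0, 100), (5, 110), (10, 120), (15, 5), (20, 15)]
print(simple_rate(window))
```

The reset handling is why `rate()` is meant for counters: a restart of the API does not show up as a large negative spike.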

Optional: Grafana dashboards

Grafana can visualize your Prometheus metrics. Common panels include:

Request rate
CPU and memory usage
Error ratio
Latency quantiles (p95, p99)
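
Latency quantiles such as p95 are not stored directly; they are estimated from the histogram's cumulative buckets. The sketch below shows roughly how that estimate works, using linear interpolation inside the bucket that contains the target rank. It is a simplified illustration of the approach behind PromQL's `histogram_quantile`, not its exact implementation, and the function name and bucket data are made up for the example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        # Pick the first non-empty bucket whose cumulative count reaches the rank.
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # Falls in the overflow bucket: report the last finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical latency buckets (seconds) covering 100 requests.
buckets = [(0.1, 10), (0.25, 60), (0.5, 90), (1.0, 100), (math.inf, 100)]
p50 = estimate_quantile(0.50, buckets)
p95 = estimate_quantile(0.95, buckets)
print(p50, p95)
```

The estimate can only be as precise as the bucket boundaries, so it helps to place boundaries near the latencies you care about, such as an alert threshold.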

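Observability best practices

✔ Instrument every major endpoint: expose metrics for performance-critical APIs.
✔ Standardize metric names: avoid ad hoc or inconsistently structured names.
✔ Add labels: fields such as status_code, endpoint, and method provide richer context.
✔ Use alerts: for example, "95th-percentile latency above 500 ms for 3 minutes".
✔ Visualize everything: dashboards make patterns obvious at a glance.
✔ Combine logs, metrics, and traces: observability works best when all three pillars are in place.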
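
The labeling advice can be sketched with `prometheus_client` as follows. The metric name `demo_http_requests`, the helper `record_request`, and the private registry are assumptions for illustration; the label names are the ones suggested in the list.

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()  # private registry, used only for this demo

# Hypothetical metric name; the label names follow the suggestion above.
HTTP_REQUESTS = Counter(
    "demo_http_requests",
    "HTTP requests by method, endpoint, and status code",
    ["method", "endpoint", "status_code"],
    registry=registry,
)


def record_request(method, endpoint, status_code):
    """Increment the child counter for one label combination."""
    HTTP_REQUESTS.labels(
        method=method, endpoint=endpoint, status_code=str(status_code)
    ).inc()


record_request("GET", "/", 200)
record_request("GET", "/", 200)
record_request("GET", "/", 500)

ok_count = registry.get_sample_value(
    "demo_http_requests_total",
    {"method": "GET", "endpoint": "/", "status_code": "200"},
)
error_count = registry.get_sample_value(
    "demo_http_requests_total",
    {"method": "GET", "endpoint": "/", "status_code": "500"},
)
print(ok_count, error_count)

# Each label combination becomes its own time series in the exposed output.
print(generate_latest(registry).decode("utf-8"))
```

Since every distinct label combination creates a separate series, labels are best kept to values with a small, bounded set (route templates rather than raw URLs, for instance).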

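Conclusion

Observability gives teams deep insight into how their systems behave. With Prometheus + FastAPI, you can expose valuable metrics that support:

Faster debugging
Better performance insight
Safer deployments
Scalable system monitoring

This example can be extended with tracing (OpenTelemetry), a logging pipeline (ELK Stack), or a full-stack observability platform such as AWS CloudWatch, Datadog, or Azure Monitor.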
