[Paper] A monitoring system for collecting and aggregating metrics from distributed clouds

Published: March 5, 2026 at 09:56 AM EST
4 min read
Source: arXiv


Overview

The paper presents a full‑stack monitoring system built for distributed clouds (DCs)—a new cloud paradigm where ad‑hoc, geographically‑spread micro‑clouds can be spun up and torn down on demand. By instrumenting every node with lightweight agents and funneling the data into a central control plane, the authors deliver real‑time observability across machines, containers, and applications, making DCs practical for latency‑sensitive, data‑intensive workloads.

Key Contributions

  • Unified metric collection across three layers (host, container, application) via on‑node agents.
  • Health‑check protocol that securely pushes metrics from nodes to the DC control plane without adding significant overhead.
  • Multi‑modal API surface (REST, gRPC, streaming) enabling diverse consumers (dashboards, autoscalers, CI pipelines).
  • Per‑DC aggregation that synthesizes node‑level data into high‑level health indicators and capacity forecasts.
  • Open‑source reference implementation that can be dropped into existing Kubernetes‑based or container‑centric DC deployments.

Methodology

  1. Agent Design – A small daemon runs on each compute node, leveraging existing Linux tooling (cAdvisor, Prometheus exporters) to scrape CPU, memory, network, and custom application metrics.
  2. Data Transport – During the periodic health‑check, agents batch metrics and send them over TLS‑encrypted HTTP/2 to the DC’s control‑plane service.
  3. Persistence & Indexing – The control plane stores raw time‑series in a scalable columnar store (e.g., ClickHouse) while also maintaining aggregated summaries in an in‑memory cache for low‑latency queries.
  4. API Layer – Three endpoints are exposed:
    • REST for pull‑based queries and historical analysis.
    • gRPC for low‑overhead programmatic access.
    • WebSocket/Server‑Sent Events for continuous streaming of live metrics.
  5. Aggregation Logic – Metrics from all nodes belonging to the same DC are combined using weighted averages, percentile calculations, and anomaly detection heuristics to produce a “DC health score”.
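Steps 1–2 above (on-node collection and the batched health-check push) can be sketched as follows. This is a minimal illustration, not the paper's actual wire format: the payload shape, metric names, and endpoint URL are assumptions, and the stubbed values stand in for what cAdvisor and Prometheus exporters would report. In the paper's design the batch is sent over TLS-encrypted HTTP/2 to the DC control plane.

```python
import json
import time

def collect_metrics(node_id: str) -> dict:
    """Gather one sample across the three layers (host, container, app).
    Values are stubbed; a real agent would scrape local exporters."""
    return {
        "node": node_id,
        "timestamp": time.time(),
        "host": {"cpu_pct": 12.5, "mem_pct": 40.1},
        "container": {"web": {"cpu_pct": 3.2}},
        "app": {"requests_per_s": 87},
    }

def build_health_check_payload(samples: list) -> bytes:
    """Batch several samples into a single health-check body."""
    return json.dumps({"samples": samples}).encode()

# The agent would POST this body to the control plane during the periodic
# health check, e.g. https://control-plane.example/dc/<dc-id>/health
# (hypothetical URL).
batch = [collect_metrics("node-1"), collect_metrics("node-1")]
payload = build_health_check_payload(batch)
```

Batching amortizes connection overhead across many metrics, which is how the system keeps the per-node bandwidth cost low.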
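The aggregation in step 5 can be illustrated with a toy reduction over node-level samples. The weighting scheme (by core count) and the score formula below are illustrative assumptions, and the paper's anomaly-detection heuristics are omitted; only the general shape (weighted averages, percentiles, a single scalar health indicator) follows the description above.

```python
def weighted_average(values, weights):
    """Weighted mean of per-node values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def percentile(values, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

def dc_health_score(cpu_pcts, weights):
    """Toy score: 100 minus the weighted mean CPU utilisation, floored at 0."""
    return max(0.0, 100.0 - weighted_average(cpu_pcts, weights))

cpu = [20.0, 35.0, 90.0]   # per-node CPU utilisation (%)
capacity = [4, 4, 8]       # weight nodes by core count (assumption)
score = dc_health_score(cpu, capacity)
```

The key point is that the control plane reduces many node-level series to a handful of DC-level indicators, so consumers never need to fan out queries to individual nodes.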

The authors evaluated the system on a testbed of 50 nodes spread across three geographic regions, measuring both the overhead introduced by the agents and the latency of metric delivery.

Results & Findings

Metric                                     Baseline (no monitoring)   With monitoring              Δ
CPU overhead per node                      0 %                        1.2 %                        +1.2 %
Network traffic (health‑check payload)     0 KB/s                     45 KB/s                      +45 KB/s
End‑to‑end metric latency (node → API)     –                          ≈ 150 ms (99th percentile)   –
Query latency for aggregated DC view       –                          ≤ 30 ms (cached)             –
  • Negligible performance impact: The agents consume <2 % of CPU and <50 KB/s of bandwidth, well within typical cloud VM budgets.
  • Fast delivery: Real‑time streaming APIs provide sub‑200 ms latency, enabling responsive autoscaling and SLA monitoring.
  • Scalable aggregation: Even with 10 k metrics per node, the control plane sustains >10 k queries per second without degradation.

Practical Implications

  • Autoscaling & Self‑Healing: Developers can hook the streaming API into custom controllers that trigger scale‑out/in actions or restart faulty containers the moment an anomaly is detected.
  • SLA Verification: Service owners can query the aggregated DC health score to prove compliance with latency or availability guarantees across regions.
  • Cost Optimization: By correlating resource usage with workload patterns at the DC level, operators can right‑size ad‑hoc clouds before they become financially wasteful.
  • Multi‑Tenant Observability: The API’s fine‑grained namespace support lets different teams or customers view only their own metrics while the platform retains a global view for capacity planning.
  • Plug‑and‑Play Integration: Because the agents rely on standard exporters, existing Prometheus‑based tooling can be reused, shortening the learning curve for DevOps teams.
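The autoscaling hook described above can be sketched as a controller that consumes the live metric stream and maps each DC-level reading to an action. This is a hedged illustration: the event shape, thresholds, and action names are assumptions, and the recorded event list below stands in for the WebSocket/SSE stream a real controller would subscribe to.

```python
def scale_decision(event: dict, high: float = 80.0, low: float = 20.0) -> str:
    """Map one streamed DC-level CPU reading to a scaling action."""
    cpu = event["dc_cpu_pct"]
    if cpu > high:
        return "scale-out"
    if cpu < low:
        return "scale-in"
    return "hold"

# In production this loop would read from the streaming API; here we replay
# a recorded sequence of events to show the control flow.
stream = [{"dc_cpu_pct": 15.0}, {"dc_cpu_pct": 55.0}, {"dc_cpu_pct": 92.0}]
actions = [scale_decision(e) for e in stream]
```

Because decisions are driven by the aggregated DC view rather than raw node metrics, the controller reacts to cluster-wide pressure instead of single-node noise; a production version would also debounce decisions to avoid flapping.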

Limitations & Future Work

  • Security Scope: The current prototype assumes a trusted internal network; extending the model to zero‑trust environments (mutual TLS, fine‑grained RBAC) is left for future work.
  • Edge‑Scale Evaluation: Tests were limited to 50 nodes; scaling to thousands of edge devices may expose bottlenecks in the aggregation pipeline.
  • Metric Semantics: The system aggregates raw numbers but does not yet support higher‑level intent (e.g., “user‑perceived latency”) that could be derived from application logs.
  • Dynamic Topology: While the health‑check protocol tolerates node churn, the authors note that rapid, large‑scale reconfiguration of DC boundaries could stress the control plane’s consistency guarantees.

The authors plan to open‑source the full stack, add richer security primitives, and benchmark the platform on a truly global edge testbed in upcoming releases.

Authors

  • Tamara Ranković
  • Mateja Rilak
  • Janko Rakonjac
  • Miloš Simić

Paper Information

  • arXiv ID: 2603.05241v1
  • Categories: cs.DC
  • Published: March 5, 2026
