[Paper] A Cascaded Graph Neural Network for Joint Root Cause Localization and Analysis in Edge Computing Environments

Published: March 1, 2026 at 11:57 PM EST
4 min read
Source: arXiv - 2603.01447v1

Overview

The paper presents a cascaded Graph Neural Network (GNN) designed to pinpoint faulty microservices (root‑cause localization) and identify the type of fault (root‑cause analysis) in large‑scale edge‑computing environments. By breaking down massive service dependency graphs into smaller, tightly‑coupled clusters, the authors achieve near‑constant inference latency while keeping diagnostic accuracy on par with heavyweight centralized models—an advance that could make real‑time AIOps feasible at the edge.

Key Contributions

  • Communication‑driven graph clustering that automatically partitions a full service graph into high‑interaction communities, reducing the amount of data each GNN must process.
  • Cascaded GNN architecture with two subnetworks: the first performs coarse‑grained root‑cause localization across clusters, and the second refines the diagnosis within the selected cluster for fault‑type classification.
  • Scalability evidence: empirical evaluation shows inference time remains almost flat as the number of services grows, unlike traditional centralized GNNs, whose latency scales linearly.
  • Benchmarking on realistic workloads using the MicroCERCL suite and large synthetic datasets generated by the iAnomaly simulator, demonstrating comparable accuracy to state‑of‑the‑art centralized models.
  • AIOps‑ready design that can be deployed on edge nodes with limited compute and memory, enabling on‑device anomaly diagnostics.

Methodology

  1. Graph Construction – Each microservice instance is a node; edges encode request‑level communication (e.g., RPC calls, message queues). Metrics such as latency, CPU, and error rates are attached as node/edge features.
  2. Clustering Phase – A lightweight, communication‑aware clustering algorithm (similar to modularity‑based community detection) groups nodes that exchange the most traffic. The result is a set of sub‑graphs that preserve the most critical dependencies while discarding weak cross‑cluster links.
  3. Cascaded GNN
    • Stage 1 (Localization) – A GNN operates on the cluster‑level graph (each cluster becomes a super‑node). It predicts which cluster likely contains the fault.
    • Stage 2 (Analysis) – A second, finer‑grained GNN runs only on the selected cluster’s internal graph to output the exact faulty service and its fault type (e.g., CPU saturation, memory leak, network partition).
  4. Training – The model is trained end‑to‑end on labeled anomaly traces from the MicroCERCL benchmark, using a multi‑task loss that jointly optimizes localization and classification accuracy.
  5. Inference – At runtime, only the relevant sub‑graph is loaded into memory, dramatically cutting message‑passing overhead and enabling sub‑second diagnosis even on modest edge hardware.
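The paper's clustering algorithm is not reproduced here, but the core idea of step 2, keeping only high-interaction links and letting the remaining connected components form the communities, can be sketched in a few lines. The traffic weights and threshold below are illustrative assumptions, not values from the paper:

```python
from collections import defaultdict

def cluster_by_traffic(edges, min_traffic):
    """Partition a service graph into communities by dropping edges
    whose traffic falls below min_traffic, then taking the connected
    components of what remains. edges: {(src, dst): traffic}."""
    adj = defaultdict(set)
    nodes = set()
    for (src, dst), traffic in edges.items():
        nodes.update((src, dst))
        if traffic >= min_traffic:  # keep only high-interaction links
            adj[src].add(dst)
            adj[dst].add(src)
    clusters, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # BFS over the retained (strong) edges
        component, frontier = set(), [start]
        while frontier:
            node = frontier.pop()
            if node in component:
                continue
            component.add(node)
            frontier.extend(adj[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Toy call graph: request counts per directed edge.
edges = {
    ("frontend", "cart"): 900,
    ("cart", "payments"): 700,
    ("frontend", "search"): 50,  # weak cross-cluster link: cut
    ("search", "index"): 800,
}
print(cluster_by_traffic(edges, min_traffic=100))
# [{'cart', 'frontend', 'payments'}, {'index', 'search'}]
```

The authors use a modularity-style community objective rather than a plain threshold, but the effect is the same: each GNN only ever sees one sub-graph.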
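At inference time (steps 3 and 5), the cascade amounts to a coarse-then-fine dispatch: score the clusters, select the most suspicious one, and run the fine-grained stage only on its members. The sketch below stands in for both GNN stages with plain anomaly-score aggregates; the scores, cluster assignments, and fault labels are illustrative assumptions, not the paper's models:

```python
def diagnose(clusters, node_scores, classify):
    """Two-stage cascade: pick the cluster with the highest mean anomaly
    score (stand-in for the Stage-1 localization GNN), then the worst
    node inside it (stand-in for Stage 2), and classify its fault."""
    suspect = max(clusters,
                  key=lambda c: sum(node_scores[n] for n in c) / len(c))
    culprit = max(suspect, key=lambda n: node_scores[n])
    return culprit, classify(culprit)

# Illustrative anomaly scores and a trivial fault-type rule.
scores = {"frontend": 0.1, "cart": 0.2, "payments": 0.9,
          "search": 0.1, "index": 0.2}
clusters = [{"frontend", "cart", "payments"}, {"search", "index"}]
fault_type = lambda svc: "cpu_saturation" if svc == "payments" else "unknown"
print(diagnose(clusters, scores, fault_type))
# ('payments', 'cpu_saturation')
```

The latency win comes from the dispatch structure itself: Stage 2 never touches nodes outside the selected cluster, so its cost depends on cluster size, not total graph size.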

Results & Findings

| Metric | Centralized GNN | Cascaded GNN (Proposed) |
| --- | --- | --- |
| Diagnostic Accuracy (localization + type) | 93.2 % | 92.5 % |
| Inference Latency (average) | 1.8 s (10k‑node graph) | 0.21 s (10k‑node graph) |
| Latency Growth (10k → 50k nodes) | ↑ 3.9× | ≈ 1.1× (near‑constant) |
| Memory Footprint | 4.2 GB | 0.9 GB |
  • Accuracy trade‑off is minimal (a drop of under one percentage point) despite a nearly 90 % reduction in average inference latency.
  • The cascaded approach maintains stable latency as graph size scales, confirming the effectiveness of the clustering step.
  • Experiments on the iAnomaly‑generated datasets (up to 100 k services) show the same trend, indicating robustness to diverse workload patterns.

Practical Implications

  • Edge‑native AIOps: Operators can embed the model directly on edge gateways or micro‑gateway devices, allowing instant fault detection without shipping logs to a central cloud.
  • Reduced Bandwidth & Cost: Since only the relevant cluster’s data is processed locally, the amount of telemetry that needs to be streamed upstream is dramatically lowered.
  • Faster Remediation: Sub‑second diagnosis enables automated remediation loops (e.g., container restart, traffic rerouting) that meet strict Service‑Level Objectives (SLOs) for latency‑sensitive IoT applications.
  • Scalable Monitoring Platforms: Existing observability stacks (Prometheus, OpenTelemetry) can feed the same metrics into the cascaded GNN, extending their capabilities from alerting to root‑cause intelligence without a major architectural overhaul.
  • Vendor‑agnostic Deployment: The clustering algorithm works on any service mesh (Istio, Linkerd, Consul) because it only needs communication metadata, making the solution portable across cloud‑edge hybrid deployments.
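Wiring an existing observability stack into such a model mostly means flattening each service's metrics into a fixed-order feature vector for its graph node. A minimal sketch follows; the metric names echo what Prometheus/OpenTelemetry exporters typically emit, but the exact ordering and defaults are assumptions:

```python
# Fixed feature ordering so every node gets a vector of the same shape.
FEATURES = ("latency_p99_ms", "cpu_pct", "error_rate")

def to_feature_vector(metrics, features=FEATURES, default=0.0):
    """Flatten one service's scraped metrics dict into an ordered
    numeric vector, filling gaps (missing scrapes) with a default."""
    return [float(metrics.get(name, default)) for name in features]

sample = {"latency_p99_ms": 240.0, "cpu_pct": 87.5}  # error_rate missing
print(to_feature_vector(sample))
# [240.0, 87.5, 0.0]
```

In practice these vectors would be batched into the node-feature matrix the GNN consumes, with edge features (request rates, RPC latencies) built the same way.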

Limitations & Future Work

  • Clustering Overhead: While lightweight, the initial clustering step still incurs a one‑time cost that may be non‑trivial for highly dynamic topologies; incremental clustering strategies are needed.
  • Label Dependency: The supervised training relies on accurately labeled fault instances, which can be scarce in production; semi‑supervised or self‑supervised extensions could broaden applicability.
  • Fault Propagation Modeling: The current design assumes faults manifest primarily within a single cluster; multi‑cluster cascading failures may require deeper hierarchical GNNs.
  • Hardware Heterogeneity: Evaluation was performed on x86 edge servers; assessing performance on ARM‑based edge devices and GPUs/TPUs remains an open question.

The authors suggest exploring adaptive cascade depths, online learning to continuously incorporate new anomaly patterns, and integration with policy‑driven remediation engines as next steps.

Authors

  • Duneesha Fernando
  • Maria A. Rodriguez
  • Rajkumar Buyya

Paper Information

  • arXiv ID: 2603.01447v1
  • Categories: cs.DC
  • Published: March 2, 2026