[Paper] Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems?
Source: arXiv - 2604.26670v1
Overview
Microservice‑based applications generate a bewildering mix of logs, metrics, traces, and host‑level signals, making it hard to pinpoint the root cause when something goes wrong. The paper “Which Types of Heterogeneity Matter for Root Cause Localization in Microservice Systems?” digs into why existing diagnostic tools miss the mark: they treat all observability data and system components as if they were homogeneous. By systematically studying the different “flavors” of heterogeneity—both in the data itself and in the entities (services vs. hosts) that produce it—the authors design a new, more accurate fault‑localization framework called NexusRCL.
Key Contributions
- Comprehensive heterogeneity analysis – The authors break down heterogeneity into data‑level (logs, metrics, traces) and entity‑level (services, containers, VMs) dimensions, showing how each influences fault propagation.
- Empirical evidence of asymmetric cross‑layer propagation – Experiments on two real‑world microservice benchmarks reveal that failures often travel from services to hosts (or vice versa) in a highly directional manner.
- NexusRCL framework – A semi‑supervised, heterogeneous‑graph‑based model that treats services and hosts as distinct node types and captures their asymmetric dependencies.
- Event‑based abstraction layer – Converts raw observability streams into a unified “event” representation, preserving the richness of heterogeneous data while keeping the model tractable.
- Active learning for low labeling cost – The system queries the most informative instances for manual annotation, dramatically reducing the amount of labeled data needed.
- Strong empirical gains – On two industrial benchmark datasets, NexusRCL improves Top‑1 root‑cause localization accuracy by up to 49.85% and average Top‑5 accuracy by 32.70% over the best prior methods.
Methodology
- Heterogeneity Taxonomy – Catalog the observable signals (metrics, logs, traces) and the system entities that generate them (microservices, containers, VMs, physical hosts).
- Fault Propagation Study – Using injected faults in two benchmark microservice suites, trace how anomalies spread across layers, quantifying the asymmetry of these flows.
- Graph Construction – Build a heterogeneous graph where nodes are either service or host entities. Edges encode observed dependencies (e.g., a service calling another service, a service running on a host); see the graph‑construction sketch after this list.
- Event‑Based Feature Extraction – Aggregate raw time‑series data into discrete “events” (e.g., a spike in CPU usage, an error log entry). Each event is attached to the appropriate node type; see the event‑extraction sketch below.
- Semi‑Supervised Learning – Train a graph neural network (GNN) on a small set of labeled fault instances. The model learns to propagate fault signals through the heterogeneous graph, respecting the asymmetric edge weights; a training sketch follows this list.
- Active Learning Loop – Identify the most uncertain nodes (those that would most improve the model if labeled) and ask a human operator to annotate them, iterating until performance plateaus; an uncertainty‑sampling sketch appears below.
All steps are designed to be implementable with open‑source GNN libraries (e.g., PyTorch Geometric) and standard observability pipelines (Prometheus, OpenTelemetry).
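To make the pipeline concrete, here is a minimal sketch of the heterogeneous graph in PyTorch Geometric's `HeteroData`. The node counts, feature dimensions, and edge‑type names are illustrative assumptions, not the paper's actual schema.

```python
import torch
from torch_geometric.data import HeteroData

# Sketch of a service/host graph (illustrative shapes and edge types).
data = HeteroData()

# Per-type node features, e.g., vectors of aggregated event counts.
data['service'].x = torch.randn(8, 16)   # 8 services, 16-dim features
data['host'].x = torch.randn(3, 16)      # 3 hosts, 16-dim features

# Directed edge types; keeping each direction as its own type lets the
# model learn asymmetric service<->host propagation.
data['service', 'calls', 'service'].edge_index = torch.tensor(
    [[0, 1, 2], [1, 2, 3]])                # service -> service calls
data['service', 'runs_on', 'host'].edge_index = torch.tensor(
    [[0, 1, 2, 3], [0, 0, 1, 2]])          # service -> host placement
data['host', 'hosts', 'service'].edge_index = torch.tensor(
    [[0, 0, 1, 2], [0, 1, 2, 3]])          # host -> service (reverse)

print(data)
```

Modeling the two directions of the service–host relation as separate edge types is one simple way to give each direction its own learned propagation weights.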
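Event extraction can be sketched as per‑signal anomaly detection over a rolling window: a raw metric series becomes a list of discrete event timestamps that are then attached to the owning node. The window size and z‑score threshold below are illustrative assumptions.

```python
import numpy as np

def extract_events(values, window=30, z_threshold=3.0):
    """Turn a raw metric series into discrete anomaly events.

    Flags index i when values[i] deviates from the rolling mean by more
    than z_threshold rolling standard deviations. Illustrative heuristic;
    the paper's event definition may differ.
    """
    values = np.asarray(values, dtype=float)
    events = []
    for i in range(window, len(values)):
        hist = values[i - window:i]
        mu = hist.mean()
        sigma = max(hist.std(), 1e-9)  # avoid division by zero on flat history
        if abs(values[i] - mu) > z_threshold * sigma:
            events.append(i)           # e.g., a CPU spike becomes one event
    return events

# A flat series with one injected spike yields exactly one event.
series = [1.0] * 60
series[45] = 25.0
print(extract_events(series))  # -> [45]
```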
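The semi‑supervised model can then be sketched as a heterogeneous GNN over that graph, with one convolution per edge type and a loss computed only on labeled nodes. `SAGEConv`, the layer sizes, and the mask below are assumptions for illustration, reusing the `data` object from the first sketch; this is not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import HeteroConv, SAGEConv

class HeteroRCLSketch(torch.nn.Module):
    """Illustrative heterogeneous GNN; not the authors' implementation."""
    def __init__(self, hidden=32, num_classes=2):
        super().__init__()
        # One convolution per edge type, so each direction of the
        # service<->host relation gets its own learned weights.
        self.conv = HeteroConv({
            ('service', 'calls', 'service'): SAGEConv((-1, -1), hidden),
            ('service', 'runs_on', 'host'): SAGEConv((-1, -1), hidden),
            ('host', 'hosts', 'service'): SAGEConv((-1, -1), hidden),
        }, aggr='sum')
        self.out = torch.nn.Linear(hidden, num_classes)  # root-cause score

    def forward(self, x_dict, edge_index_dict):
        h = self.conv(x_dict, edge_index_dict)
        return {ntype: self.out(F.relu(x)) for ntype, x in h.items()}

model = HeteroRCLSketch()
with torch.no_grad():                  # resolve the lazy (-1, -1) input sizes
    model(data.x_dict, data.edge_index_dict)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Semi-supervised step: the loss touches only the few labeled services.
labeled_mask = torch.tensor([True, True] + [False] * 6)
labels = torch.zeros(8, dtype=torch.long)
labels[0] = 1                          # service 0 labeled as root cause

optimizer.zero_grad()
logits = model(data.x_dict, data.edge_index_dict)
loss = F.cross_entropy(logits['service'][labeled_mask], labels[labeled_mask])
loss.backward()
optimizer.step()
```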
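Finally, the active‑learning loop queries the operator for the nodes whose predictions are least certain. Entropy‑based uncertainty sampling is a common heuristic and is an assumption here; the paper's actual acquisition function may differ.

```python
import torch

def select_for_labeling(logits, labeled_mask, budget=2):
    """Pick the `budget` unlabeled nodes with the highest predictive
    entropy. Any acquisition function that ranks informativeness could
    be substituted here."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    entropy[labeled_mask] = float('-inf')  # never re-query labeled nodes
    return torch.topk(entropy, k=budget).indices

# One iteration: query, have an operator annotate, retrain, and repeat
# until localization accuracy plateaus.
query = select_for_labeling(logits['service'], labeled_mask)
print(query)  # service nodes to send for manual annotation
```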
Results & Findings
| Metric | NexusRCL | Best prior method |
|---|---|---|
| Top‑1 accuracy (A@1) | up to +49.85% over baseline | baseline |
| Average Top‑5 accuracy (A@5) | +32.70% over baseline | baseline |
| Labeling effort | ~30% of incidents labeled (active learning) | 100% labeled |
- Cross‑layer dominance: Faults originating in services often manifest first as host‑level resource anomalies, and vice versa. Ignoring this leads to mis‑localization.
- Heterogeneous graph beats homogeneous models: Treating services and hosts as the same node type drops accuracy by ~15%, confirming the importance of the entity‑level distinction.
- Active learning cuts cost: With only a fraction of labeled incidents, NexusRCL reaches near‑optimal performance, making it practical for production environments where labeling is expensive.
Practical Implications
- Faster MTTR (Mean Time to Repair): By surfacing the true culprit (service or host) in the first few ranked candidates, ops teams can cut debugging time dramatically.
- Reduced observability storage: The event‑based abstraction means you don’t need to retain raw logs forever—only the distilled events needed for the graph.
- Vendor‑agnostic deployment: Since the framework only requires standard metrics, logs, and trace data, it can be layered on top of existing monitoring stacks (Prometheus, Jaeger, Elastic).
- Scalable to large fleets: Heterogeneous GNNs scale linearly with the number of nodes/edges, and the active learning loop keeps the training set small, so the approach works for thousands of microservices.
- Better capacity planning: Understanding asymmetric fault propagation helps architects design more resilient service‑to‑host mappings (e.g., avoiding “hot” hosts that amplify service failures).
Limitations & Future Work
- Benchmark scope: Evaluation uses two industrial microservice suites; results may vary in more heterogeneous environments (e.g., edge‑cloud hybrids).
- Static dependency graph: NexusRCL assumes a relatively stable service‑host topology; dynamic scaling (auto‑scaling groups) could require frequent graph updates.
- Label quality dependence: While active learning reduces quantity, the approach still needs accurate human annotations for the queried events.
- Future directions suggested by the authors include:
  - Extending the graph to capture network‑level entities (load balancers, service meshes).
  - Incorporating causal inference techniques to further refine asymmetric propagation models.
  - Evaluating the system in a continuous‑deployment pipeline where faults evolve over time.
Authors
- Runzhou Wang
- Shenglin Zhang
- Wenwei Gu
- Yongxin Zhao
- Chenyu Zhao
- Dan Pei
- Yuxuan Chen
- Yangyuxin Huang
Paper Information
- arXiv ID: 2604.26670v1
- Categories: cs.SE
- Published: April 29, 2026