[Paper] MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge
Source: arXiv - 2603.02032v1
Overview
Root‑cause analysis (RCA) in cloud‑native environments is notoriously hard because services are highly distributed, constantly evolving, and generate massive streams of telemetry. The paper MetaRCA proposes a framework that builds a reusable “meta” causal graph from a blend of large‑language‑model knowledge, historical incident reports, and live observability data. By separating the heavyweight offline knowledge construction from the lightweight online inference, MetaRCA delivers accurate, fast fault localization even as system complexity grows.
Key Contributions
- Meta Causal Graph (MCG): A metadata‑level, system‑agnostic knowledge base that captures causal relationships between services, components, and metrics.
- Evidence‑driven graph construction: An algorithm that fuses LLM‑generated hypotheses, past failure tickets, and real‑time monitoring data to automatically populate and continuously refine the MCG.
- Dynamic instantiation: At fault time, the MCG is pruned and weighted using the current context, turning a massive global graph into a compact, inference‑ready sub‑graph.
- Scalable online inference: The runtime step runs in near‑linear time with respect to the number of involved services, making it practical for large production clusters.
- Strong empirical results: On 311 real‑world failures (252 public, 59 production), MetaRCA outperforms the best prior RCA baseline by 29 percentage points (service level) and 48 percentage points (metric level), and retains >80 % accuracy when transferred to completely different system topologies.
Methodology
1. Offline Knowledge Mining
- LLM prompting: The authors query a large language model with system documentation and architectural diagrams to obtain candidate causal edges (e.g., “service A latency ↑ → downstream service B timeout”).
- Historical fault mining: Past incident tickets and logs are parsed to extract observed cause‑effect pairs, which are then validated against the LLM suggestions.
- Observability correlation: Time‑series of metrics (CPU, latency, error rates) are statistically analyzed to confirm or discard edges, yielding confidence scores.
- The result is the Meta Causal Graph, a directed graph whose nodes are metadata (service names, metric types) rather than concrete instances.
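The evidence‑fusion step above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the metric names, candidate edges, and the 0.5 confidence threshold are invented, and plain Pearson correlation stands in for whatever statistical test the authors use. LLM/ticket‑proposed edges survive into the Meta Causal Graph only if the corresponding time series are sufficiently correlated, and the correlation becomes the edge's confidence score.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def build_mcg(candidate_edges, metric_series, min_conf=0.5):
    """candidate_edges: [(cause_metric, effect_metric), ...] proposed by the
    LLM / ticket mining; metric_series: {metric_name: [values...]}.
    Returns {(cause, effect): confidence} for edges that survive validation."""
    mcg = {}
    for cause, effect in candidate_edges:
        r = abs(pearson(metric_series[cause], metric_series[effect]))
        if r >= min_conf:          # discard statistically unsupported edges
            mcg[(cause, effect)] = r
    return mcg

# Toy data: latency of service A tracks timeouts of service B; C's CPU does not.
series = {
    "A.latency":  [10, 12, 30, 55, 60, 58],
    "B.timeouts": [0, 1, 4, 9, 11, 10],
    "C.cpu":      [50, 49, 51, 50, 50, 49],
}
edges = [("A.latency", "B.timeouts"), ("C.cpu", "B.timeouts")]
mcg = build_mcg(edges, series)   # keeps the A→B edge, drops the C→B edge
```

Note that the resulting graph keys are metadata (metric names), not concrete pod or container instances, matching the MCG's system‑agnostic design.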
2. Online Fault Localization
- When an alarm fires, MetaRCA extracts the current context (affected services, recent metric anomalies).
- It instantiates a localized sub‑graph by selecting only nodes reachable from the observed anomalies.
- Real‑time metric values are used to weight edges (higher correlation → higher weight) and to prune low‑confidence links.
- A simple scoring function (e.g., weighted PageRank) ranks candidate root causes, and the top‑k are presented to operators.
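The online steps above can be sketched as follows. This is a simplified illustration under assumptions of our own (the metric names, edge weights, and damping factor are invented): starting from the anomalous nodes, we keep only nodes reachable along reversed causal edges (the candidate causes), then rank them with a weighted PageRank‑style score in which an effect passes score to its causes in proportion to edge weight.

```python
def candidate_causes(edges, anomalies):
    """edges: {(cause, effect): weight}. Walk effect -> cause from the observed
    anomalies to collect every node that could explain them."""
    frontier, seen = list(anomalies), set(anomalies)
    while frontier:
        node = frontier.pop()
        for (cause, effect) in edges:
            if effect == node and cause not in seen:
                seen.add(cause)
                frontier.append(cause)
    return seen

def rank_causes(edges, nodes, damping=0.85, iters=50):
    """Weighted PageRank on the reversed sub-graph: each effect distributes
    its score to its causes in proportion to edge weight; dangling nodes
    spread their score evenly."""
    sub = {(c, e): w for (c, e), w in edges.items() if c in nodes and e in nodes}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for effect in nodes:
            causes = [(c, w) for (c, e), w in sub.items() if e == effect]
            total = sum(w for _, w in causes)
            if not causes:
                for n in nodes:                     # dangling node
                    nxt[n] += damping * score[effect] / len(nodes)
            else:
                for c, w in causes:
                    nxt[c] += damping * score[effect] * w / total
        score = nxt
    return sorted(score.items(), key=lambda kv: -kv[1])

# Toy instantiated sub-graph: a database CPU spike propagates to svc_b's
# latency, which drives most of svc_a's errors; a weak cache-miss edge exists.
weighted = {
    ("db.cpu", "svc_b.latency"):      0.9,
    ("svc_b.latency", "svc_a.errors"): 0.8,
    ("cache.miss", "svc_a.errors"):    0.2,
}
nodes = candidate_causes(weighted, {"svc_a.errors"})
ranking = rank_causes(weighted, nodes)   # "db.cpu" should rank first
```

The top‑k entries of `ranking` are what would be surfaced to operators; the paper's actual scoring function may differ, but "weighted PageRank" is the example it names.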
3. Evaluation Pipeline
- The framework is tested on a mix of open‑source microservice benchmarks and a production Kubernetes cluster.
- Accuracy is measured at two granularity levels: (a) service‑level (did we identify the faulty service?) and (b) metric‑level (did we pinpoint the exact failing metric?).
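The two‑granularity scoring can be made concrete with a small sketch (the incident records below are invented for illustration): a prediction counts at service level when the service matches the ground truth, and at metric level only when both service and metric match.

```python
def accuracy(incidents):
    """incidents: [(true_service, true_metric, pred_service, pred_metric), ...]
    Returns (service-level accuracy, metric-level accuracy)."""
    n = len(incidents)
    svc_hits = sum(ts == ps for ts, _, ps, _ in incidents)
    met_hits = sum(ts == ps and tm == pm for ts, tm, ps, pm in incidents)
    return svc_hits / n, met_hits / n

cases = [
    ("checkout", "latency",    "checkout", "latency"),  # both correct
    ("payments", "error_rate", "payments", "cpu"),      # service only
    ("cart",     "cpu",        "frontend", "cpu"),      # both wrong
    ("search",   "memory",     "search",   "memory"),   # both correct
]
svc_acc, met_acc = accuracy(cases)   # 0.75 service-level, 0.5 metric-level
```

Metric‑level accuracy is strictly the harder target, which makes MetaRCA's larger gain at that granularity notable.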
Results & Findings
| Metric | Baseline (best prior) | MetaRCA |
|---|---|---|
| Service‑level accuracy | 58 % | 87 % (+29 pp) |
| Metric‑level accuracy | 42 % | 90 % (+48 pp) |
| Average inference latency | 1.8 s | 0.9 s (scales ≈ linearly with service count) |
| Cross‑system transfer accuracy | 62 % | >80 % |
- Scalability: As the number of services grew from 50 to 500, inference time increased roughly linearly, confirming the near‑linear claim.
- Robustness to topology changes: When the same MCG was applied to a different microservice layout (different dependency graph), accuracy dropped only modestly, demonstrating true generalization.
- Knowledge freshness: Periodic re‑mining (weekly) kept the MCG aligned with code changes, preventing drift.
Practical Implications
- Faster MTTR: Developers can receive precise root‑cause hints within seconds, cutting mean‑time‑to‑repair for cloud incidents.
- Reduced on‑call fatigue: Automated, high‑confidence suggestions lower the cognitive load on SRE teams during high‑severity outages.
- Portability: Because the MCG lives at the metadata level, the same knowledge base can be reused across multiple clusters, environments, or even different organizations with minimal re‑training.
- Integration‑friendly: MetaRCA’s online component only needs access to existing observability pipelines (Prometheus, OpenTelemetry) and can be wrapped as a microservice or a side‑car, fitting naturally into CI/CD and GitOps workflows.
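As a sketch of what that integration might look like, the snippet below builds a Prometheus instant‑query URL (the real `GET /api/v1/query` endpoint) and parses a response in Prometheus's documented vector format. The base URL, PromQL expression, and canned response are illustrative assumptions, not details from the paper, and the sample payload lets the parser run without a live server.

```python
import json
from urllib.parse import urlencode

def instant_query_url(base, promql):
    """Build a Prometheus instant-query URL (GET /api/v1/query)."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

def parse_vector(payload):
    """Extract (labels, float value) pairs from an instant-query response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        return []
    return [(r["metric"], float(r["value"][1]))
            for r in body["data"]["result"]]

# Hypothetical query an online RCA component might issue.
url = instant_query_url("http://prometheus:9090",
                        'rate(http_requests_total{job="checkout"}[5m])')

# Canned response in Prometheus's documented vector result format.
sample = '''{"status":"success","data":{"resultType":"vector",
  "result":[{"metric":{"job":"checkout"},"value":[1710000000,"42.5"]}]}}'''
rows = parse_vector(sample)   # [({"job": "checkout"}, 42.5)]
```

Because only read access to the metrics endpoint is needed, such a component can run as a separate microservice or side‑car without touching application code.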
- Cost‑effective scaling: Near‑linear inference means you can safely add more services without a proportional increase in RCA infrastructure.
Limitations & Future Work
- Dependence on LLM quality: The initial causal hypotheses rely on the LLM’s understanding of the system; poorly documented services can lead to missing edges.
- Knowledge update latency: While weekly re‑mining works for many environments, ultra‑fast release cycles may require more frequent updates or incremental learning.
- Metric diversity: The current evaluation focuses on standard performance metrics; extending to logs, traces, or business‑level KPIs could improve coverage.
- Explainability: The scoring mechanism is relatively simple; future work could explore richer probabilistic models to provide clearer confidence explanations to operators.
Overall, MetaRCA showcases how blending AI‑generated knowledge with traditional observability data can produce a scalable, generalizable RCA engine—an approach that many cloud‑native teams can start experimenting with today.
Authors
- Shuai Liang
- Pengfei Chen
- Bozhe Tian
- Gou Tan
- Maohong Xu
- Youjun Qu
- Yahui Zhao
- Yiduo Shang
- Chongkang Tan
Paper Information
- arXiv ID: 2603.02032v1
- Categories: cs.SE
- Published: March 2, 2026