[Paper] MetaRCA: A Generalizable Root Cause Analysis Framework for Cloud-Native Systems Powered by Meta Causal Knowledge
Source: arXiv - 2603.02032v1
Overview
Root‑cause analysis (RCA) in cloud‑native environments is notoriously hard because services are highly distributed, constantly evolving, and generate massive streams of telemetry. The paper MetaRCA proposes a framework that builds a reusable “meta” causal graph from a blend of large‑language‑model knowledge, historical incident reports, and live observability data. By separating the heavyweight offline knowledge construction from the lightweight online inference, MetaRCA delivers accurate, fast fault localization even as system complexity grows.
Key Contributions
- Meta Causal Graph (MCG): A metadata‑level, system‑agnostic knowledge base that captures causal relationships between services, components, and metrics.
- Evidence‑driven graph construction: An algorithm that fuses LLM‑generated hypotheses, past failure tickets, and real‑time monitoring data to automatically populate and continuously refine the MCG.
- Dynamic instantiation: At fault time, the MCG is pruned and weighted using the current context, turning a massive global graph into a compact, inference‑ready sub‑graph.
- Scalable online inference: The runtime step runs in near‑linear time with respect to the number of involved services, making it practical for large production clusters.
- Strong empirical results: On 311 real‑world failures (252 public, 59 production), MetaRCA outperforms the best prior RCA baseline by 29 percentage points (service level) and 48 percentage points (metric level), and retains >80 % accuracy when transferred to completely different system topologies.
Methodology
1. Offline Knowledge Mining
- LLM prompting: The authors query a large language model with system documentation and architectural diagrams to obtain candidate causal edges (e.g., “service A latency ↑ → downstream service B timeout”).
- Historical fault mining: Past incident tickets and logs are parsed to extract observed cause‑effect pairs, which are then validated against the LLM suggestions.
- Observability correlation: Time‑series of metrics (CPU, latency, error rates) are statistically analyzed to confirm or discard edges, yielding confidence scores.
- The result is the Meta Causal Graph, a directed graph whose nodes are metadata (service names, metric types) rather than concrete instances.
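The evidence‑fusion step above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the metric names, candidate edges, and the 0.5 confidence threshold are invented, and plain Pearson correlation stands in for whatever statistical test the authors use. LLM/ticket‑proposed edges survive into the Meta Causal Graph only if the corresponding time series are sufficiently correlated, and the correlation becomes the edge's confidence score.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def build_mcg(candidate_edges, metric_series, min_conf=0.5):
    """candidate_edges: [(cause_metric, effect_metric), ...] proposed by the
    LLM / ticket mining; metric_series: {metric_name: [values...]}.
    Returns {(cause, effect): confidence} for edges that survive validation."""
    mcg = {}
    for cause, effect in candidate_edges:
        r = abs(pearson(metric_series[cause], metric_series[effect]))
        if r >= min_conf:          # discard statistically unsupported edges
            mcg[(cause, effect)] = r
    return mcg

# Toy data: latency of service A tracks timeouts of service B; C's CPU does not.
series = {
    "A.latency":  [10, 12, 30, 55, 60, 58],
    "B.timeouts": [0, 1, 4, 9, 11, 10],
    "C.cpu":      [50, 49, 51, 50, 50, 49],
}
edges = [("A.latency", "B.timeouts"), ("C.cpu", "B.timeouts")]
mcg = build_mcg(edges, series)   # keeps the A→B edge, drops the C→B edge
```

Note that the resulting graph keys are metadata (metric names), not concrete pod or container instances, matching the MCG's system‑agnostic design.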
2. Online Fault Localization
- When an alarm fires, MetaRCA extracts the current context (affected services, recent metric anomalies).
- It instantiates a localized sub‑graph by selecting only nodes reachable from the observed anomalies.
- Real‑time metric values are used to weight edges (higher correlation → higher weight) and to prune low‑confidence links.
- A simple scoring function (e.g., weighted PageRank) ranks candidate root causes, and the top‑k are presented to operators.
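The online steps above can be sketched as follows. This is a simplified illustration under assumptions of our own (the metric names, edge weights, and damping factor are invented): starting from the anomalous nodes, we keep only nodes reachable along reversed causal edges (the candidate causes), then rank them with a weighted PageRank‑style score in which an effect passes score to its causes in proportion to edge weight.

```python
def candidate_causes(edges, anomalies):
    """edges: {(cause, effect): weight}. Walk effect -> cause from the observed
    anomalies to collect every node that could explain them."""
    frontier, seen = list(anomalies), set(anomalies)
    while frontier:
        node = frontier.pop()
        for (cause, effect) in edges:
            if effect == node and cause not in seen:
                seen.add(cause)
                frontier.append(cause)
    return seen

def rank_causes(edges, nodes, damping=0.85, iters=50):
    """Weighted PageRank on the reversed sub-graph: each effect distributes
    its score to its causes in proportion to edge weight; dangling nodes
    spread their score evenly."""
    sub = {(c, e): w for (c, e), w in edges.items() if c in nodes and e in nodes}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for effect in nodes:
            causes = [(c, w) for (c, e), w in sub.items() if e == effect]
            total = sum(w for _, w in causes)
            if not causes:
                for n in nodes:                     # dangling node
                    nxt[n] += damping * score[effect] / len(nodes)
            else:
                for c, w in causes:
                    nxt[c] += damping * score[effect] * w / total
        score = nxt
    return sorted(score.items(), key=lambda kv: -kv[1])

# Toy instantiated sub-graph: a database CPU spike propagates to svc_b's
# latency, which drives most of svc_a's errors; a weak cache-miss edge exists.
weighted = {
    ("db.cpu", "svc_b.latency"):      0.9,
    ("svc_b.latency", "svc_a.errors"): 0.8,
    ("cache.miss", "svc_a.errors"):    0.2,
}
nodes = candidate_causes(weighted, {"svc_a.errors"})
ranking = rank_causes(weighted, nodes)   # "db.cpu" should rank first
```

The top‑k entries of `ranking` are what would be surfaced to operators; the paper's actual scoring function may differ, but "weighted PageRank" is the example it names.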
3. Evaluation Pipeline
- The framework is tested on a mix of open‑source microservice benchmarks and a production Kubernetes cluster.
- Accuracy is measured at two granularity levels: (a) service‑level (did we identify the faulty service?) and (b) metric‑level (did we pinpoint the exact failing metric?).
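The two‑granularity scoring can be made concrete with a small sketch (the incident records below are invented for illustration): a prediction counts at service level when the service matches the ground truth, and at metric level only when both service and metric match.

```python
def accuracy(incidents):
    """incidents: [(true_service, true_metric, pred_service, pred_metric), ...]
    Returns (service-level accuracy, metric-level accuracy)."""
    n = len(incidents)
    svc_hits = sum(ts == ps for ts, _, ps, _ in incidents)
    met_hits = sum(ts == ps and tm == pm for ts, tm, ps, pm in incidents)
    return svc_hits / n, met_hits / n

cases = [
    ("checkout", "latency",    "checkout", "latency"),  # both correct
    ("payments", "error_rate", "payments", "cpu"),      # service only
    ("cart",     "cpu",        "frontend", "cpu"),      # both wrong
    ("search",   "memory",     "search",   "memory"),   # both correct
]
svc_acc, met_acc = accuracy(cases)   # 0.75 service-level, 0.5 metric-level
```

Metric‑level accuracy is strictly the harder target, which makes MetaRCA's larger gain at that granularity notable.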
Results & Findings
| Metric | Baseline (best prior) | MetaRCA |
|---|---|---|
| Service‑level accuracy | 58 % | 87 % (+29 pp) |
| Metric‑level accuracy | 42 % | 90 % (+48 pp) |
| Average inference latency | 1.8 s | 0.9 s (scales ≈ linearly with service count) |
| Cross‑system transfer accuracy | 62 % | >80 % |
- Scalability: As the number of services grew from 50 to 500, inference time increased roughly linearly, confirming the near‑linear claim.
- Robustness to topology changes: When the same MCG was applied to a different microservice layout (different dependency graph), accuracy dropped only modestly, demonstrating true generalization.
- Knowledge freshness: Periodic re‑mining (weekly) kept the MCG aligned with code changes, preventing drift.
Practical Implications
- Faster MTTR: Developers can receive precise root‑cause hints within seconds, cutting mean‑time‑to‑repair for cloud incidents.
- Reduced on‑call fatigue: Automated, high‑confidence suggestions lower the cognitive load on SRE teams during high‑severity outages.
- Portability: Because the MCG lives at the metadata level, the same knowledge base can be reused across multiple clusters, environments, or even different organizations with minimal re‑training.
- Integration‑friendly: MetaRCA’s online component only needs access to existing observability pipelines (Prometheus, OpenTelemetry) and can be wrapped as a microservice or a side‑car, fitting naturally into CI/CD and GitOps workflows.
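As a sketch of what that integration might look like, the snippet below builds a Prometheus instant‑query URL (the real `GET /api/v1/query` endpoint) and parses a response in Prometheus's documented vector format. The base URL, PromQL expression, and canned response are illustrative assumptions, not details from the paper, and the sample payload lets the parser run without a live server.

```python
import json
from urllib.parse import urlencode

def instant_query_url(base, promql):
    """Build a Prometheus instant-query URL (GET /api/v1/query)."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

def parse_vector(payload):
    """Extract (labels, float value) pairs from an instant-query response."""
    body = json.loads(payload)
    if body.get("status") != "success":
        return []
    return [(r["metric"], float(r["value"][1]))
            for r in body["data"]["result"]]

# Hypothetical query an online RCA component might issue.
url = instant_query_url("http://prometheus:9090",
                        'rate(http_requests_total{job="checkout"}[5m])')

# Canned response in Prometheus's documented vector result format.
sample = '''{"status":"success","data":{"resultType":"vector",
  "result":[{"metric":{"job":"checkout"},"value":[1710000000,"42.5"]}]}}'''
rows = parse_vector(sample)   # [({"job": "checkout"}, 42.5)]
```

Because only read access to the metrics endpoint is needed, such a component can run as a separate microservice or side‑car without touching application code.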
- Cost‑effective scaling: Near‑linear inference means you can safely add more services without a proportional increase in RCA infrastructure.
Limitations & Future Work
- Dependence on LLM quality: The initial causal hypotheses rely on the LLM’s understanding of the system; poorly documented services can lead to missing edges.
- Knowledge update latency: While weekly re‑mining works for many environments, ultra‑fast release cycles may require more frequent updates or incremental learning.
- Metric diversity: The current evaluation focuses on standard performance metrics; extending to logs, traces, or business‑level KPIs could improve coverage.
- Explainability: The scoring mechanism is relatively simple; future work could explore richer probabilistic models to provide clearer confidence explanations to operators.
Overall, MetaRCA showcases how blending AI‑generated knowledge with traditional observability data can produce a scalable, generalizable RCA engine—an approach that many cloud‑native teams can start experimenting with today.
Authors
- Shuai Liang
- Pengfei Chen
- Bozhe Tian
- Gou Tan
- Maohong Xu
- Youjun Qu
- Yahui Zhao
- Yiduo Shang
- Chongkang Tan
Paper Information
- arXiv ID: 2603.02032v1
- Categories: cs.SE
- Published: March 2, 2026