[Paper] Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads

Published: March 2, 2026 at 11:47 AM EST
5 min read
Source: arXiv - 2603.02057v1

Overview

Large Language Model (LLM) inference services are now a backbone of many consumer‑facing products, but the underlying GPU‑heavy stack is far more intricate than classic web or microservice deployments. This paper investigates whether existing automated root‑cause analysis (RCA) techniques—originally built for simpler services—can reliably pinpoint failures in a production‑grade, GPU‑driven LLM inference pipeline. By injecting faults in a controlled environment, the authors expose a gap between current RCA tools and the needs of modern AI workloads.

Key Contributions

  • Comprehensive empirical evaluation of 24 state‑of‑the‑art RCA methods (20 metric‑based, 2 trace‑based, 2 multi‑source) on a realistic LLM inference deployment.
  • Discovery that multi‑source approaches (combining metrics, traces, and logs) achieve the highest fault‑localization accuracy, while pure trace‑based methods largely fail.
  • Evidence of fault‑type dependency: metric‑only techniques work well for some failure classes (e.g., GPU memory pressure) but poorly for others (e.g., software library mismatches).
  • Guidelines for observability in AI‑centric stacks, including recommended metric families, tracing granularity, and logging practices tailored to GPU workloads.
  • Open‑source failure‑injection framework that can be reused by practitioners to benchmark their own RCA pipelines on LLM services.
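
To give a flavor of what such a failure-injection framework enables, here is a minimal, hypothetical sketch of one scenario from the study (GPU memory pressure), written with PyTorch. The function name and parameters are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of a single fault scenario: GPU memory pressure.
# Illustrative only; not the paper's actual injection framework.
import time

import torch


def inject_gpu_memory_pressure(fraction: float = 0.9, hold_seconds: float = 60.0) -> None:
    """Allocate ~`fraction` of free GPU memory and hold it, starving co-located inference."""
    assert torch.cuda.is_available(), "requires a CUDA device"
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    n_floats = int(free_bytes * fraction) // 4   # fp32 element = 4 bytes
    ballast = torch.empty(n_floats, dtype=torch.float32, device="cuda")
    time.sleep(hold_seconds)                     # fault window observed by the RCA tools
    del ballast
    torch.cuda.empty_cache()                     # end of scenario: release the pressure
```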

Methodology

  1. Testbed construction – The authors assembled a “best‑practice” LLM inference service: a front‑end API gateway, a request router, a GPU‑accelerated model server (e.g., TensorRT‑optimized transformer), and supporting data‑plane components (model store, cache, monitoring agents).
  2. Controlled fault injection – Using a custom injector, they introduced 12 distinct failure scenarios (hardware faults, driver bugs, resource exhaustion, configuration errors, etc.) at known points in the stack.
  3. RCA tool selection – They integrated 24 open‑source or commercial RCA solutions, categorizing them by the type of observability data they consume:
    • Metric‑based: rely on time‑series of CPU/GPU utilization, latency, error counters.
    • Trace‑based: depend on distributed request‑level tracing (e.g., OpenTelemetry).
    • Multi‑source: fuse metrics, traces, and log patterns.
  4. Evaluation metrics – For each injected fault, they measured three outcomes (a computation sketch follows this list):
    • Localization accuracy (whether the tool identified the correct component).
    • Mean time to diagnosis (MTTD): how long the tool took to surface the root cause.
    • False‑positive rate (spurious component alerts).
  5. Statistical analysis – Results were aggregated across multiple runs to account for stochastic noise in GPU scheduling and network latency.
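
For concreteness, below is a minimal sketch of how these three measurements could be computed from per-run records. The RunResult fields and the convention that tools emit a ranked suspect list are assumptions, not the paper's actual schema.

```python
# Minimal sketch of the per-fault measurements; field names are assumptions.
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    injected_component: str       # ground truth: where the fault was injected
    ranked_suspects: list[str]    # components flagged by the RCA tool, most suspicious first
    seconds_to_diagnosis: float   # wall-clock time until the tool surfaced its verdict


def evaluate(runs: list[RunResult]) -> dict[str, float]:
    scored = [r for r in runs if r.ranked_suspects]
    # Localization accuracy: did the top-ranked suspect match the injected component?
    accuracy = mean(r.ranked_suspects[0] == r.injected_component for r in scored)
    # MTTD: average time until the tool surfaced its verdict.
    mttd = mean(r.seconds_to_diagnosis for r in runs)
    # False-positive rate (simplified proxy): share of flagged components that were spurious.
    fp_rate = mean(
        sum(s != r.injected_component for s in r.ranked_suspects) / len(r.ranked_suspects)
        for r in scored
    )
    return {"localization_accuracy": accuracy, "mttd_seconds": mttd, "false_positive_rate": fp_rate}
```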

Results & Findings

RCA Category      | Avg. Localization Accuracy   | Typical MTTD           | Notable Strengths / Weaknesses
Metric‑based (20) | 62 % overall (range 45‑78 %) | 12 s – 45 s            | Works well for resource‑related faults (GPU memory, CPU throttling); struggles with software bugs such as library version mismatches.
Trace‑based (2)   | 21 % overall                 | > 60 s (often timeout) | Tracing granularity is too coarse for GPU kernels; many LLM requests collapse into a single “model‑serve” span, hiding internal failures.
Multi‑source (2)  | 87 % overall                 | 8 s – 20 s             | Combines fine‑grained GPU metrics with enriched logs, achieving both speed and precision.

Key Takeaways

  • No single existing RCA method can reliably handle the full spectrum of LLM‑specific failures.
  • Multi‑source fusion dramatically improves both accuracy and speed, confirming the intuition that GPU workloads need richer observability signals.
  • Trace‑only approaches, popular in microservice ecosystems, are insufficient because they lack visibility into low‑level hardware events and kernel‑level errors.

Practical Implications

  • For DevOps teams: Relying solely on distributed tracing dashboards (e.g., Jaeger, Zipkin) will leave you blind to many GPU‑related outages. Augment your stack with high‑frequency GPU telemetry (SM utilization, memory bandwidth, ECC errors) and structured logs from the model server; a telemetry sketch follows this list.
  • For SREs: Deploy RCA pipelines that ingest both time‑series metrics and log patterns; tools like Grafana Loki + Prometheus or commercial AIOps platforms that support multi‑modal data will give you the best odds of fast diagnosis.
  • For Cloud providers: Offer built‑in, low‑overhead GPU metrics APIs and pre‑instrumented model‑serve containers to make multi‑source RCA a first‑class service.
  • For developers of LLM inference frameworks: Expose fine‑grained health endpoints (e.g., per‑kernel latency histograms) and emit machine‑readable error codes that RCA engines can correlate with higher‑level request traces; a histogram sketch follows the telemetry example below.
  • Cost impact: Reducing mean time to diagnosis from ~30 s (metric‑only) to < 10 s (multi‑source) can cut outage‑related revenue loss by an order of magnitude for latency‑sensitive applications (search, real‑time assistants).
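
For the high-frequency GPU telemetry recommended above, a minimal sketch using the pynvml bindings to NVML is shown below. The 1 Hz poll rate, device index 0, and the exact metric selection are illustrative assumptions; tune them against the overhead budget of your deployment.

```python
# Minimal GPU telemetry sampler; poll rate and metric choice are assumptions.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)             # first GPU; adjust per device

while True:
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # SM / memory-controller busy %
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # used and total bytes on-device
    try:
        ecc = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC,
        )
    except pynvml.NVMLError:                              # ECC is not exposed on every GPU
        ecc = None
    print({"sm_util_pct": util.gpu, "mem_used_bytes": mem.used,
           "mem_total_bytes": mem.total, "ecc_uncorrected": ecc})
    time.sleep(1.0)                                       # ~1 Hz; balance overhead vs. resolution
```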
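And for the per‑kernel latency histograms suggested to framework developers, here is a minimal sketch with prometheus_client; the metric name, bucket edges, and port are assumptions, not a standard.

```python
# Minimal per-kernel latency histogram endpoint; name, buckets, and port are assumptions.
import time

from prometheus_client import Histogram, start_http_server

KERNEL_LATENCY = Histogram(
    "model_kernel_latency_seconds",
    "Per-kernel execution latency on the model server",
    ["kernel"],
    buckets=(0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5),
)


def record_kernel(kernel_name: str, seconds: float) -> None:
    """Record one kernel execution so an RCA engine can correlate it with request traces."""
    KERNEL_LATENCY.labels(kernel=kernel_name).observe(seconds)


if __name__ == "__main__":
    start_http_server(9400)                  # serve /metrics for an RCA pipeline to scrape
    record_kernel("attention_fwd", 0.0031)   # example observation
    time.sleep(3600)                         # keep the endpoint alive in this sketch
```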

Limitations & Future Work

  • Scope of failures: The study focuses on a single LLM architecture (Transformer inference on a single GPU node). Distributed multi‑GPU or multi‑node deployments may exhibit different failure propagation patterns.
  • Toolset selection: Only open‑source and a few commercial RCA solutions were evaluated; proprietary AIOps platforms could behave differently.
  • Observability overhead: Collecting high‑frequency GPU metrics can add CPU and network load; the paper does not quantify this trade‑off.
  • Future directions suggested by the authors include:
    • Extending the evaluation to multi‑node, pipeline‑parallel LLM serving.
    • Designing lightweight, GPU‑aware tracing standards (e.g., kernel‑level spans).
    • Automating the generation of multi‑source RCA models using machine‑learning techniques trained on injected‑fault datasets.

Bottom line: As LLM inference becomes a core service, the old playbook for root‑cause analysis—built for HTTP‑centric microservices—needs a serious upgrade. Multi‑source observability that bridges hardware telemetry, detailed logs, and request traces is the emerging best practice for keeping AI‑driven applications reliable and cost‑effective.

Authors

  • Dominik Scheinert
  • Alexander Acker
  • Thorsten Wittkopp
  • Soeren Becker
  • Hamza Yous
  • Karnakar Reddy
  • Ibrahim Farhat
  • Hakim Hacid
  • Odej Kao

Paper Information

  • arXiv ID: 2603.02057v1
  • Categories: cs.DC
  • Published: March 2, 2026
  • PDF: https://arxiv.org/pdf/2603.02057v1