[Paper] Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads

Published: March 2, 2026 at 11:47 AM EST
5 min read
Source: arXiv - 2603.02057v1

Overview

Large Language Model (LLM) inference services are now a backbone of many consumer‑facing products, but the underlying GPU‑heavy stack is far more intricate than classic web or microservice deployments. This paper investigates whether existing automated root‑cause analysis (RCA) techniques—originally built for simpler services—can reliably pinpoint failures in a production‑grade, GPU‑driven LLM inference pipeline. By injecting faults in a controlled environment, the authors expose a gap between current RCA tools and the needs of modern AI workloads.

Key Contributions

  • Comprehensive empirical evaluation of 24 state‑of‑the‑art RCA methods (20 metric‑based, 2 trace‑based, 2 multi‑source) on a realistic LLM inference deployment.
  • Discovery that multi‑source approaches (combining metrics, traces, and logs) achieve the highest fault‑localization accuracy, while pure trace‑based methods largely fail.
  • Evidence of fault‑type dependency: metric‑only techniques work well for some failure classes (e.g., GPU memory pressure) but poorly for others (e.g., software library mismatches).
  • Guidelines for observability in AI‑centric stacks, including recommended metric families, tracing granularity, and logging practices tailored to GPU workloads.
  • Open‑source failure‑injection framework that can be reused by practitioners to benchmark their own RCA pipelines on LLM services.
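
To give a flavor of what such a failure-injection framework enables, here is a minimal, hypothetical sketch of one scenario from the study (GPU memory pressure), written with PyTorch. The function name and parameters are illustrative assumptions, not the framework's actual API.

```python
# Hypothetical sketch of a single fault scenario: GPU memory pressure.
# Illustrative only; not the paper's actual injection framework.
import time

import torch


def inject_gpu_memory_pressure(fraction: float = 0.9, hold_seconds: float = 60.0) -> None:
    """Allocate ~`fraction` of free GPU memory and hold it, starving co-located inference."""
    assert torch.cuda.is_available(), "requires a CUDA device"
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    n_floats = int(free_bytes * fraction) // 4   # fp32 element = 4 bytes
    ballast = torch.empty(n_floats, dtype=torch.float32, device="cuda")
    time.sleep(hold_seconds)                     # fault window observed by the RCA tools
    del ballast
    torch.cuda.empty_cache()                     # end of scenario: release the pressure
```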

Methodology

  1. Testbed construction – The authors assembled a “best‑practice” LLM inference service: a front‑end API gateway, a request router, a GPU‑accelerated model server (e.g., TensorRT‑optimized transformer), and supporting data‑plane components (model store, cache, monitoring agents).
  2. Controlled fault injection – Using a custom injector, they introduced 12 distinct failure scenarios (hardware faults, driver bugs, resource exhaustion, configuration errors, etc.) at known points in the stack.
  3. RCA tool selection – They integrated 24 open‑source or commercial RCA solutions, categorizing them by the type of observability data they consume:
    • Metric‑based: rely on time‑series of CPU/GPU utilization, latency, error counters.
    • Trace‑based: depend on distributed request‑level tracing (e.g., OpenTelemetry).
    • Multi‑source: fuse metrics, traces, and log patterns.
  4. Evaluation metrics – For each injected fault, they measured three outcomes (a computation sketch follows this list):
    • Localization accuracy (whether the tool identified the correct component).
    • Mean time to diagnosis (MTTD): how long the tool took to surface the root cause.
    • False‑positive rate (spurious component alerts).
  5. Statistical analysis – Results were aggregated across multiple runs to account for stochastic noise in GPU scheduling and network latency.
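
For concreteness, below is a minimal sketch of how these three measurements could be computed from per-run records. The RunResult fields and the convention that tools emit a ranked suspect list are assumptions, not the paper's actual schema.

```python
# Minimal sketch of the per-fault measurements; field names are assumptions.
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    injected_component: str       # ground truth: where the fault was injected
    ranked_suspects: list[str]    # components flagged by the RCA tool, most suspicious first
    seconds_to_diagnosis: float   # wall-clock time until the tool surfaced its verdict


def evaluate(runs: list[RunResult]) -> dict[str, float]:
    scored = [r for r in runs if r.ranked_suspects]
    # Localization accuracy: did the top-ranked suspect match the injected component?
    accuracy = mean(r.ranked_suspects[0] == r.injected_component for r in scored)
    # MTTD: average time until the tool surfaced its verdict.
    mttd = mean(r.seconds_to_diagnosis for r in runs)
    # False-positive rate (simplified proxy): share of flagged components that were spurious.
    fp_rate = mean(
        sum(s != r.injected_component for s in r.ranked_suspects) / len(r.ranked_suspects)
        for r in scored
    )
    return {"localization_accuracy": accuracy, "mttd_seconds": mttd, "false_positive_rate": fp_rate}
```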

Results & Findings

RCA Category      | Avg. Localization Accuracy   | Typical MTTD           | Notable Strengths / Weaknesses
Metric‑based (20) | 62 % overall (range 45‑78 %) | 12 s – 45 s            | Works well for resource‑related faults (GPU memory, CPU throttling); struggles with software bugs such as library version mismatches.
Trace‑based (2)   | 21 % overall                 | > 60 s (often timeout) | Tracing granularity is too coarse for GPU kernels; many LLM requests collapse into a single “model‑serve” span, hiding internal failures.
Multi‑source (2)  | 87 % overall                 | 8 s – 20 s             | Combines fine‑grained GPU metrics with enriched logs, achieving both speed and precision.

Key Takeaways

  • No single existing RCA method can reliably handle the full spectrum of LLM‑specific failures.
  • Multi‑source fusion dramatically improves both accuracy and speed, confirming the intuition that GPU workloads need richer observability signals.
  • Trace‑only approaches, popular in microservice ecosystems, are insufficient because they lack visibility into low‑level hardware events and kernel‑level errors.

Practical Implications

  • For DevOps teams: Relying solely on distributed tracing dashboards (e.g., Jaeger, Zipkin) will leave you blind to many GPU‑related outages. Augment your stack with high‑frequency GPU telemetry (SM utilization, memory bandwidth, ECC errors) and structured logs from the model server; a telemetry sketch follows this list.
  • For SREs: Deploy RCA pipelines that ingest both time‑series metrics and log patterns; tools like Grafana Loki + Prometheus or commercial AIOps platforms that support multi‑modal data will give you the best odds of fast diagnosis.
  • For Cloud providers: Offer built‑in, low‑overhead GPU metrics APIs and pre‑instrumented model‑serve containers to make multi‑source RCA a first‑class service.
  • For developers of LLM inference frameworks: Expose fine‑grained health endpoints (e.g., per‑kernel latency histograms) and emit machine‑readable error codes that RCA engines can correlate with higher‑level request traces; a histogram sketch follows the telemetry example below.
  • Cost impact: Reducing mean time to diagnosis from ~30 s (metric‑only) to < 10 s (multi‑source) can cut outage‑related revenue loss by an order of magnitude for latency‑sensitive applications (search, real‑time assistants).
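
For the high-frequency GPU telemetry recommended above, a minimal sketch using the pynvml bindings to NVML is shown below. The 1 Hz poll rate, device index 0, and the exact metric selection are illustrative assumptions; tune them against the overhead budget of your deployment.

```python
# Minimal GPU telemetry sampler; poll rate and metric choice are assumptions.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)             # first GPU; adjust per device

while True:
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # SM / memory-controller busy %
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # used and total bytes on-device
    try:
        ecc = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC,
        )
    except pynvml.NVMLError:                              # ECC is not exposed on every GPU
        ecc = None
    print({"sm_util_pct": util.gpu, "mem_used_bytes": mem.used,
           "mem_total_bytes": mem.total, "ecc_uncorrected": ecc})
    time.sleep(1.0)                                       # ~1 Hz; balance overhead vs. resolution
```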
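And for the per‑kernel latency histograms suggested to framework developers, here is a minimal sketch with prometheus_client; the metric name, bucket edges, and port are assumptions, not a standard.

```python
# Minimal per-kernel latency histogram endpoint; name, buckets, and port are assumptions.
import time

from prometheus_client import Histogram, start_http_server

KERNEL_LATENCY = Histogram(
    "model_kernel_latency_seconds",
    "Per-kernel execution latency on the model server",
    ["kernel"],
    buckets=(0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5),
)


def record_kernel(kernel_name: str, seconds: float) -> None:
    """Record one kernel execution so an RCA engine can correlate it with request traces."""
    KERNEL_LATENCY.labels(kernel=kernel_name).observe(seconds)


if __name__ == "__main__":
    start_http_server(9400)                  # serve /metrics for an RCA pipeline to scrape
    record_kernel("attention_fwd", 0.0031)   # example observation
    time.sleep(3600)                         # keep the endpoint alive in this sketch
```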

Limitations & Future Work

  • Scope of failures: The study focuses on a single LLM architecture (Transformer inference on a single GPU node). Distributed multi‑GPU or multi‑node deployments may exhibit different failure propagation patterns.
  • Toolset selection: Only open‑source and a few commercial RCA solutions were evaluated; proprietary AIOps platforms could behave differently.
  • Observability overhead: Collecting high‑frequency GPU metrics can add CPU and network load; the paper does not quantify this trade‑off.
  • Future directions suggested by the authors include:
    • Extending the evaluation to multi‑node, pipeline‑parallel LLM serving.
    • Designing lightweight, GPU‑aware tracing standards (e.g., kernel‑level spans).
    • Automating the generation of multi‑source RCA models using machine‑learning techniques trained on injected‑fault datasets.

Bottom line: As LLM inference becomes a core service, the old playbook for root‑cause analysis—built for HTTP‑centric microservices—needs a serious upgrade. Multi‑source observability that bridges hardware telemetry, detailed logs, and request traces is the emerging best practice for keeping AI‑driven applications reliable and cost‑effective.

Authors

  • Dominik Scheinert
  • Alexander Acker
  • Thorsten Wittkopp
  • Soeren Becker
  • Hamza Yous
  • Karnakar Reddy
  • Ibrahim Farhat
  • Hakim Hacid
  • Odej Kao

Paper Information

  • arXiv ID: 2603.02057v1
  • Categories: cs.DC
  • Published: March 2, 2026
  • PDF: https://arxiv.org/pdf/2603.02057v1