[Paper] Ekka: Automated Diagnosis of Silent Errors in LLM Inference

Published: June 3, 2026 at 04:32 AM EDT

Source: arXiv - 2606.04594v1

Overview

LLM serving platforms are becoming increasingly sophisticated, but that complexity brings a hidden danger: silent errors, in which the model’s output quality degrades without raising any exception. The paper “Ekka: Automated Diagnosis of Silent Errors in LLM Inference” introduces a system that automatically pinpoints the root cause of such degradations, turning a painful, manual debugging process into a largely automated one.

Key Contributions

  • Differential debugging framework that treats a reference (known‑good) LLM serving stack as a “golden” baseline and compares it against the faulty target stack.
  • Ekka engine that aligns and contrasts intermediate execution states (tensor shapes, memory layouts, kernel calls, etc.) to surface the exact layer or optimization step responsible for the error.
  • Real‑world benchmark of silent errors collected from popular serving frameworks (e.g., TensorRT‑LLM, vLLM, DeepSpeed‑Inference).
  • High diagnostic accuracy: the correct root cause is ranked first in 80 % of cases (top‑1) and appears among the first five suggestions in 88 % of cases (top‑5), surpassing prior state‑of‑the‑art tools.
  • Discovery of new bugs: Ekka identified four previously unknown silent errors that were later confirmed and patched by the framework developers.

Methodology

  1. Reference Selection – Choose a stable, well‑tested serving implementation that produces correct outputs for a given prompt set.
  2. Instrumentation – Both the reference and the target stack are instrumented to emit a lightweight trace of internal states (e.g., tensor metadata, kernel launch parameters, cache hits/misses).
  3. State Alignment – Ekka maps corresponding steps across the two traces, handling variations in scheduling or parallelism by using semantic identifiers (layer IDs, token indices).
  4. Differential Analysis – For each aligned step, Ekka computes a similarity score across multiple dimensions (shape, datatype, memory address, timing). Large divergences are flagged as “suspicious.”
  5. Root‑Cause Ranking – Suspicious steps are ranked using a heuristic that combines the magnitude of divergence, the historical fault‑proneness of the component, and the impact on downstream layers.
  6. Developer Feedback Loop – The top‑k ranked hypotheses are presented to developers with visual diff reports, enabling quick validation or further investigation.
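
The alignment, differential analysis, and ranking steps above can be illustrated with a minimal sketch. This is not Ekka's implementation: the `Step` record, the three‑part divergence score, and the `fault_prior` weights are all hypothetical names chosen for illustration, and the sketch omits the memory‑address, timing, and downstream‑impact signals mentioned in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One traced execution step (hypothetical record layout)."""
    layer_id: str      # semantic identifier, e.g. "layer1.attn"
    token_index: int   # position of the token being processed
    shape: tuple       # tensor shape recorded at this step
    dtype: str         # tensor datatype recorded at this step
    mean_abs: float    # cheap numeric summary of the tensor values


def align(reference, target):
    """Pair steps by (layer_id, token_index) instead of trace position,
    so differences in scheduling order do not break the comparison."""
    index = {(s.layer_id, s.token_index): s for s in target}
    return [(r, index[(r.layer_id, r.token_index)])
            for r in reference
            if (r.layer_id, r.token_index) in index]


def divergence(ref, tgt):
    """Score in [0, 3]: up to one point each for shape, dtype, and values."""
    score = 0.0 if ref.shape == tgt.shape else 1.0
    score += 0.0 if ref.dtype == tgt.dtype else 1.0
    denom = max(abs(ref.mean_abs), 1e-12)  # avoid division by zero
    score += min(abs(ref.mean_abs - tgt.mean_abs) / denom, 1.0)
    return score


def rank_suspects(reference, target, fault_prior=None, threshold=0.1, k=5):
    """Return the top-k (weighted_score, layer_id, token_index) tuples."""
    fault_prior = fault_prior or {}  # assumed per-component weights
    scored = []
    for ref, tgt in align(reference, target):
        d = divergence(ref, tgt)
        if d > threshold:  # flag the step as suspicious
            weight = fault_prior.get(ref.layer_id, 1.0)
            scored.append((d * weight, ref.layer_id, ref.token_index))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Toy traces; the target is deliberately out of order and has one faulty step.
reference = [Step("layer0.mlp", 0, (1, 8), "float16", 0.50),
             Step("layer1.attn", 0, (1, 8), "float16", 0.40)]
target = [Step("layer1.attn", 0, (1, 8), "int8", 0.10),
          Step("layer0.mlp", 0, (1, 8), "float16", 0.50)]
suspects = rank_suspects(reference, target)
```

Keying the alignment on semantic identifiers is what lets the two traces be compared even though the target emitted its steps in a different order; only `layer1.attn` is flagged in this toy example.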

The approach is deliberately non‑intrusive: it works with existing serving pipelines and adds only modest overhead during diagnostic runs (a few percent of request latency).
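
One way to picture non‑intrusive instrumentation is a wrapper that records metadata around an existing function without changing its body. The sketch below is a generic Python decorator, not Ekka's tracer; `TRACE`, `traced`, and the toy `scale` function are hypothetical names used only for illustration.

```python
import functools
import time

TRACE = []  # hypothetical in-memory trace buffer


def traced(step_name):
    """Wrap a function so each call appends a metadata record to TRACE."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            # Record only lightweight metadata, not the full output.
            TRACE.append({
                "step": step_name,
                "output_len": len(result),
                "output_type": type(result).__name__,
                "seconds": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced("toy_layer.scale")
def scale(values, factor=2.0):
    """Stand-in for a pipeline stage; the body is unchanged by tracing."""
    return [v * factor for v in values]


output = scale([1.0, 2.0, 3.0])
```

Because the wrapper returns the original result untouched, the traced pipeline behaves the same as the untraced one apart from the time spent recording metadata.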

Results & Findings

  • Benchmark Performance – On a suite of 120 silent‑error cases, Ekka’s top‑1 diagnosis hit rate was 80 %, and the correct root cause appeared in the top‑5 suggestions 88 % of the time.
  • Comparison to Baselines – Traditional log‑analysis tools and generic anomaly detectors achieved only ~45 % top‑1 accuracy, highlighting the advantage of state‑level differential debugging.
  • Real‑World Impact – The four newly discovered bugs spanned diverse components: a quantization‑aware kernel mis‑handling NaNs, an off‑by‑one error in KV‑cache eviction, a memory‑alignment bug in a custom CUDA kernel, and a race condition in async request batching. All were patched within weeks of reporting.
  • Overhead – Instrumentation added an average of 4.8 ms per inference request (≈ 3 % of typical 150 ms latency for 8‑k token generation), a trade‑off most production teams consider acceptable for a diagnostic run.
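
For readers unfamiliar with the top‑k metric used above, the following sketch shows how such a hit rate is typically computed. The case data here is invented toy data, not the paper's benchmark, and `top_k_accuracy` is a hypothetical helper name.

```python
def top_k_accuracy(ranked_guesses, true_causes, k):
    """Fraction of cases whose true cause appears in the first k guesses."""
    hits = sum(1 for guesses, truth in zip(ranked_guesses, true_causes)
               if truth in guesses[:k])
    return hits / len(true_causes)


# Toy data: five cases, each with a ranked list of suspected components.
ranked = [
    ["kv_cache", "quant"],
    ["quant", "batching"],
    ["batching", "kv_cache"],
    ["quant", "kv_cache"],
    ["kernel", "quant"],
]
truth = ["kv_cache", "batching", "batching", "kv_cache", "scheduler"]

top1 = top_k_accuracy(ranked, truth, k=1)  # 2 of 5 cases hit
top2 = top_k_accuracy(ranked, truth, k=2)  # 4 of 5 cases hit
```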

Practical Implications

  • Faster Incident Response – Ops teams can run Ekka on a failing service instance and receive a concise list of likely culprits within minutes, dramatically reducing mean‑time‑to‑resolution (MTTR).
  • Continuous Integration – Ekka can be integrated into CI pipelines to automatically compare a new build against the reference stack, catching silent regressions before they reach production.
  • Optimization Safety Net – When experimenting with aggressive quantization, kernel fusion, or custom CUDA kernels, developers can use Ekka to verify that performance gains haven’t introduced hidden quality drops.
  • Cross‑Framework Portability – Because Ekka works at the level of execution traces rather than source code, it can be applied to any LLM serving framework that supports basic instrumentation, making it a vendor‑agnostic diagnostic tool.
  • Developer Productivity – By surfacing the exact layer or kernel responsible for degradation, engineers spend less time hunting through logs and more time fixing the bug, leading to more stable LLM services overall.
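
The CI use case above could be triggered by a simple gate that compares a new build's outputs against the reference on a fixed prompt set and fails the build on any mismatch, at which point a deeper diagnosis would be run. This is a sketch under assumptions: `regression_gate` is a hypothetical helper, and exact string equality is assumed to be a meaningful check (for example, under deterministic decoding); real pipelines may need a tolerance‑based quality metric instead.

```python
def regression_gate(reference_outputs, candidate_outputs, max_mismatch_rate=0.0):
    """Return (exit_code, mismatched_prompts) for a fixed prompt set.

    Both arguments map prompt IDs to generated text. Exit code 1 means
    the mismatch rate exceeded the allowed threshold.
    """
    missing = set(reference_outputs) - set(candidate_outputs)
    if missing:
        raise ValueError(f"candidate is missing prompts: {sorted(missing)}")
    mismatched = sorted(prompt for prompt, out in reference_outputs.items()
                        if candidate_outputs[prompt] != out)
    rate = len(mismatched) / len(reference_outputs)
    return (1 if rate > max_mismatch_rate else 0), mismatched


# Toy outputs keyed by hypothetical prompt IDs.
reference_outputs = {"p1": "Paris", "p2": "4"}
candidate_outputs = {"p1": "Paris", "p2": "5"}
exit_code, mismatched = regression_gate(reference_outputs, candidate_outputs)
```

The gate only detects that a silent regression exists; localizing its cause is the part a differential debugger such as Ekka is meant to handle.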

Limitations & Future Work

  • Reference Dependency – Ekka’s accuracy hinges on having a trustworthy reference implementation; if the reference itself contains bugs, false positives may arise.
  • Scalability to Massive Models – For models exceeding 100 B parameters, the volume of trace data can become prohibitive; the authors suggest hierarchical sampling as a mitigation.
  • Dynamic Prompt Variability – The current system assumes a fixed prompt set for alignment; handling highly dynamic or user‑generated prompts remains an open challenge.
  • Extending Beyond Inference – Future work could adapt Ekka’s differential approach to training pipelines, where silent errors (e.g., gradient drift) are also costly.

Overall, Ekka is a pragmatic step toward more reliable LLM serving, giving developers an automated way to localize silent, hard‑to‑detect inference bugs.

Authors

  • Yile Gu
  • Zhen Zhang
  • Shaowei Zhu
  • Xinwei Fu
  • Jun Wu
  • Yida Wang
  • Baris Kasikci

Paper Information

  • arXiv ID: 2606.04594v1
  • Categories: cs.DC, cs.AI, cs.SE
  • Published: June 3, 2026