[Paper] Ekka: Automated Diagnosis of Silent Errors in LLM Inference

Published: June 3, 2026 at 04:32 AM EDT

Source: arXiv - 2606.04594v1

Overview

LLM serving platforms are becoming increasingly sophisticated, but that complexity brings a hidden danger: silent errors, where the model’s output quality degrades without raising any exception. The paper Ekka: Automated Diagnosis of Silent Errors in LLM Inference introduces a system that automatically pinpoints the root cause of such degradations, turning a slow, manual debugging process into a largely automated one.

Key Contributions

Differential debugging framework that treats a reference (known‑good) LLM serving stack as a “golden” baseline and compares it against the faulty target stack.
Ekka engine that aligns and contrasts intermediate execution states (tensor shapes, memory layouts, kernel calls, etc.) to surface the exact layer or optimization step responsible for the error.
Real‑world benchmark of silent errors collected from popular serving frameworks (e.g., TensorRT‑LLM, vLLM, DeepSpeed‑Inference).
High diagnostic accuracy: 80 % top‑1 (the correct root cause is the first suggestion) and 88 % top‑5 (it appears among the first five suggestions), surpassing prior state‑of‑the‑art tools.
Discovery of new bugs: Ekka identified four previously unknown silent errors that were later confirmed and patched by the framework developers.
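
The top‑1 and top‑5 figures above are hit rates over a set of diagnosed cases. As a minimal sketch of how such a metric is typically computed (the function name, component labels, and toy data below are hypothetical and not taken from the paper):

```python
from typing import Sequence


def top_k_hit_rate(
    rankings: Sequence[Sequence[str]], truths: Sequence[str], k: int
) -> float:
    """Fraction of cases whose true root cause is among the first k suggestions."""
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not rankings:
        return 0.0
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(rankings)


# Hypothetical toy data: one ranked suggestion list per silent-error case.
rankings = [
    ["kv_cache_eviction", "attention_kernel", "sampler"],
    ["quant_kernel", "kv_cache_eviction", "batcher"],
    ["sampler", "batcher", "quant_kernel"],
    ["attention_kernel", "sampler", "batcher"],
]
# Hypothetical ground-truth root cause for each case.
truths = ["kv_cache_eviction", "kv_cache_eviction", "quant_kernel", "tokenizer"]

top1 = top_k_hit_rate(rankings, truths, k=1)
top3 = top_k_hit_rate(rankings, truths, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

In this toy set only the first case is solved by the first suggestion, while the second and third are solved within the first three, so the hit rate rises with k in the same way the reported top‑1 and top‑5 numbers do.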

Methodology

Reference Selection – Choose a stable, well‑tested serving implementation that produces correct outputs for a given prompt set.
Instrumentation – Both the reference and the target stack are instrumented to emit a lightweight trace of internal states (e.g., tensor metadata, kernel launch parameters, cache hits/misses).
State Alignment – Ekka maps corresponding steps across the two traces, handling variations in scheduling or parallelism by using semantic identifiers (layer IDs, token indices).
Differential Analysis – For each aligned step, Ekka computes a similarity score across multiple dimensions (shape, datatype, memory address, timing). Large divergences are flagged as “suspicious.”
Root‑Cause Ranking – Suspicious steps are ranked using a heuristic that combines the magnitude of divergence, the historical fault‑proneness of the component, and the impact on downstream layers.
Developer Feedback Loop – The top‑k ranked hypotheses are presented to developers with visual diff reports, enabling quick validation or further investigation.

The approach is deliberately non‑intrusive: it works with existing serving pipelines and adds only modest overhead, a latency increase of a few percent during diagnostic runs.
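
The alignment, differential‑analysis, and ranking steps described in this section can be illustrated with a small sketch. It is not Ekka's implementation. The `Step` record, the `checksum` field (a stand‑in for a numeric summary of tensor contents), the scoring formula, and the `fault_prior` weights are all assumptions made for illustration. The tie‑break on layer order is only a crude proxy for downstream impact.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (layer_id, token_index): semantic identifier for alignment


@dataclass(frozen=True)
class Step:
    layer_id: int
    token_index: int
    component: str           # hypothetical label, e.g. "attention" or "kv_cache"
    shape: Tuple[int, ...]
    dtype: str
    checksum: float          # assumed numeric summary of the tensor contents


def align(reference: List[Step], target: List[Step]) -> List[Tuple[Step, Step]]:
    """Pair steps by semantic key so that emission order does not matter."""
    by_key: Dict[Key, Step] = {(s.layer_id, s.token_index): s for s in target}
    pairs = []
    for ref in reference:
        tgt = by_key.get((ref.layer_id, ref.token_index))
        if tgt is not None:
            pairs.append((ref, tgt))
    return pairs


def divergence(ref: Step, tgt: Step) -> float:
    """Toy divergence: metadata mismatches plus a capped relative checksum gap."""
    score = 0.0
    if ref.shape != tgt.shape:
        score += 1.0
    if ref.dtype != tgt.dtype:
        score += 1.0
    denom = max(abs(ref.checksum), 1e-12)
    score += min(abs(ref.checksum - tgt.checksum) / denom, 1.0)
    return score


def rank_suspects(pairs, fault_prior: Dict[str, float], threshold: float = 0.05):
    """Flag steps above the threshold and sort them by prior-weighted divergence."""
    suspects = []
    for ref, tgt in pairs:
        d = divergence(ref, tgt)
        if d > threshold:
            weight = fault_prior.get(ref.component, 1.0)
            suspects.append((d * weight, ref.layer_id, ref.token_index, ref.component))
    # Highest score first; on ties, earlier layers first (assumed proxy for impact).
    suspects.sort(key=lambda s: (-s[0], s[1], s[2]))
    return suspects


# Hypothetical traces; the target emits its steps in a different order.
reference_trace = [
    Step(0, 0, "attention", (1, 8), "fp16", 1.00),
    Step(1, 0, "kv_cache", (1, 8), "fp16", 2.00),
    Step(2, 0, "mlp", (1, 8), "fp16", 3.00),
]
target_trace = [
    Step(2, 0, "mlp", (1, 8), "fp16", 3.60),
    Step(0, 0, "attention", (1, 8), "fp16", 1.00),
    Step(1, 0, "kv_cache", (1, 8), "fp16", 2.50),
]

pairs = align(reference_trace, target_trace)
suspects = rank_suspects(pairs, fault_prior={"kv_cache": 2.0})
for score, layer_id, token_index, component in suspects:
    print(f"layer {layer_id}, token {token_index}, {component}: score {score:.2f}")
```

Keying the alignment on semantic identifiers rather than on position in the trace is what lets the two stacks be compared even when scheduling or parallelism changes the order in which steps are recorded.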

Results & Findings

Benchmark Performance – On a suite of 120 silent‑error cases, Ekka’s top‑1 diagnosis hit rate was 80 %, and the correct root cause appeared in the top‑5 suggestions 88 % of the time.
Comparison to Baselines – Traditional log‑analysis tools and generic anomaly detectors achieved only ~45 % top‑1 accuracy, highlighting the advantage of state‑level differential debugging.
Real‑World Impact – The four newly discovered bugs spanned diverse components: a quantization‑aware kernel mis‑handling NaNs, an off‑by‑one error in KV‑cache eviction, a memory‑alignment bug in a custom CUDA kernel, and a race condition in async request batching. All were patched within weeks of reporting.
Overhead – Instrumentation added an average of 4.8 ms per inference request (≈ 3 % of a typical 150 ms latency for 8k‑token generation), a trade‑off most production teams consider acceptable for a diagnostic run.
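
The overhead percentage follows directly from the two latency figures quoted above. A short sketch of the arithmetic, with a hypothetical budget check added for illustration (the 5 % budget is an assumption, not a value from the paper):

```python
def relative_overhead(added_ms: float, baseline_ms: float) -> float:
    """Added latency expressed as a fraction of the baseline latency."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return added_ms / baseline_ms


def within_budget(added_ms: float, baseline_ms: float, max_fraction: float) -> bool:
    """True if the instrumentation overhead stays within the allowed fraction."""
    return relative_overhead(added_ms, baseline_ms) <= max_fraction


overhead = relative_overhead(added_ms=4.8, baseline_ms=150.0)
print(f"overhead: {overhead:.1%}")

# Hypothetical diagnostic-run budget of 5 % extra latency.
ok = within_budget(added_ms=4.8, baseline_ms=150.0, max_fraction=0.05)
```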

Practical Implications

Faster Incident Response – Ops teams can run Ekka on a failing service instance and receive a concise list of likely culprits within minutes, dramatically reducing mean‑time‑to‑resolution (MTTR).
Continuous Integration – Ekka can be integrated into CI pipelines to automatically compare a new build against the reference stack, catching silent regressions before they reach production.
Optimization Safety Net – When experimenting with aggressive quantization, kernel fusion, or custom CUDA kernels, developers can use Ekka to verify that performance gains haven’t introduced hidden quality drops.
Cross‑Framework Portability – Because Ekka works at the level of execution traces rather than source code, it can be applied to any LLM serving framework that supports basic instrumentation, making it a vendor‑agnostic diagnostic tool.
Developer Productivity – By surfacing the exact layer or kernel responsible for degradation, engineers spend less time hunting through logs and more time fixing the bug, leading to more stable LLM services overall.
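
The continuous‑integration use case can be pictured as a gate that replays a fixed prompt set through the reference stack and the new build, then flags prompts whose outputs diverge. The sketch below is an assumption about how such a gate might look, not part of Ekka. The `Generate` interface, the agreement metric, the 0.95 threshold, and both toy stacks are hypothetical, and exact token comparison presumes deterministic (for example, greedy) decoding.

```python
from typing import Callable, Dict, List, Sequence

Generate = Callable[[str], List[int]]  # hypothetical interface: prompt -> token IDs


def token_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Share of positions with identical tokens, relative to the longer output."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


def find_regressions(
    prompts: Sequence[str],
    reference: Generate,
    candidate: Generate,
    min_agreement: float = 0.95,
) -> Dict[str, float]:
    """Return the prompts whose candidate output falls below the agreement threshold."""
    regressions: Dict[str, float] = {}
    for prompt in prompts:
        score = token_agreement(reference(prompt), candidate(prompt))
        if score < min_agreement:
            regressions[prompt] = score
    return regressions


def reference_stack(prompt: str) -> List[int]:
    """Toy stand-in for the known-good stack: one 'token' per character."""
    return [ord(c) for c in prompt]


def candidate_stack(prompt: str) -> List[int]:
    """Toy stand-in for a new build that silently corrupts tokens after position 4."""
    tokens = reference_stack(prompt)
    if len(tokens) > 4:
        tokens[4:] = [0] * (len(tokens) - 4)
    return tokens


regressions = find_regressions(["hi", "hello world"], reference_stack, candidate_stack)
for prompt, score in regressions.items():
    print(f"regression on {prompt!r}: agreement {score:.2f}")
```

A gate like this only detects that a regression exists. The prompts it flags would then be the natural inputs for a diagnostic run that localizes the fault.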

Limitations & Future Work

Reference Dependency – Ekka’s accuracy hinges on having a trustworthy reference implementation; if the reference itself contains bugs, false positives may arise.
Scalability to Massive Models – For models exceeding 100 B parameters, the volume of trace data can become prohibitive; the authors suggest hierarchical sampling as a mitigation.
Dynamic Prompt Variability – The current system assumes a fixed prompt set for alignment; handling highly dynamic or user‑generated prompts remains an open challenge.
Extending Beyond Inference – Future work could adapt Ekka’s differential approach to training pipelines, where silent errors (e.g., gradient drift) are also costly.

Overall, Ekka represents a pragmatic step toward making LLM serving more reliable, giving developers a powerful ally in the battle against silent, hard‑to‑detect inference bugs.

Authors

Yile Gu
Zhen Zhang
Shaowei Zhu
Xinwei Fu
Jun Wu
Yida Wang
Baris Kasikci

Paper Information

arXiv ID: 2606.04594v1
Categories: cs.DC, cs.AI, cs.SE
Published: June 3, 2026