[Paper] LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
Source: arXiv - 2601.09258v1
Overview
LatencyPrism is a production‑grade system that lets operators monitor and “sculpt” the latency of large‑language‑model (LLM) inference pipelines without touching the running code or restarting services. By breaking down latency end‑to‑end, flagging anomalies in milliseconds, and keeping SLO (Service Level Objective) violations under control, it addresses a pain point that many AI‑driven products face: occasional latency spikes that ruin user experience even when average latency looks fine.
Key Contributions
- Zero‑intrusion latency monitoring that works across heterogeneous hardware (GPUs, TPUs, other XPUs) and software stacks, requiring no code changes or service restarts.
- Real‑time, batch‑level profiling with sub‑second alert latency, enabling operators to react to problems as they happen.
- Anomaly detection model that separates normal workload‑induced latency variation from genuine performance regressions, achieving an F1‑score of 0.98 on production data.
- Multi‑platform latency sculpting: the system can automatically throttle or re‑route requests to keep latency within SLO bounds.
- Extensive field deployment: validated on thousands of XPUs in production for more than six months, demonstrating stability and low overhead.
Methodology
- Instrumentation‑free data collection – LatencyPrism taps into existing telemetry (e.g., OS counters, XPU driver stats, network timestamps) via side‑channel hooks that do not interfere with the inference code path.
- Pipeline decomposition – The end‑to‑end request is split into logical stages (pre‑processing, token generation, post‑processing, etc.), and statistical models infer the contribution of each stage from the raw timestamps (see the first sketch after this list).
- Online anomaly detection – A lightweight streaming classifier (based on Gaussian mixture models with adaptive thresholds) continuously scores latency batches; when the score exceeds a dynamic bound, an alert is emitted (see the second sketch after this list).
- SLO‑aware throttling – Upon detection of a potential breach, the system can apply back‑pressure or reroute traffic to less‑loaded nodes, effectively “sculpting” the latency distribution to stay within the target percentile (see the third sketch after this list).
- Root‑cause assistance – Correlating anomaly signals with hardware utilization, queue lengths, and model‑specific metrics helps engineers pinpoint whether the spike originates from the model, the hardware, or the surrounding infrastructure.
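The paper does not include implementation details for the stage breakdown, so the following is only a minimal sketch of the idea: given raw timestamps pulled from telemetry, compute each stage's share of the tail latency per batch. The stage names (queueing, prefill, decode, post‑processing) and the timestamp fields are hypothetical stand‑ins, not LatencyPrism's actual schema.

```python
# Minimal sketch: derive per-stage latency contributions from raw timestamps.
# Stage names and timestamp fields are illustrative, not LatencyPrism's schema.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestTimestamps:
    enqueued: float        # request admitted to the serving queue
    prefill_start: float   # pre-processing / prompt encoding begins
    first_token: float     # first output token produced
    last_token: float      # generation finished
    responded: float       # post-processing done, response sent

STAGES = {
    "queueing":        ("enqueued", "prefill_start"),
    "prefill":         ("prefill_start", "first_token"),
    "decode":          ("first_token", "last_token"),
    "post_processing": ("last_token", "responded"),
}

def decompose(batch: list[RequestTimestamps]) -> dict[str, float]:
    """Return the approximate p99 contribution (in ms) of each stage for a batch."""
    breakdown = {}
    for stage, (start, end) in STAGES.items():
        durations_ms = [(getattr(r, end) - getattr(r, start)) * 1e3 for r in batch]
        # quantiles(..., n=100)[98] approximates the 99th percentile.
        breakdown[stage] = quantiles(durations_ms, n=100)[98]
    return breakdown
```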
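For the online detector, the paper states only that a Gaussian‑mixture model with adaptive thresholds scores latency batches. The sketch below shows one way such a streaming scorer could look, with scikit‑learn's GaussianMixture standing in for whatever lightweight implementation runs in production; the features, window sizes, and thresholding rule are assumptions.

```python
# Minimal sketch of a streaming GMM anomaly scorer with an adaptive threshold.
# Features, window sizes, and the thresholding rule are assumptions; the paper
# only states that a Gaussian-mixture model with adaptive bounds is used.
from collections import deque

import numpy as np
from sklearn.mixture import GaussianMixture

class BatchAnomalyDetector:
    def __init__(self, n_components: int = 3, window: int = 2048, k: float = 3.0):
        self.window = deque(maxlen=window)   # recent batch feature vectors
        self.scores = deque(maxlen=window)   # recent log-likelihood scores
        self.k = k                           # how far below normal counts as anomalous
        self.gmm = GaussianMixture(n_components=n_components)
        self._fitted = False
        self._seen = 0

    def _features(self, batch_latencies_ms, batch_size):
        # Assumed features: median latency, p99 latency, and batch size, so that
        # workload-driven growth (bigger batches) is modeled rather than flagged.
        return np.array([
            np.median(batch_latencies_ms),
            np.percentile(batch_latencies_ms, 99),
            float(batch_size),
        ])

    def observe(self, batch_latencies_ms, batch_size) -> bool:
        """Score one batch of request latencies; return True if it looks anomalous."""
        x = self._features(batch_latencies_ms, batch_size)
        anomalous = False
        if self._fitted:
            score = float(self.gmm.score_samples(x.reshape(1, -1))[0])
            if len(self.scores) > 50:
                # Adaptive bound: flag scores far below the recent score distribution.
                bound = np.mean(self.scores) - self.k * np.std(self.scores)
                anomalous = score < bound
            self.scores.append(score)
        self.window.append(x)
        self._seen += 1
        # Periodically refit on the sliding window of recent batches.
        if self._seen % 256 == 0 and len(self.window) >= 512:
            self.gmm.fit(np.stack(self.window))
            self._fitted = True
        return anomalous
```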
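The throttling step can be pictured as a per‑request admission decision. The sketch below is illustrative only: the node metadata, headroom factor, and least‑loaded reroute rule are assumptions rather than the paper's actual policy.

```python
# Illustrative sketch of an SLO-aware admission decision; thresholds, node
# metadata, and the reroute rule are assumptions, not the paper's policy.
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ADMIT = "admit"                # serve on the current node
    REROUTE = "reroute"            # send to a less-loaded peer
    BACKPRESSURE = "backpressure"  # delay admission so queues can drain

@dataclass
class Node:
    name: str
    queue_depth: int
    p99_ms: float                  # recent 99th-percentile latency on this node

def sculpt(local: Node, peers: list[Node],
           slo_p99_ms: float, headroom: float = 0.9) -> tuple[Action, Node]:
    """Keep tail latency under the SLO by rerouting or applying back-pressure."""
    budget = slo_p99_ms * headroom
    if local.p99_ms <= budget:
        return Action.ADMIT, local
    # Prefer the least-loaded peer that still has latency headroom.
    candidates = [n for n in peers if n.p99_ms <= budget]
    if candidates:
        return Action.REROUTE, min(candidates, key=lambda n: n.queue_depth)
    # No node has headroom: push back on the caller instead of violating the SLO.
    return Action.BACKPRESSURE, local
```

A real policy would presumably also consult the per‑stage breakdown, since queueing‑dominated spikes call for back‑pressure while compute‑dominated spikes favor rerouting or scaling out.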
Results & Findings
| Metric | Observation |
|---|---|
| Alert latency | Median of 12 ms from spike occurrence to alert generation. |
| Detection accuracy | F1‑score = 0.98 on a labeled dataset of 1.2 M inference batches (balanced mix of normal and anomalous runs). |
| Overhead | Added ≤ 1.5 % CPU and ≤ 0.8 % XPU utilization on average, with negligible impact on throughput. |
| SLO compliance | Reduced 99th‑percentile latency violations by 42 % across a fleet of 3,400 XPUs. |
| Root‑cause resolution time | Mean time to identify the underlying issue dropped from 45 min (pre‑LatencyPrism) to 7 min. |
The experiments also showed that LatencyPrism can distinguish between legitimate workload‑driven latency growth (e.g., larger batch sizes) and true anomalies (e.g., driver bugs, thermal throttling) with high confidence, enabling smarter auto‑scaling decisions.
Practical Implications
- Improved user experience: By catching and mitigating latency spikes before they hit end users, products that rely on LLMs (chatbots, code assistants, search augmentation) can maintain smoother interactions.
- Cost savings: Faster detection of hardware or software hiccups reduces wasted compute cycles and can prevent over‑provisioning of resources to meet SLOs.
- Simplified ops: Teams no longer need to embed custom profiling code or schedule downtime for instrumentation upgrades—LatencyPrism works out‑of‑the‑box on existing deployments.
- Portability: Because it is hardware‑agnostic, the same monitoring stack can be reused when migrating workloads between cloud providers or from on‑prem GPUs to specialized accelerators.
- Data‑driven scaling: The fine‑grained latency breakdown feeds autoscaling policies with richer signals, allowing more precise scaling of inference nodes and better utilization of spot instances (see the sketch below).
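As a hypothetical illustration of that last point (not an algorithm from the paper), a scaling policy could act on the per‑stage breakdown rather than a single end‑to‑end number; the rule and stage names below are assumptions.

```python
# Hypothetical sketch: feed the per-stage latency breakdown into a scaling policy.
# The scaling rule and stage names are assumptions, not taken from the paper.
def desired_replicas(current_replicas: int,
                     stage_p99_ms: dict[str, float],
                     slo_p99_ms: float) -> int:
    """Scale out in proportion to the SLO overshoot, but only when a
    compute-bound stage dominates; queueing spikes are better handled by
    throttling or rerouting than by adding replicas."""
    total_p99 = sum(stage_p99_ms.values())
    if total_p99 <= slo_p99_ms:
        return current_replicas                  # within SLO: hold steady
    dominant = max(stage_p99_ms, key=stage_p99_ms.get)
    overshoot = total_p99 / slo_p99_ms           # e.g. 1.3 -> 30 % over budget
    if dominant in ("prefill", "decode"):
        return max(current_replicas + 1, round(current_replicas * overshoot))
    return current_replicas
```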
Limitations & Future Work
- Scope limited to inference – The current design focuses on forward passes; training‑time profiling is not covered.
- Dependence on telemetry quality – In environments where low‑level counters are disabled or obscured (e.g., certain managed cloud services), the accuracy of stage decomposition may degrade.
- Model‑specific tuning – While the anomaly detector works well out‑of‑the‑box, highly irregular models (e.g., those with dynamic control flow) may require custom feature engineering.
- Future directions include extending LatencyPrism to support training pipelines, integrating reinforcement‑learning‑based auto‑tuning for throttling policies, and adding cross‑service correlation to detect cascading latency issues in multi‑service architectures.
Authors
- Du Yin
- Jiayi Ren
- Xiayu Sun
- Tianyao Zhou
- Haizhu Zhou
- Ruiyan Ma
- Danyang Zhang
Paper Information
- arXiv ID: 2601.09258v1
- Categories: cs.DC, cs.LG, cs.OS
- Published: January 14, 2026