[Paper] LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
Source: arXiv - 2601.09258v1
Overview
LatencyPrism is a production‑grade system that lets operators monitor and “sculpt” the latency of large‑language‑model (LLM) inference pipelines without touching the running code or restarting services. By breaking down latency end‑to‑end, flagging anomalies in milliseconds, and keeping SLO (Service Level Objective) violations under control, it addresses a pain point that many AI‑driven products face: occasional latency spikes that ruin user experience even when average latency looks fine.
Key Contributions
- Zero‑intrusion latency monitoring that works across heterogeneous hardware (GPUs, TPUs, other XPUs) and software stacks, requiring no code changes or service restarts.
- Real‑time, batch‑level profiling with sub‑second alert latency, enabling operators to react to problems as they happen.
- Anomaly detection model that separates normal workload‑induced latency variation from genuine performance regressions, achieving an F1‑score of 0.98 on production data.
- Multi‑platform latency sculpting: the system can automatically throttle or re‑route requests to keep latency within SLO bounds.
- Extensive field deployment: validated on thousands of XPUs in production for more than six months, demonstrating stability and low overhead.
Methodology
- Instrumentation‑free data collection – LatencyPrism taps into existing telemetry (e.g., OS counters, XPU driver stats, network timestamps) via side‑channel hooks that do not interfere with the inference code path.
- Pipeline decomposition – The end‑to‑end request is split into logical stages (pre‑processing, token generation, post‑processing, etc.), and statistical models infer the contribution of each stage from the raw timestamps (see the first sketch after this list).
- Online anomaly detection – A lightweight streaming classifier (based on Gaussian mixture models with adaptive thresholds) continuously scores latency batches; when the score exceeds a dynamic bound, an alert is emitted (see the second sketch after this list).
- SLO‑aware throttling – Upon detection of a potential breach, the system can apply back‑pressure or reroute traffic to less‑loaded nodes, effectively “sculpting” the latency distribution to stay within the target percentile (see the third sketch after this list).
- Root‑cause assistance – Correlating anomaly signals with hardware utilization, queue lengths, and model‑specific metrics helps engineers pinpoint whether the spike originates from the model, the hardware, or the surrounding infrastructure.
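The paper does not include implementation details for the stage breakdown, so the following is only a minimal sketch of the idea: given raw timestamps pulled from telemetry, compute each stage's share of the tail latency per batch. The stage names (queueing, prefill, decode, post‑processing) and the timestamp fields are hypothetical stand‑ins, not LatencyPrism's actual schema.

```python
# Minimal sketch: derive per-stage latency contributions from raw timestamps.
# Stage names and timestamp fields are illustrative, not LatencyPrism's schema.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestTimestamps:
    enqueued: float        # request admitted to the serving queue
    prefill_start: float   # pre-processing / prompt encoding begins
    first_token: float     # first output token produced
    last_token: float      # generation finished
    responded: float       # post-processing done, response sent

STAGES = {
    "queueing":        ("enqueued", "prefill_start"),
    "prefill":         ("prefill_start", "first_token"),
    "decode":          ("first_token", "last_token"),
    "post_processing": ("last_token", "responded"),
}

def decompose(batch: list[RequestTimestamps]) -> dict[str, float]:
    """Return the approximate p99 contribution (in ms) of each stage for a batch."""
    breakdown = {}
    for stage, (start, end) in STAGES.items():
        durations_ms = [(getattr(r, end) - getattr(r, start)) * 1e3 for r in batch]
        # quantiles(..., n=100)[98] approximates the 99th percentile.
        breakdown[stage] = quantiles(durations_ms, n=100)[98]
    return breakdown
```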
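For the online detector, the paper states only that a Gaussian‑mixture model with adaptive thresholds scores latency batches. The sketch below shows one way such a streaming scorer could look, with scikit‑learn's GaussianMixture standing in for whatever lightweight implementation runs in production; the features, window sizes, and thresholding rule are assumptions.

```python
# Minimal sketch of a streaming GMM anomaly scorer with an adaptive threshold.
# Features, window sizes, and the thresholding rule are assumptions; the paper
# only states that a Gaussian-mixture model with adaptive bounds is used.
from collections import deque

import numpy as np
from sklearn.mixture import GaussianMixture

class BatchAnomalyDetector:
    def __init__(self, n_components: int = 3, window: int = 2048, k: float = 3.0):
        self.window = deque(maxlen=window)   # recent batch feature vectors
        self.scores = deque(maxlen=window)   # recent log-likelihood scores
        self.k = k                           # how far below normal counts as anomalous
        self.gmm = GaussianMixture(n_components=n_components)
        self._fitted = False
        self._seen = 0

    def _features(self, batch_latencies_ms, batch_size):
        # Assumed features: median latency, p99 latency, and batch size, so that
        # workload-driven growth (bigger batches) is modeled rather than flagged.
        return np.array([
            np.median(batch_latencies_ms),
            np.percentile(batch_latencies_ms, 99),
            float(batch_size),
        ])

    def observe(self, batch_latencies_ms, batch_size) -> bool:
        """Score one batch of request latencies; return True if it looks anomalous."""
        x = self._features(batch_latencies_ms, batch_size)
        anomalous = False
        if self._fitted:
            score = float(self.gmm.score_samples(x.reshape(1, -1))[0])
            if len(self.scores) > 50:
                # Adaptive bound: flag scores far below the recent score distribution.
                bound = np.mean(self.scores) - self.k * np.std(self.scores)
                anomalous = score < bound
            self.scores.append(score)
        self.window.append(x)
        self._seen += 1
        # Periodically refit on the sliding window of recent batches.
        if self._seen % 256 == 0 and len(self.window) >= 512:
            self.gmm.fit(np.stack(self.window))
            self._fitted = True
        return anomalous
```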
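The throttling step can be pictured as a per‑request admission decision. The sketch below is illustrative only: the node metadata, headroom factor, and least‑loaded reroute rule are assumptions rather than the paper's actual policy.

```python
# Illustrative sketch of an SLO-aware admission decision; thresholds, node
# metadata, and the reroute rule are assumptions, not the paper's policy.
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ADMIT = "admit"                # serve on the current node
    REROUTE = "reroute"            # send to a less-loaded peer
    BACKPRESSURE = "backpressure"  # delay admission so queues can drain

@dataclass
class Node:
    name: str
    queue_depth: int
    p99_ms: float                  # recent 99th-percentile latency on this node

def sculpt(local: Node, peers: list[Node],
           slo_p99_ms: float, headroom: float = 0.9) -> tuple[Action, Node]:
    """Keep tail latency under the SLO by rerouting or applying back-pressure."""
    budget = slo_p99_ms * headroom
    if local.p99_ms <= budget:
        return Action.ADMIT, local
    # Prefer the least-loaded peer that still has latency headroom.
    candidates = [n for n in peers if n.p99_ms <= budget]
    if candidates:
        return Action.REROUTE, min(candidates, key=lambda n: n.queue_depth)
    # No node has headroom: push back on the caller instead of violating the SLO.
    return Action.BACKPRESSURE, local
```

A real policy would presumably also consult the per‑stage breakdown, since queueing‑dominated spikes call for back‑pressure while compute‑dominated spikes favor rerouting or scaling out.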
Results & Findings
| Metric | Observation |
|---|---|
| Alert latency | Median of 12 ms from spike occurrence to alert generation. |
| Detection accuracy | F1‑score = 0.98 on a labeled dataset of 1.2 M inference batches (balanced mix of normal and anomalous runs). |
| Overhead | Added ≤ 1.5 % CPU and ≤ 0.8 % XPU utilization on average, with negligible impact on throughput. |
| SLO compliance | Reduced 99th‑percentile latency violations by 42 % across a fleet of 3,400 XPUs. |
| Root‑cause resolution time | Mean time to identify the underlying issue dropped from 45 min (pre‑LatencyPrism) to 7 min. |
The experiments also showed that LatencyPrism can distinguish between legitimate workload‑driven latency growth (e.g., larger batch sizes) and true anomalies (e.g., driver bugs, thermal throttling) with high confidence, enabling smarter auto‑scaling decisions.
Practical Implications
- Improved user experience: By catching and mitigating latency spikes before they hit end users, products that rely on LLMs (chatbots, code assistants, search augmentation) can maintain smoother interactions.
- Cost savings: Faster detection of hardware or software hiccups reduces wasted compute cycles and can prevent over‑provisioning of resources to meet SLOs.
- Simplified ops: Teams no longer need to embed custom profiling code or schedule downtime for instrumentation upgrades—LatencyPrism works out‑of‑the‑box on existing deployments.
- Portability: Because it is hardware‑agnostic, the same monitoring stack can be reused when migrating workloads between cloud providers or from on‑prem GPUs to specialized accelerators.
- Data‑driven scaling: The fine‑grained latency breakdown feeds autoscaling policies with richer signals, allowing more precise scaling of inference nodes and better utilization of spot instances (see the sketch below).
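As a hypothetical illustration of that last point (not an algorithm from the paper), a scaling policy could act on the per‑stage breakdown rather than a single end‑to‑end number; the rule and stage names below are assumptions.

```python
# Hypothetical sketch: feed the per-stage latency breakdown into a scaling policy.
# The scaling rule and stage names are assumptions, not taken from the paper.
def desired_replicas(current_replicas: int,
                     stage_p99_ms: dict[str, float],
                     slo_p99_ms: float) -> int:
    """Scale out in proportion to the SLO overshoot, but only when a
    compute-bound stage dominates; queueing spikes are better handled by
    throttling or rerouting than by adding replicas."""
    total_p99 = sum(stage_p99_ms.values())
    if total_p99 <= slo_p99_ms:
        return current_replicas                  # within SLO: hold steady
    dominant = max(stage_p99_ms, key=stage_p99_ms.get)
    overshoot = total_p99 / slo_p99_ms           # e.g. 1.3 -> 30 % over budget
    if dominant in ("prefill", "decode"):
        return max(current_replicas + 1, round(current_replicas * overshoot))
    return current_replicas
```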
Limitations & Future Work
- Scope limited to inference – The current design focuses on forward passes; training‑time profiling is not covered.
- Dependence on telemetry quality – In environments where low‑level counters are disabled or obscured (e.g., certain managed cloud services), the accuracy of stage decomposition may degrade.
- Model‑specific tuning – While the anomaly detector works well out‑of‑the‑box, highly irregular models (e.g., those with dynamic control flow) may require custom feature engineering.
- Future directions include extending LatencyPrism to support training pipelines, integrating reinforcement‑learning‑based auto‑tuning for throttling policies, and adding cross‑service correlation to detect cascading latency issues in multi‑service architectures.
Authors
- Du Yin
- Jiayi Ren
- Xiayu Sun
- Tianyao Zhou
- Haizhu Zhou
- Ruiyan Ma
- Danyang Zhang
Paper Information
- arXiv ID: 2601.09258v1
- Categories: cs.DC, cs.LG, cs.OS
- Published: January 14, 2026