[Paper] Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model

Published: February 8, 2026, 04:05 AM EST
4 min read
Source: arXiv - 2602.07878v1

Overview

The paper uncovers a new class of latency‑Denial‑of‑Service (DoS) attacks that target the serving infrastructure of large language models (LLMs) rather than the models themselves. By exploiting how modern LLM servers schedule and cache token generation, the authors show that attackers can dramatically slow down inference for legitimate users—raising both cost and availability concerns for any service that offers real‑time LLM access.

Key Contributions

  • System‑level threat model: Demonstrates that classic algorithmic complexity attacks (e.g., prompting for extremely long outputs) are largely neutralized by contemporary serving tricks such as continuous batching.
  • Fill‑and‑Squeeze attack: Introduces a two‑phase strategy that (1) fills the global key‑value (KV) cache to trigger head‑of‑line blocking, then (2) squeezes the scheduler into repetitive pre‑emptions, causing severe latency spikes.
  • Black‑box feasibility: Shows the attack can be launched without insider knowledge, using only prompt engineering and lightweight side‑channel probing of memory usage.
  • Empirical validation: Reports up to 20‑280× slowdown in Time‑to‑First‑Token (TTFT) and 1.5‑4× slowdown in Time‑Per‑Output‑Token (TPOT) while costing 30‑40 % less than prior algorithmic attacks.
  • Practical guidelines: Provides a taxonomy of prompt patterns and cache‑exhaustion tactics that can be reused by defenders to benchmark and harden their own serving stacks.

Methodology

  1. Threat model definition – The attacker is an external client with only API access (no code injection, no privileged credentials).
  2. System analysis – The authors dissect popular open‑source LLM serving frameworks (e.g., vLLM, FasterTransformer) to identify shared components: a global KV cache, a scheduler that batches requests, and a pre‑emptive token‑generation loop.
  3. Attack design
    • Fill phase: Send a burst of specially crafted prompts that generate many intermediate tokens, deliberately saturating the KV cache. This forces the scheduler to queue subsequent requests behind the “full” request (head‑of‑line blocking).
    • Squeeze phase: Issue short, high‑frequency prompts that repeatedly pre‑empt the blocked request, causing the scheduler to constantly switch contexts and waste compute cycles.
  4. Side‑channel probing – Use timing measurements and observable memory‑usage APIs (e.g., GPU memory stats) to infer when the cache is near capacity, allowing the attacker to adapt the fill‑to‑squeeze ratio on the fly.
  5. Evaluation – Experiments run on multiple hardware setups (single‑GPU, multi‑GPU) and with different model sizes (7B‑30B) to quantify latency inflation and attack cost (number of tokens sent, API calls made).
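The adaptive probing in step 4 can be approximated with a client-side TTFT monitor. The sketch below is my own illustration, not the paper's code: it abstracts the actual API request behind a `send_probe` callable (a hypothetical stand-in for a short streaming request) and flags cache pressure when a probe's latency jumps well above its rolling median.

```python
import time
from collections import deque
from statistics import median

def ttft_pressure_signal(send_probe, history, threshold=3.0):
    """Time one short probe via `send_probe()` (a callable that blocks
    until the first token arrives) and compare it to the rolling median
    of recent probes. Returns (is_pressured, ttft_seconds)."""
    start = time.perf_counter()
    send_probe()  # e.g. a 1-token streaming completion request
    ttft = time.perf_counter() - start
    baseline = median(history) if history else ttft
    history.append(ttft)
    # A TTFT far above the rolling median suggests the server's KV cache
    # and scheduler are saturated (head-of-line blocking).
    return ttft > threshold * baseline, ttft

# Illustrative run with stand-in probes instead of a live endpoint:
history = deque(maxlen=20)
for _ in range(5):
    ttft_pressure_signal(lambda: time.sleep(0.01), history)  # quiet server
pressured, _ = ttft_pressure_signal(lambda: time.sleep(0.2), history)
print(pressured)  # → True
```

The same measurement loop serves defenders: a spike in TTFT variance without a matching rise in request volume is exactly the anomaly the paper's observability recommendations aim to surface.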

Results & Findings

| Metric | Baseline (no attack) | Prior algorithmic attack | Fill-and-Squeeze attack |
| --- | --- | --- | --- |
| TTFT slowdown | 1× (baseline) | 2-5× | 20-280× |
| TPOT slowdown | 1× (baseline) | 1.2-1.8× | 1.5-4× |
| Attack cost (tokens) | n/a | 100 % (full output length) | 60-70 % of baseline |
| Success across frameworks | n/a | Effective on older servers only | Works on vLLM, FasterTransformer, Triton |

Key takeaways

  • Continuous batching isolates long‑running requests, rendering pure output‑length attacks ineffective.
  • The KV cache is a shared bottleneck; once saturated, even unrelated short requests suffer.
  • Repeated pre‑emptions amplify the scheduler’s overhead, turning a modest cache fill into a massive latency explosion.
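These takeaways can be reproduced in a toy discrete-time model of a shared KV cache. This is my own simplification for illustration, not the paper's simulator: a single fill request that pins most cache blocks delays an unrelated short request for the fill's entire lifetime.

```python
# Toy model: a FCFS scheduler admits a request only if enough free
# KV-cache blocks exist; otherwise it waits (head-of-line blocking).
CACHE_BLOCKS = 100

def simulate(requests):
    """requests: list of (name, blocks_needed, decode_steps).
    Returns dict name -> time-to-first-token (the admission tick).
    One tick per decode step; blocks free when a request finishes."""
    free = CACHE_BLOCKS
    queue = list(requests)
    running = []  # [name, blocks, steps_left]
    ttft = {}
    t = 0
    while queue or running:
        # Admit from the head of the queue while blocks are available.
        while queue and queue[0][1] <= free:
            name, blocks, steps = queue.pop(0)
            free -= blocks
            ttft[name] = t
            running.append([name, blocks, steps])
        # One decode tick for every running request.
        t += 1
        for r in running:
            r[2] -= 1
        for r in [r for r in running if r[2] == 0]:
            free += r[1]
            running.remove(r)
    return ttft

# A "fill" request pins 95 of 100 blocks for 50 ticks; the short request
# behind it needs only 10 blocks yet still waits all 50 ticks.
print(simulate([("fill", 95, 50), ("victim", 10, 2)]))
# → {'fill': 0, 'victim': 50}
```

Running the victim alone gives a TTFT of 0, which is the whole point: the slowdown comes from shared-cache contention, not from anything about the victim's own request.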

Practical Implications

  • Cloud providers & SaaS platforms that expose LLM APIs must monitor KV‑cache utilization and enforce per‑client quotas on token generation per batch rather than per request.
  • Rate‑limiting policies need to consider aggregate token consumption across concurrent requests, not just request frequency.
  • Scheduler redesign: Introducing per‑client cache partitions or dynamic cache eviction policies can mitigate head‑of‑line blocking.
  • Observability tooling: Adding real‑time metrics for cache occupancy, pre‑emptive context switches, and TTFT variance can surface attacks early.
  • Cost management: Since latency directly translates to GPU time, a successful Fill‑and‑Squeeze attack can inflate operating expenses dramatically—potentially turning a “pay‑as‑you‑go” model into a liability.
  • Defensive prompt sanitization: Simple heuristics (e.g., limiting maximum token generation per prompt, detecting repetitive “fill” patterns) can blunt the attack without harming normal usage.
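The per-client quota idea above can be sketched as a token bucket keyed by client and denominated in generated tokens per second rather than requests per second. The class name and rate numbers here are illustrative assumptions, not from the paper:

```python
import time

class TokenQuota:
    """Per-client budget on generated tokens, not request count.
    A client may burst up to `burst` tokens, then is limited to
    `rate` tokens/second across ALL of its concurrent requests."""

    def __init__(self, rate=500.0, burst=4000.0):
        self.rate, self.burst = rate, burst
        self.buckets = {}  # client_id -> (tokens_left, last_refill_time)

    def allow(self, client_id, tokens_requested, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(client_id, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens_requested > tokens:
            return False  # reject (or queue) the request
        self.buckets[client_id] = (tokens - tokens_requested, now)
        return True

quota = TokenQuota(rate=100.0, burst=1000.0)
print(quota.allow("client-a", 800, now=0.0))   # → True  (within burst)
print(quota.allow("client-a", 500, now=0.0))   # → False (budget exhausted)
print(quota.allow("client-a", 500, now=10.0))  # → True  (refilled over 10 s)
```

A scheduler could call `allow()` before admitting each decode batch, so a client spreading work across many concurrent requests still draws from a single budget, which addresses the per-request-quota blind spot the paper identifies.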

Limitations & Future Work

  • The study focuses on open‑source serving stacks; proprietary systems may have additional mitigations or different bottlenecks.
  • Attack efficacy depends on the size of the global KV cache; extremely large caches could raise the cost threshold for attackers.
  • The side‑channel probing assumes the attacker can read memory‑usage statistics; some managed services hide these metrics.
  • Future research directions include: automated detection of cache‑exhaustion patterns, adaptive scheduler algorithms that prioritize fairness under load, and extending the threat model to multi‑tenant environments with heterogeneous model sizes.

Authors

  • Tianyi Wang
  • Huawei Fan
  • Yuanchao Shu
  • Peng Cheng
  • Cong Wang

Paper Information

  • arXiv ID: 2602.07878v1
  • Categories: cs.CR, cs.AI
  • Published: February 8, 2026
