Stop Queuing Inference Requests
Source: Dev.to
Problem Overview
Most inference backends degrade under burst.
This is not specific to LLMs. It applies to any constrained compute system:
- a single GPU
- a local model runner
- a CPU‑bound worker
- a tightly sized inference fleet
When demand spikes, most systems do one of two things:
- Accept everything and let requests accumulate internally.
- Rate‑limit arrival at the edge.
Both approaches hide the real problem.
- Queues grow.
- Latency stretches.
- Retries amplify pressure.
- Memory usage becomes unpredictable.
- Overload turns opaque.
You don’t see failure immediately; you see slow decay.
The Missing Boundary
There’s a difference between rate limiting and execution governance.
- Rate limiting controls how fast requests arrive.
- Execution governance controls how many requests are allowed to run.
Those are not the same. You can rate‑limit and still build an unbounded internal queue. If you don’t enforce a hard cap on concurrent execution, the backend becomes the queue, and queues under burst are silent liabilities.
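The distinction is easy to make concrete. A sketch (illustrative, not any particular tool): a rate limiter bounds how fast requests arrive, a concurrency cap bounds how many run at once. With a slow backend, the first still lets inflight work pile up; the second cannot.

```python
import threading
import time

class RateLimiter:
    """Bounds how fast requests ARRIVE; says nothing about how many run."""
    def __init__(self, per_second: float):
        self.interval = 1.0 / per_second
        self.next_ok = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            if now >= self.next_ok:
                self.next_ok = now + self.interval
                return True
            return False

class ConcurrencyCap:
    """Bounds how many requests RUN at once; a full cap yields, never queues."""
    def __init__(self, max_inflight: int):
        self.slots = threading.Semaphore(max_inflight)

    def try_enter(self) -> bool:
        # Non-blocking acquire: saturation is reported, not buffered.
        return self.slots.acquire(blocking=False)

    def exit(self) -> None:
        self.slots.release()
```

A request can pass the rate limiter and still find every execution slot taken; the two checks answer different questions.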
A Different Approach: Explicit Yield
Instead of buffering overload, convert it into an explicit response. When capacity is full:
- Do not queue.
- Do not block.
- Do not defer silently.
Return:

status = yield
retry_hint_ms = <suggested backoff>
The system remains bounded, the client decides when to retry, and overload becomes explicit instead of hidden.
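A minimal handler built on a non-blocking slot acquire might look like this (a sketch, not the article's tool; `run_inference`, the cap size, and the retry hint value are all placeholders):

```python
import json
import threading

MAX_INFLIGHT = 4     # hard cap on concurrent executions (illustrative value)
RETRY_HINT_MS = 250  # arbitrary backoff hint returned to yielded clients

slots = threading.Semaphore(MAX_INFLIGHT)

def handle(request, run_inference):
    """Admit the request or yield explicitly; never buffer it internally."""
    if not slots.acquire(blocking=False):
        # Saturated: overload becomes an explicit response the client can act on.
        return json.dumps({"status": "yield", "retry_hint_ms": RETRY_HINT_MS})
    try:
        return json.dumps({"status": "ok", "result": run_inference(request)})
    finally:
        slots.release()
```

Note there is no wait, no buffer, and no timeout tuning: the only two outcomes are "ran" and "yielded".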
What This Looks Like
Here’s a simple test:
- max_inflight = 1
- 20 concurrent clients
- backend execution time = 10 seconds
Observed state transitions:
t=44 inflight=1 executed_total=1 yielded_total=19
t=79 inflight=0 executed_total=1 yielded_total=19
Interpretation
- Inflight never exceeded 1.
- One request executed.
- Nineteen yielded immediately.
- No queue growth.
The system did not degrade and remained bounded.
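The experiment is easy to reproduce in miniature. In the sketch below, barriers stand in for the 10-second backend call so the outcome is deterministic: a burst of 20 clients hits a cap of max_inflight = 1, and the winner holds its slot until every client has attempted admission.

```python
import threading

max_inflight = 1
n_clients = 20
slots = threading.Semaphore(max_inflight)
burst = threading.Barrier(n_clients)      # release every client at once
attempted = threading.Barrier(n_clients)  # slot stays held until all have tried
results = []
lock = threading.Lock()

def client():
    burst.wait()
    admitted = slots.acquire(blocking=False)  # admit or yield, never queue
    attempted.wait()  # keep the slot occupied until the whole burst arrives
    if admitted:
        slots.release()
    with lock:
        results.append(admitted)

threads = [threading.Thread(target=client) for _ in range(n_clients)]
for t in threads:
    t.start()
for t in threads:
    t.join()

executed_total = sum(results)
print(f"executed_total={executed_total} yielded_total={n_clients - executed_total}")
# prints: executed_total=1 yielded_total=19
```

One client executes, nineteen yield, and nothing accumulates anywhere.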
Why This Matters for Inference Systems
Inference workloads are bursty; prompts don’t arrive in smooth curves but in clusters:
- user refresh storms
- retry loops
- concurrent UI events
- load balancer reshuffles
- autoscaler lag
If your backend silently buffers that burst, you inherit tail latency and memory consequences later. Bounding execution and yielding instead trades implicit instability for explicit back‑pressure—a trade that is almost always worth it.
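Explicit back-pressure is only useful if clients honor it. A sketch of a retry loop built around the yield response (the `send` transport and all names are illustrative; the response fields follow the shape described earlier):

```python
import json
import random
import time

def call_with_yield_retry(send, request, max_attempts=5):
    """Retry on yield, sleeping for the server's hint plus jitter.

    Jitter keeps a burst of yielded clients from re-arriving in lockstep
    and re-creating the spike they were just turned away from.
    """
    for _ in range(max_attempts):
        resp = json.loads(send(request))
        if resp["status"] != "yield":
            return resp
        hint_s = resp.get("retry_hint_ms", 100) / 1000.0
        time.sleep(hint_s * (0.5 + random.random()))
    raise RuntimeError("capacity still saturated after retries")
```

This is the half of the contract that blind retry loops break: the server says when to come back, and the client listens.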
What This Is Not
This is not:
- a scheduler
- a policy engine
- a fairness system
- a gateway
- a dashboard
- a distributed runtime
It is a narrow primitive: hard concurrency cap + explicit yield. Nothing more.
A Small Tool, Intentionally
I built a small ingress governor around this idea. It:
- accepts newline‑delimited JSON frames over TCP
- validates upload integrity
- enforces max_inflight
- returns yield immediately when saturated
- exposes minimal metrics (inflight, executed_total, yielded_total)
It does not inspect prompts, introspect models, count tokens, or apply policy. It simply governs execution slots—nothing else.
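A sketch of what such a per-frame admission path could look like (illustrative only, not the tool's actual code; the metric names come from the list above, while the frame schema and the default cap are assumptions):

```python
import json
import threading

MAX_INFLIGHT = 8  # assumed default; a real governor would make this configurable

slots = threading.Semaphore(MAX_INFLIGHT)
lock = threading.Lock()
metrics = {"inflight": 0, "executed_total": 0, "yielded_total": 0}

def handle_frame(line: str, execute) -> str:
    """Admit one newline-delimited JSON frame or yield; update counters."""
    frame = json.loads(line)  # upload-integrity checks would live here
    if not slots.acquire(blocking=False):
        with lock:
            metrics["yielded_total"] += 1
        return json.dumps({"status": "yield"}) + "\n"
    with lock:
        metrics["inflight"] += 1
    try:
        result = execute(frame)
        with lock:
            metrics["executed_total"] += 1
        return json.dumps({"status": "ok", "result": result}) + "\n"
    finally:
        with lock:
            metrics["inflight"] -= 1
        slots.release()
```

The frame's contents never matter to the admission decision; the governor only counts slots.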
Why Not Just Use Nginx?
Because rate limiting is not execution governance. You can limit requests per second and still allow an unbounded number of concurrent backend submissions. Bounded concurrency and explicit yield are different primitives; they can coexist but solve different problems.
The Core Idea
Stop treating overload as something to buffer. Treat it as something to expose. If capacity is full, say so. Return yield. Remain bounded.