Streaming LLM Tokens to 10K Concurrent Users
Source: Dev.to
What We’re Building
Let me show you the architecture that keeps 10,000 concurrent SSE connections alive while streaming LLM tokens — without melting your server. We’ll walk through coroutine‑per‑connection fan‑out, bounded channel buffers for backpressure, connection draining for zero‑downtime deploys, and the per‑connection memory math that determines your real ceiling on a 4 GB container.
Prerequisites
- Kotlin coroutines and Channel basics
- Familiarity with Server‑Sent Events (SSE)
- A Ktor or Netty‑based HTTP server
- Understanding of Kubernetes pod lifecycle (helpful, not required)
Step 1: Understand the Problem
LLM APIs emit tokens every 20–80 ms. When you proxy those tokens to thousands of users via SSE, every connection becomes a long‑lived coroutine holding an open HTTP response. One slow client that can’t consume fast enough bloats your buffers, and without backpressure, you’re one GC pause away from an OOM kill.
The naive approach — unbounded lists, no draining strategy, fire‑and‑forget writes — collapses around 2,000 connections. Below is the minimal setup to get this working at scale.
Step 2: Wire Up Bounded Channels for Fan‑Out
The core pattern is a bounded Channel per SSE connection, fed by a shared upstream coroutine consuming the LLM stream:
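import kotlinx.coroutines.channels.Channel

fun fanOut(clients: List<Channel<String>>, token: String) {
    clients.forEach { channel ->
        // Non-blocking send: fails immediately when this client's buffer is full
        if (channel.trySend(token).isFailure) {
            // Close the slow client's channel so its writer loop ends and it disconnects
            channel.close()
        }
    }
}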
Each client gets its own bounded channel (I recommend 32–128 slots). When a slow client fills its buffer, trySend fails immediately. No blocking the upstream, no cascading stalls.
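To see that behaviour in isolation, here is a minimal sketch, assuming kotlinx.coroutines is on the classpath. The helper name acceptedBySlowClient is hypothetical and only exists for this illustration: it offers tokens to a bounded channel that nobody is reading and counts how many fit.

```kotlin
import kotlinx.coroutines.channels.Channel

// Hypothetical helper: simulates a stalled client by never receiving from the channel.
// Returns how many of `count` tokens the bounded buffer accepted.
fun acceptedBySlowClient(capacity: Int, count: Int): Int {
    val channel = Channel<String>(capacity)
    var accepted = 0
    repeat(count) { i ->
        // trySend never suspends; it reports failure once the buffer is full
        if (channel.trySend("tok$i").isSuccess) accepted++
    }
    channel.close()
    return accepted
}

fun main() {
    // 100 tokens offered to a 64-slot buffer with no reader
    println(acceptedBySlowClient(capacity = 64, count = 100)) // prints 64
}
```

The upstream loop spends the same small amount of time on a stalled client as on a healthy one, which is the property that prevents cascading stalls.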
| Approach | Memory Under Load | Slow Client Impact | Failure Mode |
|---|---|---|---|
| Unbounded list per client | Grows without limit | Heap exhaustion | OOM kill, all clients die |
| Single shared channel | Bounded | Slowest client blocks all | Head‑of‑line blocking |
| Bounded channel per client | Predictable ceiling | Only that client affected | Graceful disconnect |
Step 3: Run the Memory Math
Here is the gotcha that will save you hours. This arithmetic determines your actual concurrency ceiling:
| Component | Per‑Connection Cost | At 10K Connections |
|---|---|---|
| Coroutine stack | ~1–2 KB | 10–20 MB |
| Bounded channel (64 slots × 40 B) | ~2.5 KB | 25 MB |
| Ktor/Netty response buffer | ~8 KB | 80 MB |
| Connection metadata + headers | ~1 KB | 10 MB |
| Total per connection | ~13 KB | ~130 MB |
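On a 4 GB container with ~2.5 GB available heap (after JVM overhead, metaspace, GC headroom), the ~130 MB above is only the steady-state baseline. The table does not count per-token string allocations, the upstream LLM client, or the garbage that piles up between collections, so the practical limit arrives far earlier than the raw division suggests. Treat roughly 12,000 connections as the point where pressure mounts, and target 8,000–10,000 in practice to leave room for burst traffic and GC breathing room. If you need more, scale horizontally rather than increasing buffer sizes.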
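The table's arithmetic is easy to rerun for your own buffer sizes. The sketch below assumes the per-connection figures from the table (using the 2 KB upper bound for the coroutine); the ConnectionCost class and totalMegabytes function are hypothetical names for this illustration, not part of any library.

```kotlin
// Per-connection costs in bytes, taken from the table above.
data class ConnectionCost(
    val coroutine: Int = 2_048,      // upper bound of the ~1–2 KB estimate
    val channelSlots: Int = 64,
    val bytesPerSlot: Int = 40,
    val responseBuffer: Int = 8_192,
    val metadata: Int = 1_024,
) {
    val channel: Int get() = channelSlots * bytesPerSlot
    val total: Int get() = coroutine + channel + responseBuffer + metadata
}

// Baseline memory for `connections` open streams, in MiB.
fun totalMegabytes(cost: ConnectionCost, connections: Int): Double =
    cost.total.toLong() * connections / (1024.0 * 1024.0)

fun main() {
    val base = ConnectionCost()
    val bigger = ConnectionCost(channelSlots = 128)
    println("64 slots:  ${base.total} B per connection")   // 13824 B, about 13.5 KB
    println("128 slots: ${bigger.total} B per connection") // 16384 B, exactly 16 KB
    println("10K connections at 64 slots:  ${totalMegabytes(base, 10_000)} MiB")
    println("10K connections at 128 slots: ${totalMegabytes(bigger, 10_000)} MiB")
}
```

Doubling the channel from 64 to 128 slots adds 2.5 KB to every connection, which is why adding pods is a better lever than growing buffers.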
Step 4: Implement Connection Draining
During rolling deployments, you can’t just kill 10,000 open SSE connections. A reliable pattern:
- Stop accepting new connections. Remove the pod from the load balancer.
- Send a custom SSE event (event: reconnect) telling clients to reconnect to a healthy pod.
- Set a drain deadline (e.g., 30 seconds) and forcibly close remaining connections after it expires.
- Use structured concurrency so coroutineScope ensures all child coroutines complete or cancel cleanly.
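import io.ktor.http.ContentType
import io.ktor.server.application.ApplicationCall
import io.ktor.server.response.respondTextWriter
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch

// llmStream is assumed to be a Flow<String> of tokens defined elsewhere
suspend fun handleSse(call: ApplicationCall) = coroutineScope {
    val channel = Channel<String>(capacity = 64)
    // Producer: reads from the LLM stream and feeds this client's bounded channel
    launch {
        llmStream.collect { token -> fanOut(listOf(channel), token) }
        channel.close() // end of stream: lets the writer loop below finish
    }
    // Consumer: one streaming response per connection, one SSE frame per token
    // (calling respondText inside the loop would fail on the second token)
    call.respondTextWriter(contentType = ContentType.Text.EventStream) {
        for (msg in channel) {
            write("data: $msg\n\n")
            flush()
        }
    }
}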
Without draining, the SIGTERM that Kubernetes sends during a rollout ends with TCP connections being reset, and users see a broken stream with no retry hint.
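The reconnect event and the drain deadline from the list above can be sketched roughly as follows. This is an illustration under assumptions, not a drop-in implementation: SseConnection and drain are hypothetical names, each channel is assumed to carry fully formatted SSE frames, the data: drain payload is made up, and removing the pod from the load balancer is assumed to have happened already.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Hypothetical registry entry: one per open SSE connection.
// `outbound` carries preformatted SSE frames; `job` is the coroutine writing them out.
class SseConnection(val outbound: Channel<String>, val job: Job)

// Asks every client to reconnect, waits up to the deadline, then force-closes the rest.
// Returns the number of connections that had to be cancelled.
suspend fun drain(connections: List<SseConnection>, deadlineMs: Long = 30_000): Int {
    connections.forEach { conn ->
        // Best effort: if the buffer is already full, the client just sees the close
        conn.outbound.trySend("event: reconnect\ndata: drain\n\n")
        // No more tokens; a healthy writer flushes what is buffered and finishes
        conn.outbound.close()
    }
    // Give writers until the deadline to finish on their own
    withTimeoutOrNull(deadlineMs) { connections.forEach { it.job.join() } }
    // Anything still running after the deadline is cancelled
    val stragglers = connections.filter { it.job.isActive }
    stragglers.forEach { it.job.cancel() }
    return stragglers.size
}
```

A shutdown hook would call drain with the live connection registry before stopping the server, and the deadline should stay below the pod's terminationGracePeriodSeconds so Kubernetes does not kill the process mid-drain.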
Gotchas
- Unbounded queues are silent killers. A single stalled client accumulating 50,000 tokens at ~40 bytes each eats ~2 MB. Multiply by a few hundred slow mobile clients and that is hundreds of megabytes of heap with no upper bound.
- Disconnecting slow clients feels aggressive, but the alternative is an OOM that disconnects everyone. Dropping one to save thousands is the right trade‑off.
- Structured concurrency is non‑negotiable. Every SSE connection must run inside a coroutineScope tied to the request lifecycle. When a client disconnects, the coroutine cancels. When the server drains, all children cancel cooperatively. No leaked coroutines, no zombie connections.
- Retrofitting draining after an incident is miserable. Implement it from day one. You’ll thank yourself the first time you push a hotfix under load.
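The structured concurrency point is easy to check with a small experiment. The sketch below assumes kotlinx.coroutines; leakedAfterDrain is a hypothetical name, and the awaitCancellation() calls are stand-ins for real connection handlers.

```kotlin
import kotlinx.coroutines.*

// Launches `connections` stand-in handlers under one parent scope, cancels the
// parent, and returns how many children are still active afterwards.
fun leakedAfterDrain(connections: Int): Int = runBlocking {
    val server = CoroutineScope(SupervisorJob() + Dispatchers.Default)
    // Each child suspends until cancelled, like a connection waiting for tokens
    val jobs = List(connections) { server.launch { awaitCancellation() } }
    server.cancel()   // one call cancels every child of the scope
    jobs.joinAll()    // wait for cancellation to finish
    jobs.count { it.isActive }
}

fun main() {
    println(leakedAfterDrain(1_000)) // prints 0
}
```

Coroutines launched outside the parent scope, for example with GlobalScope, would not be reached by that single cancel call, which is where leaked coroutines come from.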
Wrapping Up
Budget ~13–15 KB per SSE connection. Use bounded channels (32–128 slots) per client with trySend for non‑blocking fan‑out. Implement connection draining from day one with a reconnect event and a hard deadline. On a 4 GB container, plan for 8K–10K connections max, then scale horizontally.
The docs don’t mention this, but the architecture isn’t complex — it’s disciplined. Bounded buffers, predictable memory, cooperative cancellation. That’s what keeps your server running at 10K concurrent streams.