Streaming LLM Tokens to 10K Concurrent Users

Published: 12 hours ago (May 11, 2026 at 03:15 AM EDT)

Source: Dev.to

What We’re Building

Let me show you the architecture that keeps 10,000 concurrent SSE connections alive while streaming LLM tokens — without melting your server. We’ll walk through coroutine‑per‑connection fan‑out, bounded channel buffers for backpressure, connection draining for zero‑downtime deploys, and the per‑connection memory math that determines your real ceiling on a 4 GB container.

Prerequisites

Kotlin coroutines and Channel basics
Familiarity with Server‑Sent Events (SSE)
A Ktor or Netty‑based HTTP server
Understanding of Kubernetes pod lifecycle (helpful, not required)

Step 1: Understand the Problem

LLM APIs emit tokens every 20–80 ms. When you proxy those tokens to thousands of users via SSE, every connection becomes a long‑lived coroutine holding an open HTTP response. One slow client that can’t consume fast enough bloats your buffers, and without backpressure, you’re one GC pause away from an OOM kill.
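For reference, every token proxied this way travels as a small text frame in the SSE wire format: an optional event: line, one data: line per line of payload, and a blank line to end the frame. The helper below is a minimal sketch of that framing; the name sseFrame is made up for illustration and is not a Ktor API.

```kotlin
// Hypothetical helper: formats one Server-Sent Events frame.
// Multi-line payloads need one "data:" line per line of text,
// and a blank line terminates the frame.
fun sseFrame(data: String, event: String? = null): String = buildString {
    if (event != null) append("event: ").append(event).append('\n')
    data.lines().forEach { line -> append("data: ").append(line).append('\n') }
    append('\n')
}

fun main() {
    print(sseFrame("Hello"))
    print(sseFrame("{}", event = "reconnect"))
}
```

At one frame every 20–80 ms per connection, the frames themselves are tiny; the cost comes from how many of them can pile up behind a client that isn't reading.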

The naive approach — unbounded lists, no draining strategy, fire‑and‑forget writes — collapses around 2,000 connections. Below is the minimal setup to get this working at scale.

Step 2: Wire Up Bounded Channels for Fan‑Out

The core pattern is a bounded Channel per SSE connection, fed by a shared upstream coroutine consuming the LLM stream:

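import kotlinx.coroutines.channels.Channel

fun fanOut(clients: List<Channel<String>>, token: String) {
    clients.forEach { channel ->
        // Non‑blocking send: a full buffer fails immediately instead of suspending
        if (channel.trySend(token).isFailure) {
            // Skipping a token would corrupt the text, so end this client's stream instead
            channel.close()
        }
    }
}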

Each client gets its own bounded channel (I recommend 32–128 slots). When a slow client fills its buffer, trySend fails immediately instead of suspending, so the upstream never blocks and stalls don't cascade. Treat that failure as the signal to disconnect the slow client.
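To see the bounded buffer behave this way, here is a small sketch that offers three tokens to a channel with only two slots and no reader. The function name offerAll is hypothetical, and the capacity of 2 is chosen just to keep the example short.

```kotlin
import kotlinx.coroutines.channels.Channel

// Hypothetical helper: offers tokens until the buffer is full.
// Returns how many were accepted and closes the channel on the first failure.
fun offerAll(channel: Channel<String>, tokens: List<String>): Int {
    var accepted = 0
    for (token in tokens) {
        if (channel.trySend(token).isSuccess) {
            accepted++
        } else {
            channel.close() // slow client: stop feeding it
            break
        }
    }
    return accepted
}

fun main() {
    val slowClient = Channel<String>(capacity = 2) // nobody is receiving
    val accepted = offerAll(slowClient, listOf("The", " quick", " fox"))
    println("accepted=$accepted")
    // Tokens buffered before the close can still be delivered to the client.
    println("first buffered=${slowClient.tryReceive().getOrNull()}")
}
```

Closing rather than cancelling the channel lets the consumer flush what is already buffered before its loop ends, which is what makes the disconnect graceful.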

Approach	Memory Under Load	Slow Client Impact	Failure Mode
Unbounded list per client	Grows without limit	Heap exhaustion	OOM kill, all clients die
Single shared channel	Bounded	Slowest client blocks all	Head‑of‑line blocking
Bounded channel per client	Predictable ceiling	Only that client affected	Graceful disconnect

Step 3: Run the Memory Math

Here is the gotcha that will save you hours. This arithmetic determines your actual concurrency ceiling:

Component	Per‑Connection Cost	At 10K Connections
Coroutine stack	~1–2 KB	10–20 MB
Bounded channel (64 slots × 40 B)	~2.5 KB	25 MB
Ktor/Netty response buffer	~8 KB	80 MB
Connection metadata + headers	~1 KB	10 MB
Total per connection	~13 KB	~130 MB

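The table covers only fixed per‑connection structures: at ~13 KB each, even 12,000 connections hold roughly 156 MB. The practical ceiling comes from what the table leaves out, presumably transient per‑token allocations and the GC work they create. On a 4 GB container with ~2.5 GB available heap (after JVM overhead, metaspace, GC headroom), you land at roughly 12,000 connections before pressure mounts, so confirm that figure under load rather than deriving it from the table alone. In practice, target 8,000–10,000 to leave room for burst traffic and GC breathing room. If you need more, scale horizontally. Don’t increase buffer sizes.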
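The table's arithmetic is easy to keep next to your config so it updates when you change the buffer size. This is a sketch only: the class and function names are made up, the byte counts are the estimates from the table (taking the upper bound of 2 KB for the coroutine), and none of them are measured values.

```kotlin
// Per-connection estimates copied from the table above; all are assumptions, not measurements.
data class ConnectionBudget(
    val coroutineBytes: Long = 2 * 1024,      // upper end of the ~1–2 KB range
    val channelSlots: Int = 64,
    val bytesPerSlot: Long = 40,
    val responseBufferBytes: Long = 8 * 1024,
    val metadataBytes: Long = 1 * 1024,
) {
    val channelBytes: Long get() = channelSlots * bytesPerSlot
    val totalBytes: Long
        get() = coroutineBytes + channelBytes + responseBufferBytes + metadataBytes
}

// Fixed memory held by `connections` open streams, in MiB.
fun fixedCostMiB(budget: ConnectionBudget, connections: Int): Double =
    budget.totalBytes * connections / (1024.0 * 1024.0)

fun main() {
    val budget = ConnectionBudget()
    println("per connection: ${budget.totalBytes} bytes")
    println("10,000 connections: ${fixedCostMiB(budget, 10_000)} MiB")
    // Doubling the channel to 128 slots adds 2,560 bytes per connection.
    println("with 128 slots: ${budget.copy(channelSlots = 128).totalBytes} bytes")
}
```

With the defaults this comes to 13,824 bytes per connection, or about 132 MiB at 10,000 connections, which matches the table's ~13 KB and ~130 MB rows.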

Step 4: Implement Connection Draining

During rolling deployments, you can’t just kill 10,000 open SSE connections. A reliable pattern:

Stop accepting new connections. Remove the pod from the load balancer.
Send a custom SSE event (event: reconnect) telling clients to reconnect to a healthy pod.
Set a drain deadline (e.g., 30 seconds) and forcibly close remaining connections after it expires.
Use structured concurrency so coroutineScope ensures all child coroutines complete or cancel cleanly.

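import io.ktor.http.ContentType
import io.ktor.server.application.ApplicationCall
import io.ktor.server.response.respondTextWriter
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.flow.Flow

suspend fun handleSse(call: ApplicationCall, llmStream: Flow<String>) = coroutineScope {
    val channel = Channel<String>(capacity = 64)
    // Producer: reads from the LLM stream and closes the channel when the stream ends
    val producer = launch {
        try { llmStream.collect { token -> fanOut(listOf(channel), token) } }
        finally { channel.close() }
    }
    // Consumer: one streaming response, one SSE frame per token
    try {
        call.respondTextWriter(contentType = ContentType.Text.EventStream) {
            for (msg in channel) {
                write("data: $msg\n\n")
                flush()
            }
        }
    } finally {
        producer.cancel() // client disconnected or the write failed
    }
}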

Without draining, Kubernetes sends SIGTERM to your pod, open TCP connections are reset, and users see a broken stream with no retry hint.
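The drain sequence listed earlier in this step can be sketched as a small coordinator that you call from your shutdown hook. Everything here is hypothetical scaffolding rather than a Ktor or Kubernetes API: the Connection and DrainCoordinator classes, the assumption that each connection's outbox carries pre‑formatted SSE frames, and the empty JSON payload on the reconnect event.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel
import java.util.concurrent.ConcurrentHashMap

// Hypothetical handle for one live SSE connection.
class Connection(val job: Job, val outbox: Channel<String>)

class DrainCoordinator(private val deadlineMs: Long = 30_000) {
    private val connections = ConcurrentHashMap.newKeySet<Connection>()

    @Volatile
    var accepting = true
        private set

    // Returns false once draining has started, so the caller can reject the request.
    fun register(c: Connection): Boolean {
        if (!accepting) return false
        connections.add(c)
        c.job.invokeOnCompletion { connections.remove(c) }
        return true
    }

    // Returns how many connections had to be closed forcibly.
    suspend fun drain(): Int {
        accepting = false                                   // 1. stop accepting
        val snapshot = connections.toList()
        snapshot.forEach {                                  // 2. ask clients to reconnect
            it.outbox.trySend("event: reconnect\ndata: {}\n\n")
        }
        withTimeoutOrNull(deadlineMs) {                     // 3. wait up to the deadline
            snapshot.map { it.job }.joinAll()
        }
        val stragglers = snapshot.filter { it.job.isActive }
        stragglers.forEach { it.job.cancel() }              // 4. cancel whatever is left
        return stragglers.size
    }
}
```

A client whose buffer is already full will miss the reconnect hint because trySend fails for it; that client is handled by the deadline instead. Removing the pod from the load balancer is still up to your readiness probe, which could report unready whenever accepting is false.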

Gotchas

Unbounded queues are silent killers. A single stalled client accumulating 50,000 tokens at ~40 bytes each eats ~2 MB. A few hundred slow mobile clients already hold several hundred megabytes that way, and because the queues never stop growing, they eventually take the whole heap.
Disconnecting slow clients feels aggressive, but the alternative is an OOM that disconnects everyone. Dropping one to save thousands is the right trade‑off.
Structured concurrency is non‑negotiable. Every SSE connection must run inside a coroutineScope tied to the request lifecycle. When a client disconnects, the coroutine cancels. When the server drains, all children cancel cooperatively. No leaked coroutines, no zombie connections.
Retrofitting draining after an incident is miserable. Implement it from day one. You’ll thank yourself the first time you push a hotfix under load.
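The structured‑concurrency point is easy to verify in isolation. The sketch below starts a few stand‑in connection coroutines under one parent, cancels the parent, and counts how many children ran their cleanup block. The function name cancelAndCount is made up, and awaitCancellation stands in for a real connection's read and write loop.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel
import java.util.concurrent.atomic.AtomicInteger

// Launches `n` child coroutines under one parent, cancels the parent,
// and returns how many children executed their finally block.
suspend fun cancelAndCount(n: Int): Int = coroutineScope {
    val cleanedUp = AtomicInteger(0)
    val ready = Channel<Unit>(capacity = n)
    val server = launch {
        repeat(n) {
            launch {
                try {
                    ready.send(Unit)    // signal that this "connection" is running
                    awaitCancellation() // stand-in for the streaming loop
                } finally {
                    cleanedUp.incrementAndGet() // release per-connection resources here
                }
            }
        }
    }
    repeat(n) { ready.receive() } // wait until every child has started
    server.cancelAndJoin()        // cancelling the parent cancels all children
    cleanedUp.get()
}

fun main() = runBlocking {
    println("cleaned up: ${cancelAndCount(3)}") // prints "cleaned up: 3"
}
```

No child needs to be tracked or cancelled by hand: cancelAndJoin on the parent returns only after every child has finished its cleanup.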

Wrapping Up

Budget ~13–15 KB per SSE connection. Use bounded channels (32–128 slots) per client with trySend for non‑blocking fan‑out. Implement connection draining from day one with a reconnect event and a hard deadline. On a 4 GB container, plan for 8K–10K connections max, then scale horizontally.

The architecture isn’t complex; it’s disciplined. Bounded buffers, predictable memory, cooperative cancellation: that’s what keeps your server running at 10K concurrent streams.
