Streaming LLM Tokens to 10K Concurrent Users

Published: May 11, 2026 at 03:15 AM EDT
4 min read
Source: Dev.to

What We’re Building

Let me show you the architecture that keeps 10,000 concurrent SSE connections alive while streaming LLM tokens — without melting your server. We’ll walk through coroutine‑per‑connection fan‑out, bounded channel buffers for backpressure, connection draining for zero‑downtime deploys, and the per‑connection memory math that determines your real ceiling on a 4 GB container.

Prerequisites

  • Kotlin coroutines and Channel basics
  • Familiarity with Server‑Sent Events (SSE)
  • A Ktor or Netty‑based HTTP server
  • Understanding of Kubernetes pod lifecycle (helpful, not required)

Step 1: Understand the Problem

LLM APIs emit tokens every 20–80 ms. When you proxy those tokens to thousands of users via SSE, every connection becomes a long‑lived coroutine holding an open HTTP response. One slow client that can’t consume fast enough bloats your buffers, and without backpressure, you’re one GC pause away from an OOM kill.
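To put a number on that load, here is a quick back-of-the-envelope sketch. It assumes every connection is actively streaming and that each token becomes exactly one SSE write; both are simplifications, and the function names are just for illustration.

```kotlin
// Tokens per second for one stream, given the gap between tokens in milliseconds.
fun tokensPerSecond(intervalMs: Int): Double = 1000.0 / intervalMs

// Aggregate SSE writes per second if every connection is streaming at once.
fun writesPerSecond(connections: Int, intervalMs: Int): Double =
    connections * tokensPerSecond(intervalMs)

fun main() {
    println(writesPerSecond(10_000, 80)) // slow end of the range: 125000.0
    println(writesPerSecond(10_000, 20)) // fast end of the range: 500000.0
}
```

At 10K connections, the 20–80 ms token interval works out to somewhere between 125,000 and 500,000 small writes per second, which is why a single stalled consumer cannot be allowed to hold anything up.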

The naive approach — unbounded lists, no draining strategy, fire‑and‑forget writes — collapses around 2,000 connections. Below is the minimal setup to get this working at scale.

Step 2: Wire Up Bounded Channels for Fan‑Out

The core pattern is a bounded Channel per SSE connection, fed by a shared upstream coroutine consuming the LLM stream:

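import kotlinx.coroutines.channels.Channel

fun fanOut(clients: List<Channel<String>>, token: String) {
    clients.forEach { channel ->
        // Non-blocking send: never suspends the upstream producer
        if (channel.trySend(token).isFailure) {
            // Buffer full (or already closed): disconnect this client
            // rather than silently dropping a token mid-stream
            channel.close()
        }
    }
}

Each client gets its own bounded channel (I recommend 32–128 slots). When a slow client fills its buffer, trySend fails immediately and fanOut closes that client's channel: the client receives whatever is already buffered and is then disconnected, instead of silently missing tokens. No blocking the upstream, no cascading stalls.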

| Approach | Memory under load | Slow-client impact | Failure mode |
|---|---|---|---|
| Unbounded list per client | Grows without limit | Heap exhaustion | OOM kill, all clients die |
| Single shared channel | Bounded | Slowest client blocks all | Head-of-line blocking |
| Bounded channel per client | Predictable ceiling | Only that client affected | Graceful disconnect |
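The bounded-buffer behavior is easy to check in isolation. This minimal sketch (the helper name is made up for illustration) fills a channel that nobody is reading and counts how many tokens it accepts:

```kotlin
import kotlinx.coroutines.channels.Channel

// Returns how many of `total` tokens a bounded channel accepts when nobody is reading.
fun acceptedWithoutReader(capacity: Int, total: Int): Int {
    val channel = Channel<String>(capacity)
    var accepted = 0
    repeat(total) { i ->
        // trySend never suspends; it reports failure once the buffer is full
        if (channel.trySend("tok$i").isSuccess) accepted++
    }
    channel.close()
    return accepted
}

fun main() {
    // A stalled client with a 64-slot buffer: tokens 65..100 are rejected immediately.
    println(acceptedWithoutReader(capacity = 64, total = 100)) // prints 64
}
```

The memory held for a stalled client stops growing at the channel's capacity, which is what makes the per-connection cost predictable.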

Step 3: Run the Memory Math

Here is the gotcha that will save you hours. This arithmetic determines your actual concurrency ceiling:

| Component | Per-connection cost | At 10K connections |
|---|---|---|
| Coroutine stack | ~1–2 KB | 10–20 MB |
| Bounded channel (64 slots × 40 B) | ~2.5 KB | 25 MB |
| Ktor/Netty response buffer | ~8 KB | 80 MB |
| Connection metadata + headers | ~1 KB | 10 MB |
| Total | ~13 KB | ~130 MB |

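The table covers only steady-state per-connection state: ~130 MB at 10K connections, a small fraction of the ~2.5 GB of heap available on a 4 GB container (after JVM overhead, metaspace, and GC headroom). The practical ceiling of roughly 12,000 connections before pressure mounts therefore likely reflects what the table leaves out, such as transient per-token allocations and the GC work they create, so treat it as an empirical limit rather than the result of dividing heap by 13 KB. In practice, target 8,000–10,000 to leave room for burst traffic and GC pauses. If you need more, scale horizontally; don't increase buffer sizes.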
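If you want to rerun the table with your own numbers, the arithmetic fits in a few lines. The values below are the table's estimates (using the 2 KB upper bound for the coroutine), not measurements; swap in figures from your own heap dumps.

```kotlin
// Per-connection estimates in bytes, taken from the table above.
val perConnectionBytes = mapOf(
    "coroutine" to 2 * 1024,
    "channel (64 slots x 40 B)" to 64 * 40,
    "response buffer" to 8 * 1024,
    "metadata + headers" to 1 * 1024,
)

// Total steady-state bytes for a given number of open connections.
fun totalBytes(connections: Int): Long =
    perConnectionBytes.values.sum().toLong() * connections

fun main() {
    println("per connection: ${perConnectionBytes.values.sum()} B")          // 13824 B
    println("10K connections: ${totalBytes(10_000) / (1024 * 1024)} MiB")    // 131 MiB
}
```

Changing the channel capacity from 64 to 128 slots adds about 2.5 KB per connection, or roughly 25 MB at 10K connections, which shows how quickly larger buffers eat into headroom.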

Step 4: Implement Connection Draining

During rolling deployments, you can’t just kill 10,000 open SSE connections. A reliable pattern:

  1. Stop accepting new connections. Remove the pod from the load balancer.
  2. Send a custom SSE event (event: reconnect) telling clients to reconnect to a healthy pod.
  3. Set a drain deadline (e.g., 30 seconds) and forcibly close remaining connections after it expires.
  4. Use structured concurrency so coroutineScope ensures all child coroutines complete or cancel cleanly.
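
A per-connection handler built this way (Ktor 2.x package names; llmStream is assumed to be a Flow<String> defined elsewhere):

import io.ktor.http.ContentType
import io.ktor.server.application.ApplicationCall
import io.ktor.server.response.respondTextWriter
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch

suspend fun handleSse(call: ApplicationCall) = coroutineScope {
    val channel = Channel<String>(capacity = 64)
    // Producer: reads from the LLM stream and feeds this client's buffer
    val producer = launch {
        try {
            llmStream.collect { token -> fanOut(listOf(channel), token) }
        } finally {
            channel.close() // ends the consumer loop once the stream completes or is cancelled
        }
    }
    // Consumer: one streaming response per call; respondText cannot be called once per token
    call.respondTextWriter(contentType = ContentType.Text.EventStream) {
        for (msg in channel) {
            write("data: $msg\n\n") // SSE frame; assumes msg contains no newlines
            flush()
        }
    }
    producer.cancel() // stop reading upstream if the channel was closed early
}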

Without this, Kubernetes will SIGTERM your pod, TCP connections reset, and users see a broken stream with no retry hint.
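Steps 2 and 3 of the list above can be sketched roughly as follows. This is a simulation under stated assumptions, not a drop-in implementation: Connection, drain, and RECONNECT_FRAME are hypothetical names, the writers are stand-ins for real response writers, and removing the pod from the load balancer (step 1) happens outside this code.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Hypothetical per-connection handle: the client's outbound buffer plus the coroutine writing it.
class Connection(val outbound: Channel<String>, val writer: Job)

// SSE frame for the custom event. A data line is included because browsers'
// EventSource does not dispatch events whose data is empty.
const val RECONNECT_FRAME = "event: reconnect\ndata: {}\n\n"

// Returns the number of connections that had to be force-closed after the deadline.
suspend fun drain(connections: List<Connection>, deadlineMs: Long): Int {
    // Step 2: queue the reconnect event (best effort; skipped if the buffer is full),
    // then close so each writer finishes after flushing what is buffered.
    connections.forEach { c ->
        c.outbound.trySend(RECONNECT_FRAME)
        c.outbound.close()
    }
    // Step 3: wait for writers to finish, but only until the deadline.
    withTimeoutOrNull(deadlineMs) { connections.forEach { it.writer.join() } }
    // Force-close whatever is still running.
    val stragglers = connections.filter { it.writer.isActive }
    stragglers.forEach { it.writer.cancelAndJoin() }
    return stragglers.size
}

fun main() = runBlocking {
    val conns = List(3) { i ->
        val ch = Channel<String>(64)
        val writer = launch {
            for (frame in ch) {
                if (i == 2) delay(60_000) // simulate one stalled client
            }
        }
        Connection(ch, writer)
    }
    println("force-closed: ${drain(conns, deadlineMs = 200)}") // force-closed: 1
}
```

In a Kubernetes deployment the drain deadline has to fit inside the pod's termination grace period, otherwise the pod is killed before the drain finishes.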

Gotchas

  • Unbounded queues are silent killers. A single stalled client accumulating 50,000 tokens at ~40 bytes each eats ~2 MB. Multiply by a few hundred slow mobile clients and several hundred megabytes of heap are pinned by clients that aren't reading, with nothing to stop the growth.
  • Disconnecting slow clients feels aggressive, but the alternative is an OOM that disconnects everyone. Dropping one to save thousands is the right trade‑off.
  • Structured concurrency is non‑negotiable. Every SSE connection must run inside a coroutineScope tied to the request lifecycle. When a client disconnects, the coroutine cancels. When the server drains, all children cancel cooperatively. No leaked coroutines, no zombie connections.
  • Retrofit draining after an incident is miserable. Implement it from day one. You’ll thank yourself the first time you push a hotfix under load.
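The structured-concurrency point can be checked in isolation. In this small sketch (the function name and the simulated disconnect are assumptions for illustration), the writer fails mid-stream and coroutineScope cancels the sibling producer before returning:

```kotlin
import kotlinx.coroutines.*
import java.io.IOException

// Returns true if the producer was cancelled when the writer failed.
suspend fun producerCancelledOnDisconnect(): Boolean {
    var producerCancelled = false
    try {
        coroutineScope {
            // Stand-in for the upstream LLM reader: runs until cancelled.
            launch {
                try {
                    awaitCancellation()
                } finally {
                    producerCancelled = true
                }
            }
            // Stand-in for the response writer: the client goes away mid-stream.
            launch {
                delay(50)
                throw IOException("client disconnected")
            }
        }
    } catch (e: IOException) {
        // coroutineScope rethrows the child's failure after all children have finished.
    }
    return producerCancelled
}

fun main() = runBlocking {
    println(producerCancelledOnDisconnect()) // prints true
}
```

Had the two coroutines been launched in GlobalScope instead, the producer would keep reading from the upstream after the client was gone, which is exactly the leaked-coroutine case the bullet warns about.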

Wrapping Up

Budget ~13–15 KB per SSE connection. Use bounded channels (32–128 slots) per client with trySend for non‑blocking fan‑out. Implement connection draining from day one with a reconnect event and a hard deadline. On a 4 GB container, plan for 8K–10K connections max, then scale horizontally.

None of this is complicated architecture; it's disciplined architecture. Bounded buffers, predictable memory, cooperative cancellation. That's what keeps your server running at 10K concurrent streams.
