Scaling WebSockets to 100k Connections: Lessons from a Real-Time Cricket App
Source: Dev.to
Initial Attempt and Its Limits
- Setup: One Node.js process running socket.io; every connected client subscribed to every live match.
- Performance:
  - 2 k concurrent connections – worked fine.
  - 15 k connections – heartbeats started dropping.
  - 40 k connections – event‑loop lag crossed 3 s; reconnection storms made everything worse.
Takeaway: A single Node process caps out somewhere between 20 k and 40 k sockets, depending on what else the event loop is doing. Broadcasting to all clients from a single process is O(N) per event, so one hot match can monopolize the whole loop. Reconnection storms are real: when you restart a gateway, every disconnected client reconnects within ~2 s, creating a self‑inflicted DDoS.
Key Lessons
- Stateless Gateways – WebSocket nodes should be “dumb” and hold only connections; no business logic.
- Redis Pub/Sub Bus – Use Redis channels keyed by match_id; each gateway subscribes and fans out locally.
- Sticky Sessions – ALB‑level sticky sessions (via cookie) keep a client attached to the same gateway, avoiding state thrashing.
Architecture Redesign
score provider → ingest worker → Redis PUB match:123
↘ N gateways SUB match:123 → WS push to clients
- Horizontal scaling: add gateway nodes; Redis fans out to all of them.
- A single Redis cluster can handle hundreds of thousands of pub/sub messages per second.
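The fan-out pattern above can be sketched without any external dependency. This is a minimal illustration, not the article's actual gateway code: `InMemoryBus` is a hypothetical stand-in that mimics Redis PUBLISH/SUBSCRIBE semantics, and `Client`, `Gateway`, `join`, and `leave` are assumed names. The point it shows is that each gateway subscribes to `match:<id>` once, no matter how many local clients watch that match, and then pushes to its own sockets.

```typescript
// Minimal stand-in for a WebSocket connection (assumed shape, not a library type).
interface Client {
  id: string;
  send(payload: string): void;
}

type Handler = (msg: string) => void;

// Hypothetical in-memory bus mimicking Redis pub/sub for this sketch only.
class InMemoryBus {
  private handlers = new Map<string, Set<Handler>>();

  subscribe(channel: string, handler: Handler): void {
    let set = this.handlers.get(channel);
    if (!set) {
      set = new Set<Handler>();
      this.handlers.set(channel, set);
    }
    set.add(handler);
  }

  // Returns the number of subscribers that received the message.
  publish(channel: string, msg: string): number {
    const set = this.handlers.get(channel);
    if (!set) return 0;
    set.forEach((handler) => handler(msg));
    return set.size;
  }
}

// A "dumb" gateway: it only tracks which local clients watch which match.
class Gateway {
  private rooms = new Map<string, Set<Client>>();
  private bus: InMemoryBus;

  constructor(bus: InMemoryBus) {
    this.bus = bus;
  }

  join(matchId: string, client: Client): void {
    const channel = `match:${matchId}`;
    let members = this.rooms.get(channel);
    if (!members) {
      // First local viewer of this match: subscribe once for the whole gateway.
      const created = new Set<Client>();
      this.rooms.set(channel, created);
      this.bus.subscribe(channel, (msg) => created.forEach((c) => c.send(msg)));
      members = created;
    }
    members.add(client);
  }

  // Removes the client; a production gateway would also unsubscribe empty rooms.
  leave(matchId: string, client: Client): void {
    const members = this.rooms.get(`match:${matchId}`);
    if (members) members.delete(client);
  }
}
```

With this shape, the bus delivers one message per gateway and the per-client O(N) loop stays local to each process, which is what lets you add gateways to spread the load.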
Message Optimization
- Delta messages only.
{ "over": 14.3, "runs": 4, "batsman": "Kohli" }
- Compared to sending a full 4 KB snapshot, a 200‑byte delta reduces outbound bandwidth from ~480 MB/s to ~24 MB/s per gateway at 120 k connections. These figures work out if each client receives about one update per second. The 20× reduction dramatically lowers required instance sizes.
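A small sketch of both ideas, under stated assumptions: `diffScorecard` and `outboundMBps` are hypothetical helper names, the scorecard is assumed to be a flat key/value object, and the bandwidth arithmetic assumes one message per client per second with 1 MB = 1,000,000 bytes.

```typescript
// Assumed flat scorecard shape; a real scorecard is likely nested.
type Scorecard = Record<string, string | number>;

// Return only the fields of `next` whose values differ from `prev`.
function diffScorecard(prev: Scorecard, next: Scorecard): Scorecard {
  const delta: Scorecard = {};
  Object.keys(next).forEach((key) => {
    const value = next[key];
    if (value !== undefined && prev[key] !== value) {
      delta[key] = value;
    }
  });
  return delta;
}

// Outbound bandwidth in MB/s for a broadcast to every connection.
function outboundMBps(
  connections: number,
  bytesPerMessage: number,
  messagesPerSecond: number
): number {
  return (connections * bytesPerMessage * messagesPerSecond) / 1e6;
}

const fullSnapshot = outboundMBps(120000, 4000, 1); // 480
const deltaOnly = outboundMBps(120000, 200, 1); // 24
```

The delta function only reports changed or added keys; removed keys would need an explicit tombstone convention, which this sketch leaves out.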
Handling Slow Clients
- A mobile client on 2G may take 8 s to ACK each message.
- Rule: If a client hasn’t ACKed within 5 s, drop the oldest queued messages and send a "resync" event. The client then fetches the full scorecard via a REST endpoint and resumes the WebSocket.
- This trades a small UX hiccup for server stability and prevents OOM crashes.
Graceful Restarts and Deploys
- Add a random 0–5 s jitter to each client’s reconnect delay when a gateway restarts.
- On the server side, drain gateways gracefully: ALB stops sending new connections, existing connections finish their current messages, then the process exits.
- Rolling deployments become a non‑event.
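On the client side, the jitter can be a one-line calculation. The sketch below assumes a hypothetical helper `reconnectDelayMs` with a base delay chosen by the caller; the random source is injectable only so the behaviour can be checked deterministically.

```typescript
// Delay before a reconnect attempt: base delay plus uniform jitter in [0, maxJitterMs).
function reconnectDelayMs(
  baseDelayMs: number,
  maxJitterMs: number = 5000,
  random: () => number = Math.random
): number {
  return baseDelayMs + Math.floor(random() * maxJitterMs);
}
```

A client would pass the result to its timer before reopening the socket. Spreading reconnects over a 5 s window instead of ~2 s lowers the peak connection rate the replacement gateway has to absorb.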
Monitoring Health
Three numbers tell you if real‑time is healthy:
| Metric | Desired Threshold |
|---|---|
| Event‑loop lag (p99) | |

Tip: Use uWebSockets.js from the start; it’s ~5× more efficient than socket.io for raw WebSocket throughput. Build a load‑shedding mechanism early: drop low‑priority events (e.g., “commentary”) before high‑priority ones (e.g., “wicket”).
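The load-shedding idea from the tip can be sketched as a function that trims a batch to a capacity, discarding the lowest-priority events first. Everything here is illustrative: `MatchEvent`, `PRIORITY`, and `shed` are hypothetical names, the numeric priority values are assumptions, and the "score" kind is added only to show a middle tier ("commentary" and "wicket" come from the article).

```typescript
interface MatchEvent {
  kind: string;
  priority: number; // higher means more important
  payload: unknown;
}

// Assumed priority table; only "commentary" and "wicket" are named in the article.
const PRIORITY: Record<string, number> = { commentary: 0, score: 1, wicket: 2 };

// Keep at most `capacity` events, dropping the lowest-priority ones first
// while preserving arrival order among the survivors.
function shed(events: MatchEvent[], capacity: number): MatchEvent[] {
  if (events.length <= capacity) return events.slice();
  return events
    .map((event, index) => ({ event, index }))
    .sort((a, b) => b.event.priority - a.event.priority || a.index - b.index)
    .slice(0, Math.max(0, capacity))
    .sort((a, b) => a.index - b.index)
    .map((entry) => entry.event);
}
```

In practice the capacity would be tied to a health signal such as event-loop lag, so the gateway sheds commentary under pressure and still delivers wickets.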
Conclusion
Whether it’s live sports, collaborative editing, trading platforms, or real‑time dashboards, scaling WebSockets is a discipline with sharp edges. If you’re building in this space, Xenotix Labs has shipped real‑time stacks that survive match‑day India traffic.