Scaling WebSockets to 100k Connections: Lessons from a Real-Time Cricket App

Published: April 24, 2026 at 05:39 PM EDT
3 min read
Source: Dev.to

Initial Attempt and Its Limits

  • Setup: One Node.js process running socket.io; every connected client subscribed to every live match.
  • Performance:
    • 2 k concurrent connections – worked fine.
    • 15 k connections – heartbeats started dropping.
    • 40 k connections – event‑loop lag crossed 3 s; reconnection storms made everything worse.

Takeaway: A single Node process caps out somewhere between 20 k and 40 k sockets, depending on what else the event loop is doing. Broadcasting to all clients from a single process is O(N) per event, so a single hot match can stall the whole loop. Reconnection storms are real: when you restart a gateway, every disconnected client reconnects within ~2 s, creating a self‑inflicted DDoS.

Key Lessons

  1. Stateless Gateways – WebSocket nodes should be “dumb” and hold only connections; no business logic.
  2. Redis Pub/Sub Bus – Use Redis channels keyed by match_id; each gateway subscribes and fans out locally.
  3. Sticky Sessions – ALB‑level sticky sessions (via cookie) keep a client attached to the same gateway, avoiding state thrashing.

Architecture Redesign

score provider → ingest worker → Redis PUB match:123
               ↘ N gateways SUB match:123 → WS push to clients
  • Horizontal scaling: add gateway nodes; Redis fans out to all of them.
  • A single Redis cluster can handle hundreds of thousands of pub/sub messages per second.
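
The redesign above can be sketched in a few lines. This is a minimal, in-memory stand-in for illustration: the Bus class mimics Redis pub/sub (in production you would use a real Redis client, e.g. ioredis, with subscribe/publish), and a "socket" here is anything with a send() method. Both names are assumptions, not the article's actual code.

```javascript
// In-memory stand-in for the Redis pub/sub bus (production: ioredis).
class Bus {
  constructor() { this.subs = new Map(); } // channel -> Set<handler>
  subscribe(channel, handler) {
    if (!this.subs.has(channel)) this.subs.set(channel, new Set());
    this.subs.get(channel).add(handler);
  }
  publish(channel, message) {
    for (const h of this.subs.get(channel) ?? []) h(message);
  }
}

// A "dumb" gateway: holds connections, subscribes per match, fans out locally.
class Gateway {
  constructor(bus) {
    this.bus = bus;
    this.rooms = new Map(); // matchId -> Set<socket>
  }
  join(matchId, socket) {
    if (!this.rooms.has(matchId)) {
      this.rooms.set(matchId, new Set());
      // One bus subscription per match, per gateway; fan-out stays local.
      this.bus.subscribe(`match:${matchId}`, (msg) => {
        for (const s of this.rooms.get(matchId)) s.send(msg);
      });
    }
    this.rooms.get(matchId).add(socket);
  }
}
```

The ingest worker then only needs `bus.publish("match:123", JSON.stringify(delta))`; it never touches individual sockets.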

Message Optimization

  • Delta messages only.
{ "over": 14.3, "runs": 4, "batsman": "Kohli" }
  • Compared to sending a full 4 KB snapshot, a 200‑byte delta cuts outbound bandwidth from ~480 MB/s to ~24 MB/s per gateway at 120 k connections (assuming one broadcast per second). This dramatically lowers required instance sizes.

Handling Slow Clients

  • A mobile client on 2G may take 8 s to ACK each message.
  • Rule: If a client hasn’t ACKed within 5 s, drop the oldest queued messages and send a "resync" event. The client then fetches the full scorecard via a REST endpoint and resumes the WebSocket.
  • This trades a small UX hiccup for server stability and prevents OOM crashes.
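
The rule above might be sketched as a per-client send queue. Time is passed in explicitly so the policy is easy to test; the queue cap and the names are illustrative, and for simplicity this version drops the whole backlog rather than only the oldest messages:

```javascript
const ACK_TIMEOUT_MS = 5000; // the 5 s rule from the text
const MAX_QUEUE = 100;       // illustrative cap on buffered messages

class ClientQueue {
  constructor() {
    this.queue = [];
    this.lastAckAt = 0;
    this.needsResync = false; // gateway should emit a "resync" event
  }
  push(msg, now) {
    if (now - this.lastAckAt > ACK_TIMEOUT_MS || this.queue.length >= MAX_QUEUE) {
      // Client is too far behind: drop the backlog, tell it to resync
      // via the REST scorecard endpoint instead of buffering forever.
      this.queue = [];
      this.needsResync = true;
    }
    if (!this.needsResync) this.queue.push(msg);
  }
  ack(now) {
    this.lastAckAt = now;
    this.needsResync = false;
  }
}
```

Capping per-client buffers like this is what prevents one 2G client from holding megabytes of undelivered deltas in gateway memory.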

Graceful Restarts and Deploys

  • Add a random 0–5 s jitter to each client’s reconnect delay when a gateway restarts.
  • On the server side, drain gateways gracefully: ALB stops sending new connections, existing connections finish their current messages, then the process exits.
  • Rolling deployments become a non‑event.
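
The jittered reconnect can be as simple as the following client-side helper; baseDelayMs and the injectable rng are illustrative choices, not the article's actual code:

```javascript
const JITTER_MAX_MS = 5000; // the 0-5 s jitter window from the text

// Spread reconnect attempts so a restarted gateway is not hit by
// every disconnected client in the same instant.
function reconnectDelay(baseDelayMs = 1000, rng = Math.random) {
  return baseDelayMs + Math.floor(rng() * JITTER_MAX_MS);
}
```

A client would call `setTimeout(connect, reconnectDelay())` on disconnect instead of reconnecting immediately.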

Monitoring Health

Three numbers tell you if real‑time is healthy:

  • Event‑loop lag (p99)

Tip: Use uWebSockets.js from the start — it’s ~5× more efficient than socket.io for raw WebSocket throughput. Build a load‑shedding mechanism early: drop low‑priority events (e.g., “commentary”) before high‑priority ones (e.g., “wicket”).
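
The load-shedding tip might look like the filter below. The priority tiers are assumptions beyond the two event types named in the text:

```javascript
// Illustrative priority tiers; only "commentary" and "wicket" come
// from the article, "score" is an assumed middle tier.
const PRIORITY = { wicket: 2, score: 1, commentary: 0 };

// When the gateway is overloaded, drop low-priority events first
// instead of letting event-loop lag grow for everyone.
function shed(events, overloaded) {
  if (!overloaded) return events;
  return events.filter((e) => (PRIORITY[e.type] ?? 0) >= 1);
}
```

The overload signal itself could be the p99 event-loop lag metric mentioned above.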

Conclusion

Whether it’s live sports, collaborative editing, trading platforms, or real‑time dashboards — scaling WebSockets is a discipline with sharp edges. If you’re building in this space, Xenotix Labs has shipped real‑time stacks that survive match‑day India traffic. Reach out at .
