I built a real-time audio pipeline from the browser to my server. Here's what actually works.

Published: February 26, 2026 at 05:41 PM EST

3 min read

Source: Dev.to

Getting audio from a browser to a server in real‑time sounds like a two‑line solution. It isn’t.

I built this pipeline for LiveSuggest, an AI assistant that listens to meetings and gives suggestions as the conversation happens. That means streaming audio continuously, with as little delay as possible, across a WebSocket connection that can drop at any time.

The pipeline

Here’s the full chain:

Capture audio with getUserMedia (mic) or getDisplayMedia (tab audio)
Feed it into a MediaRecorder
Slice it into chunks every N seconds
Encode each chunk to base64
Send it over WebSocket to the server
Server decodes and forwards to a transcription API

Every step has a gotcha.

MediaRecorder is great until it isn’t

MediaRecorder handles encoding for you. I use audio/webm;codecs=opus because it’s widely supported and compresses well.

const mediaRecorder = new MediaRecorder(stream, {
  mimeType: 'audio/webm;codecs=opus',
});
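
The constructor throws if the browser does not support the requested mimeType, so it can be worth probing with MediaRecorder.isTypeSupported first. A minimal sketch, assuming a fallback list: pickMimeType is a hypothetical helper, and every candidate other than audio/webm;codecs=opus is an assumption for illustration, not something this pipeline is known to use.

```javascript
// Preferred types first. Only the first entry comes from the post;
// the fallbacks are assumptions for illustration.
const CANDIDATE_TYPES = [
  'audio/webm;codecs=opus',
  'audio/webm',
  'audio/mp4',
];

// Returns the first supported type, or null if none is supported.
// isSupported is injectable so the function can be exercised outside a browser.
function pickMimeType(
  candidates = CANDIDATE_TYPES,
  isSupported = (type) => MediaRecorder.isTypeSupported(type),
) {
  return candidates.find((type) => isSupported(type)) ?? null;
}

// Browser usage (not executed here):
//   const mimeType = pickMimeType();
//   const recorder = new MediaRecorder(stream, mimeType ? { mimeType } : undefined);
```

If a fallback is chosen, the format field sent to the server should describe what was actually recorded rather than a hard-coded 'webm'.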

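The problem: you don’t control the chunk boundaries. Even if you pass a timeslice to start(), ondataavailable fires when the browser decides to flush, not exactly when you need it, and only the first chunk carries the WebM header, so later chunks aren’t standalone files. If you call mediaRecorder.stop() and start() to force a new segment, you get a new WebM header each time. That’s fine for decoding each segment on its own, but the segments aren’t pieces you can just concatenate into one valid file.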

I settled on 10‑second segments: short enough for responsive transcription, long enough to give the transcription API decent context.
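
One way to get self-contained segments is to rotate recorders on a timer: start a fresh MediaRecorder, then stop the previous one so it flushes a complete WebM file. This is a sketch of that approach, not necessarily how LiveSuggest does it. startSegmentedRecording, onSegment, and the createRecorder hook are hypothetical names, and it assumes the browser is happy to run two recorders on the same stream for a moment.

```javascript
// Records `stream` as a series of standalone segments and hands each one
// to onSegment as a Blob. Returns a function that stops recording.
function startSegmentedRecording(stream, onSegment, options = {}) {
  const {
    segmentMs = 10_000,
    // Injectable so the logic can be exercised without a real MediaRecorder.
    createRecorder = (s) => new MediaRecorder(s, { mimeType: 'audio/webm;codecs=opus' }),
  } = options;

  let active = true;
  let current = null;
  let timer = null;

  function beginSegment() {
    const recorder = createRecorder(stream);
    const parts = [];
    recorder.ondataavailable = (event) => {
      if (event.data && event.data.size > 0) parts.push(event.data);
    };
    recorder.onstop = () => {
      // Each recorder wrote its own header, so this Blob is a complete file.
      if (parts.length > 0) onSegment(new Blob(parts, { type: recorder.mimeType }));
    };
    recorder.start(); // no timeslice: data is delivered once, when stop() is called
    current = recorder;
    timer = setTimeout(rotate, segmentMs);
  }

  function rotate() {
    const finished = current;
    beginSegment();  // start the next segment first so no audio is dropped
    finished.stop(); // then flush the previous one
  }

  beginSegment();

  return function stop() {
    if (!active) return;
    active = false;
    clearTimeout(timer);
    current.stop();
  };
}
```

The brief overlap means a few milliseconds of audio can appear in two adjacent segments; the alternative, stopping before starting, risks a small gap instead.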

Base64 is wasteful but practical

Binary WebSocket frames would be more efficient, but base64 over JSON keeps the payload inspectable, works with Socket.io out of the box, and makes debugging easier.

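// Convert the recorded Blob to base64 and send it as a JSON payload.
// socket, blob, sessionId, and duration come from the surrounding code.
const reader = new FileReader();
reader.onloadend = () => {
  // onloadend also fires on failure, in which case result is null.
  if (typeof reader.result !== 'string') return;
  // result is a data URL ("data:audio/webm;...;base64,<payload>"); keep the payload.
  const base64 = reader.result.split(',')[1];
  socket.emit('audio-chunk', {
    sessionId,
    audio: base64,
    format: 'webm',
    duration,
    timestamp: Date.now(),
  });
};
// Attach the handler first, then start reading.
reader.readAsDataURL(blob);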

The roughly 33% size overhead hasn’t been an issue in practice. A 10‑second Opus chunk is tiny.
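
To put a number on “tiny”: padded base64 emits 4 characters for every 3 input bytes. The sketch below works that through for one segment. The 32 kbit/s bitrate is an assumption for illustration only (the post does not state one, and the browser picks its own unless audioBitsPerSecond is set), and WebM container overhead is ignored.

```javascript
// Padded base64 emits 4 output characters for every 3 input bytes, rounded up.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Assumed for illustration only: Opus at 32 kbit/s, 10-second segments.
const ASSUMED_BITRATE_BPS = 32_000;
const SEGMENT_SECONDS = 10;

const rawBytes = (ASSUMED_BITRATE_BPS / 8) * SEGMENT_SECONDS; // 40,000 bytes
const encodedChars = base64Length(rawBytes);                  // 53,336 characters

console.log({ rawBytes, encodedChars, ratio: encodedChars / rawBytes });
```

Under those assumptions the encoding adds about 13 KB per segment, which is small next to the cost of the transcription call itself.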

Mixing two audio sources

If you want both mic and system audio (from a browser tab), you need to mix them. The Web Audio API makes this possible but unintuitive:

const audioContext = new AudioContext();
const destination = audioContext.createMediaStreamDestination();

const micSource = audioContext.createMediaStreamSource(micStream);
const tabSource = audioContext.createMediaStreamSource(tabStream);

micSource.connect(destination);
tabSource.connect(destination);

// destination.stream is your mixed stream

The resulting stream goes into MediaRecorder. Both sides of the conversation end up in one stream. It works better than you’d expect.
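
If one source drowns out the other, a GainNode per input lets you balance levels before they reach the destination. A sketch under assumptions: mixStreams is a hypothetical helper, and the gain values in the usage comment are placeholders, not tuned numbers from this pipeline.

```javascript
// Mixes several MediaStreams into one, with an adjustable gain per input.
// `inputs` is an array of { stream, gain } objects; gain defaults to 1 (unchanged).
function mixStreams(audioContext, inputs) {
  const destination = audioContext.createMediaStreamDestination();

  const gains = inputs.map(({ stream, gain = 1 }) => {
    const source = audioContext.createMediaStreamSource(stream);
    const gainNode = audioContext.createGain();
    gainNode.gain.value = gain;
    // source -> gain -> shared destination
    source.connect(gainNode);
    gainNode.connect(destination);
    return gainNode;
  });

  // Keep the gain nodes so levels can be adjusted while recording.
  return { stream: destination.stream, gains };
}

// Browser usage (not executed here):
//   const audioContext = new AudioContext();
//   const mixed = mixStreams(audioContext, [
//     { stream: micStream, gain: 1.0 },
//     { stream: tabStream, gain: 0.8 },
//   ]);
//   const recorder = new MediaRecorder(mixed.stream);
```

Browsers may create an AudioContext in the suspended state until the user interacts with the page, so calling audioContext.resume() from the click handler that starts recording is a reasonable precaution.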

What I learned about reliability

The stream can die at any time. Chrome’s “Stop sharing” button kills getDisplayMedia streams instantly. Listening for the ended event on every track is mandatory.
Rate limiting saved me from a nasty bug. I use a sliding‑window limiter in Redis: 60 chunks per minute per session. Without it, a buggy client can silently flood the transcription API for hours.
Small chunks are almost always noise. Buffers under 2 KB are filtered before hitting the API, and transcriptions under four words (silence, breathing, keyboard sounds) are discarded. The transcription model isn’t cheap, and garbage in means garbage out.
Reconnection is non‑trivial. WebSocket drops happen. I use exponential backoff with jitter, and the server restores session state from Redis when a client reconnects to a different instance.
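
Several of these rules reduce to small pure functions. The sketch below is illustrative only: the limiter is an in-memory stand-in for the Redis version (a common Redis approach keeps a sorted set of timestamps per session, though the post does not say which one is used here), 2 KB is taken to mean 2048 bytes, the backoff base and cap are assumed values, and all names are hypothetical.

```javascript
// Server: sliding-window rate limiter, in-memory stand-in for Redis.
class SlidingWindowLimiter {
  constructor(limit = 60, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // sessionId -> timestamps of accepted chunks
  }

  // Returns true and records the hit if the session is under its limit.
  allow(sessionId, now = Date.now()) {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(sessionId) ?? []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(sessionId, recent);
    return allowed;
  }
}

// Server: skip chunks too small to contain speech (assumes 2 KB = 2048 bytes).
const MIN_CHUNK_BYTES = 2 * 1024;
function hasEnoughAudio(base64Audio) {
  return Buffer.from(base64Audio, 'base64').length >= MIN_CHUNK_BYTES;
}

// Server: discard transcriptions under four words.
const MIN_TRANSCRIPT_WORDS = 4;
function hasEnoughWords(text) {
  return text.trim().split(/\s+/).filter(Boolean).length >= MIN_TRANSCRIPT_WORDS;
}

// Client: exponential backoff with "full jitter". The delay is a random value
// between 0 and min(capMs, baseMs * 2^attempt). baseMs and capMs are assumptions.
function backoffDelay(attempt, { baseMs = 500, capMs = 30_000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * ceiling;
}
```

On the client, the reconnect loop would wait backoffDelay(attempt) milliseconds before each retry and reset attempt to 0 after a successful connection. The jitter matters because many clients dropped by the same outage would otherwise all retry at the same instants.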

Was it worth building from scratch?

I considered third‑party services that handle the whole pipeline. But owning the audio layer means controlling latency, cost, and what data leaves the app. For a product where those three things matter, it was worth the complexity.

The pipeline now handles thousands of audio chunks per day. Not glamorous code, but it’s the plumbing everything else depends on.
