Sub-200ms Voice AI: Bridging Twilio and OpenAI Realtime API

Published: 1 month ago (March 14, 2026 at 03:01 PM EDT)

Source: Dev.to

The Problem with Traditional Voice AI

The classic pipeline is:

Speech‑to‑Text (STT): 500 ms to 1 s
LLM inference: 500 ms to 2 s
Text‑to‑Speech (TTS): ~500 ms

That adds up to 1.5 – 3.5 seconds of dead air before the agent replies. Humans notice pauses longer than ~300 ms.
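As a quick check on that total, the stage latencies can be summed directly. This is only a worked version of the arithmetic; the `stages` array and `totalLatency` helper are illustrative names, and the numbers are the ranges listed above.

```javascript
// Per-stage latency ranges in milliseconds, taken from the list above.
const stages = [
  { name: "STT", minMs: 500, maxMs: 1000 },
  { name: "LLM", minMs: 500, maxMs: 2000 },
  { name: "TTS", minMs: 500, maxMs: 500 },
];

// The stages run one after another, so their latencies add up.
function totalLatency(stageList) {
  return stageList.reduce(
    (total, s) => ({ minMs: total.minMs + s.minMs, maxMs: total.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 }
  );
}

const budget = totalLatency(stages);
console.log(`${budget.minMs / 1000} s to ${budget.maxMs / 1000} s`); // 1.5 s to 3.5 s
```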

OpenAI Realtime API

OpenAI’s Realtime API replaces the three‑step pipeline with a single WebSocket that accepts raw audio and streams audio back. The model “hears” the audio directly and “speaks” back, eliminating the transcription round‑trip.
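A rough sketch of what opening that single WebSocket involves is shown below. It assumes the preview-era endpoint and the `OpenAI-Beta: realtime=v1` header; the helper name `buildRealtimeConnection` is hypothetical, and the current API reference should be checked before relying on either detail.

```javascript
// Hypothetical helper: returns the URL and headers for the Realtime WebSocket.
// Assumes the preview-era endpoint and beta header; verify against current docs.
function buildRealtimeConnection(apiKey, model = "gpt-4o-realtime-preview") {
  return {
    url: `wss://api.openai.com/v1/realtime?model=${encodeURIComponent(model)}`,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "OpenAI-Beta": "realtime=v1",
    },
  };
}

const conn = buildRealtimeConnection("sk-example"); // placeholder key, not a real one
console.log(conn.url);
```

With the `ws` package, the result would typically be passed as `new WebSocket(conn.url, { headers: conn.headers })`.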

Bridging Twilio Media Streams to OpenAI

When a call arrives on a Twilio number, Twilio opens a Media Stream: a WebSocket that delivers the call audio as JSON messages carrying base64‑encoded µ‑law samples at 8 kHz. Our Node.js server acts as a thin bridge:

Phone Call → Twilio → Media Stream WS → Our Server → OpenAI Realtime WS
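Twilio starts the Media Stream when the number's voice webhook returns TwiML containing a `<Connect><Stream>` instruction. The sketch below builds such a response. The helper name `buildStreamTwiml` and the `/media-stream` path are assumptions for illustration; the article does not show the author's actual webhook.

```javascript
// Hypothetical helper: builds the TwiML that the voice webhook would return.
function buildStreamTwiml(wsUrl) {
  // Escape characters that are special inside an XML attribute value.
  const safeUrl = wsUrl
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;");
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${safeUrl}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// The path "/media-stream" is an assumed route on the bridge server.
const twiml = buildStreamTwiml("wss://realtime.byldr.co/media-stream");
console.log(twiml);
```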

The server’s responsibilities:

Forward incoming audio chunks from Twilio to OpenAI.
Pipe OpenAI’s audio responses back to Twilio.
Perform minimal processing to keep latency low.

Core Bridge Code (JavaScript)

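let streamSid = null; // set when Twilio sends its "start" event

// Twilio → OpenAI
twilioWs.on("message", (data) => {
  const msg = JSON.parse(data);
  if (msg.event === "start") {
    // Remember the stream ID; Twilio needs it on every message we send back.
    streamSid = msg.start.streamSid;
  } else if (msg.event === "media") {
    openaiWs.send(JSON.stringify({
      type: "input_audio_buffer.append",
      audio: msg.media.payload   // already base64 µ-law
    }));
  }
});

// OpenAI → Twilio
openaiWs.on("message", (data) => {
  const event = JSON.parse(data);
  // Skip audio that arrives before we know which stream to address.
  if (event.type === "response.audio.delta" && streamSid) {
    twilioWs.send(JSON.stringify({
      event: "media",
      streamSid,
      media: { payload: event.delta }
    }));
  }
});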

These two handlers move audio in both directions with virtually no overhead.
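Passing the payloads through untouched only works if the OpenAI session is told to accept and produce the same G.711 µ‑law audio that Twilio uses, since the API otherwise defaults to 16‑bit PCM. A minimal sketch of the configuration message follows. The helper name `buildSessionUpdate` is hypothetical, and the field names follow the preview‑era `session.update` event, so they should be checked against the API version in use.

```javascript
// Hypothetical helper: the first message to send once the OpenAI socket is open.
// Field names follow the preview-era session.update event (an assumption here).
function buildSessionUpdate(voice = "ash") {
  return {
    type: "session.update",
    session: {
      input_audio_format: "g711_ulaw",   // matches Twilio's inbound audio
      output_audio_format: "g711_ulaw",  // lets replies go to Twilio unmodified
      voice,
      turn_detection: { type: "server_vad" }, // server decides when the caller is done
    },
  };
}

const sessionUpdate = buildSessionUpdate();
console.log(JSON.stringify(sessionUpdate, null, 2));
```

In the bridge this would be sent once, for example with `openaiWs.send(JSON.stringify(buildSessionUpdate()))` inside the socket's `open` handler.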

Additional Features

Transcription

{
  "input_audio_transcription": { "model": "whisper-1" }
}

This setting enables asynchronous transcription of the caller’s audio. Combined with the transcript events the API emits for the agent’s own speech, it yields a full call transcript without adding to response latency.
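One way to assemble the transcript is to watch for the events that carry finished text. The sketch below assumes the preview‑era event names `conversation.item.input_audio_transcription.completed` (caller) and `response.audio_transcript.done` (agent), each with a `transcript` field. The helper `createTranscriptLog` is hypothetical and not taken from the author's code.

```javascript
// Hypothetical helper: collects finished transcript lines from Realtime events.
function createTranscriptLog() {
  const lines = [];
  return {
    lines,
    handleEvent(event) {
      if (event.type === "conversation.item.input_audio_transcription.completed") {
        // Caller speech, transcribed asynchronously by the configured model.
        lines.push({ speaker: "caller", text: event.transcript.trim() });
      } else if (event.type === "response.audio_transcript.done") {
        // Text of what the agent just said.
        lines.push({ speaker: "agent", text: event.transcript.trim() });
      }
      // All other event types are ignored here.
    },
  };
}

const log = createTranscriptLog();
log.handleEvent({ type: "conversation.item.input_audio_transcription.completed", transcript: "Hi there\n" });
log.handleEvent({ type: "response.audio.delta", delta: "AAAA" });
log.handleEvent({ type: "response.audio_transcript.done", transcript: "Hello! How can I help?" });
console.log(log.lines);
```

Because caller transcription runs asynchronously, a caller line may arrive after the agent line it prompted, so ordering by arrival is only approximate.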

Voice Selection

OpenAI offers several voices; the author chose ash for a deeper, more natural male‑presenting agent. In the author’s assessment, the Realtime API’s voice quality surpasses traditional TTS.

Interruption Handling (Barge‑in)

The Realtime API’s built‑in voice activity detection notices when the caller talks over the agent and stops the response right away, so no custom VAD is required.
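On a phone bridge, audio that has already been forwarded may still be queued on Twilio's side when the caller interrupts. Twilio Media Streams accept a `clear` message that discards buffered audio, so one possible approach is to send it when OpenAI reports `input_audio_buffer.speech_started`. The sketch below is an assumption about how that could fit into the bridge, not something the article shows, and `toTwilioMessages` is a hypothetical helper.

```javascript
// Hypothetical helper: maps one OpenAI event to the messages Twilio should receive.
function toTwilioMessages(event, streamSid) {
  if (event.type === "input_audio_buffer.speech_started") {
    // Caller started talking: ask Twilio to drop audio it has buffered but not played.
    return [{ event: "clear", streamSid }];
  }
  if (event.type === "response.audio.delta") {
    // Normal case: forward the agent's audio chunk unchanged.
    return [{ event: "media", streamSid, media: { payload: event.delta } }];
  }
  return []; // nothing to send for other event types
}

const cleared = toTwilioMessages({ type: "input_audio_buffer.speech_started" }, "MZ123");
console.log(cleared);
```

Each returned object would then be serialized with `JSON.stringify` and sent over the Twilio socket. The stream ID `MZ123` is a made‑up placeholder.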

Deployment Details

Component	Details
Server	Node.js on a $10 / month VPS
Phone	Twilio (inbound + outbound)
AI	OpenAI Realtime API (`gpt-4o-realtime-preview`)
Process manager	PM2
Domain	`realtime.byldr.co` (pointed directly to the VPS, with no Cloudflare proxy, which the author reports did not work well with WebSockets)
SSL	Let’s Encrypt certificate

Latency & Cost

End‑to‑end latency: ~200 ms from the end of the user’s utterance to the start of the agent’s reply, fast enough to feel conversational.
Infrastructure cost: ≈ $15 / month plus per‑minute API usage.
Pricing note: Audio tokens are significantly more expensive than text tokens; for high‑volume scenarios a traditional three‑step pipeline with a cheaper STT (e.g., Deepgram) may be more cost‑effective despite higher latency.

Reliability Considerations

WebSocket reconnection logic is essential. Twilio Media Streams can hiccup; if either socket drops, the bridge must restart gracefully to avoid dead air for the caller.
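The simplest form of graceful handling is to tie the two sockets' lifetimes together, so that a drop on one side never leaves the other open and silent. The sketch below shows only that teardown step. Reconnecting the OpenAI side in the middle of a call is more involved and is not covered here. `linkSocketLifetimes` is a hypothetical helper that assumes both sockets expose `on` and `close` in the style of the `ws` package.

```javascript
// Hypothetical helper: when either socket closes or errors, close the other one.
function linkSocketLifetimes(twilioWs, openaiWs, onTeardown = () => {}) {
  let tornDown = false;
  const teardown = (reason) => {
    if (tornDown) return; // closing one socket fires the other's handler too
    tornDown = true;
    for (const ws of [twilioWs, openaiWs]) {
      try {
        ws.close();
      } catch (_) {
        // Ignore errors from a socket that is already unusable.
      }
    }
    onTeardown(reason); // e.g. log the drop or schedule cleanup
  };
  twilioWs.on("close", () => teardown("twilio closed"));
  twilioWs.on("error", () => teardown("twilio error"));
  openaiWs.on("close", () => teardown("openai closed"));
  openaiWs.on("error", () => teardown("openai error"));
}
```

The `onTeardown` callback is a natural place to record why the call ended, which helps when diagnosing the hiccups mentioned above.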

Conclusion

You don’t need a sophisticated MLOps platform to build voice AI that feels real. A modest VPS, two WebSocket connections, and careful audio piping deliver roughly 200 ms conversational latency and a natural‑sounding agent.