Sub-200ms Voice AI: Bridging Twilio and OpenAI Realtime API

Published: March 14, 2026 at 03:01 PM EDT
3 min read
Source: Dev.to

The Problem with Traditional Voice AI

The classic pipeline is:

  1. Speech‑to‑Text (STT) – 500 ms – 1 s
  2. LLM inference – 500 ms – 2 s
  3. Text‑to‑Speech (TTS) – ~500 ms

That adds up to 1.5 – 3.5 seconds of dead air before the agent replies. Humans notice pauses longer than ~300 ms.
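
The total is just the sum of the per-stage ranges, because the stages run one after another. A small sketch of that arithmetic, treating the "~500 ms" TTS figure as exactly 500 ms (the `stages` structure and `pipelineLatency` name are illustrative, not from the original project):

```javascript
// Per-stage latency ranges in milliseconds, taken from the list above.
const stages = [
  { name: "STT", minMs: 500, maxMs: 1000 },
  { name: "LLM", minMs: 500, maxMs: 2000 },
  { name: "TTS", minMs: 500, maxMs: 500 },
];

// The stages are sequential, so the best and worst cases simply add up.
function pipelineLatency(stageList) {
  return stageList.reduce(
    (total, s) => ({ minMs: total.minMs + s.minMs, maxMs: total.maxMs + s.maxMs }),
    { minMs: 0, maxMs: 0 }
  );
}

const budget = pipelineLatency(stages);
console.log(`${budget.minMs / 1000} s to ${budget.maxMs / 1000} s`); // prints "1.5 s to 3.5 s"
```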

OpenAI Realtime API

OpenAI’s Realtime API replaces the three‑step pipeline with a single WebSocket that accepts raw audio and streams audio back. The model “hears” the audio directly and “speaks” back, eliminating the transcription round‑trip.
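
For orientation, here is a rough sketch of what opening that WebSocket involves. The endpoint URL and the `OpenAI-Beta` header reflect the preview-era API as I understand it and should be treated as assumptions; check the current OpenAI documentation before relying on them. The helper name `realtimeConnectionParams` and the placeholder key are made up for illustration.

```javascript
// Build the URL and headers for a Realtime API WebSocket connection.
// ASSUMPTION: endpoint and "OpenAI-Beta" header match the preview-era API.
function realtimeConnectionParams(apiKey, model = "gpt-4o-realtime-preview") {
  const url = new URL("wss://api.openai.com/v1/realtime");
  url.searchParams.set("model", model);
  return {
    url: url.toString(),
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "OpenAI-Beta": "realtime=v1",
    },
  };
}

const params = realtimeConnectionParams("sk-example"); // placeholder, not a real key
console.log(params.url);
// With the "ws" package, the socket would then be opened roughly like this:
//   const openaiWs = new WebSocket(params.url, { headers: params.headers });
```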

Bridging Twilio Media Streams to OpenAI

When a call arrives on a Twilio number, Twilio opens a Media Stream – a WebSocket that sends raw audio packets (µ‑law, 8 kHz). Our Node.js server acts as a thin bridge:

Phone Call → Twilio → Media Stream WS → Our Server → OpenAI Realtime WS
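
Twilio opens the Media Stream because the webhook for the incoming call answers with TwiML containing a `<Connect><Stream>` element, which is the bidirectional form of Media Streams. The sketch below builds such a response as a plain string. The `/media-stream` path and the function names are hypothetical; only the domain comes from this post.

```javascript
// Escape characters that are not allowed inside an XML attribute value.
function escapeXmlAttr(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Return TwiML telling Twilio to open a bidirectional Media Stream to wsUrl.
function twimlForStream(wsUrl) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXmlAttr(wsUrl)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

// The path below is a placeholder for whatever route the bridge listens on.
console.log(twimlForStream("wss://realtime.byldr.co/media-stream"));
```

The webhook handler would return this string with a `text/xml` content type.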

The server’s responsibilities:

  • Forward incoming audio chunks from Twilio to OpenAI.
  • Pipe OpenAI’s audio responses back to Twilio.
  • Perform minimal processing to keep latency low.

Core Bridge Code (JavaScript)

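// Assumes twilioWs and openaiWs are already-open WebSocket connections.
let streamSid = null; // filled in from Twilio's "start" event

// Twilio → OpenAI
twilioWs.on("message", (data) => {
  const msg = JSON.parse(data);
  if (msg.event === "start") {
    // Twilio announces the stream ID once, before any media arrives
    streamSid = msg.start.streamSid;
  } else if (msg.event === "media") {
    openaiWs.send(JSON.stringify({
      type: "input_audio_buffer.append",
      audio: msg.media.payload   // already base64 µ-law
    }));
  }
});

// OpenAI → Twilio
openaiWs.on("message", (data) => {
  const event = JSON.parse(data);
  // Skip audio until we know which Twilio stream to address
  if (event.type === "response.audio.delta" && streamSid) {
    twilioWs.send(JSON.stringify({
      event: "media",
      streamSid,
      media: { payload: event.delta }
    }));
  }
});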

That loop moves audio in both directions with virtually no overhead.
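
Passing payloads through untouched only works if the Realtime session is told to use the same codec as Twilio. A minimal sketch of a `session.update` event that requests G.711 µ-law in both directions follows. The field names match the preview-era API as far as I know, and the `buildSessionUpdate` helper and its `instructions` parameter are illustrative rather than the original project's code.

```javascript
// Build a "session.update" event asking the Realtime API to use G.711 µ-law
// for input and output, matching Twilio's 8 kHz µ-law Media Stream audio.
// ASSUMPTION: field names follow the preview-era Realtime API.
function buildSessionUpdate({ voice = "ash", instructions = "" } = {}) {
  return {
    type: "session.update",
    session: {
      input_audio_format: "g711_ulaw",
      output_audio_format: "g711_ulaw",
      voice,
      instructions,
      // Let the API decide when the caller has started and stopped talking.
      turn_detection: { type: "server_vad" },
    },
  };
}

console.log(JSON.stringify(buildSessionUpdate({ instructions: "Be brief." }), null, 2));
// Typically sent once, right after the OpenAI socket opens:
//   openaiWs.on("open", () => openaiWs.send(JSON.stringify(buildSessionUpdate())));
```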

Additional Features

Transcription

{
  "input_audio_transcription": { "model": "whisper-1" }
}

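Adding this setting to the session configuration enables asynchronous transcription of the caller's audio. Combined with the transcript events the API emits for the agent's own speech, it provides a full call transcript without adding to response latency.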
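
To turn those events into a transcript, the bridge can watch for the "completed" and "done" transcript events on the OpenAI socket. The sketch below assumes the preview-era event names `conversation.item.input_audio_transcription.completed` and `response.audio_transcript.done`; the collector itself is a hypothetical helper, not part of the original project.

```javascript
// Collect finished transcript segments from Realtime API server events.
// ASSUMPTION: event names follow the preview-era Realtime API.
function createTranscriptCollector() {
  const lines = [];
  return {
    // Feed every parsed server event here; unrelated events are ignored.
    handle(event) {
      if (event.type === "conversation.item.input_audio_transcription.completed") {
        lines.push({ speaker: "caller", text: (event.transcript || "").trim() });
      } else if (event.type === "response.audio_transcript.done") {
        lines.push({ speaker: "agent", text: (event.transcript || "").trim() });
      }
    },
    // Render the transcript as one "speaker: text" line per segment.
    toString() {
      return lines.map((l) => `${l.speaker}: ${l.text}`).join("\n");
    },
  };
}

const transcript = createTranscriptCollector();
transcript.handle({ type: "response.audio_transcript.done", transcript: "Hi, how can I help?" });
console.log(transcript.toString());
```

Because caller transcription runs asynchronously, a caller segment can arrive after the agent's reply to it, so lines appear in arrival order rather than strict speaking order.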

Voice Selection

OpenAI offers several voices; the author chose "ash" for a deeper, more natural male‑presenting agent and found the Realtime API’s voice quality better than that of traditional TTS.

Interruption Handling (Barge‑in)

With server-side voice activity detection enabled, the Realtime API detects when the caller talks over the agent and cuts its response short, so no custom VAD is required.
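
One caveat in a Twilio bridge: audio that was already forwarded to Twilio may still be queued for playback when the caller interrupts. A common approach is to send Twilio a `clear` message when the API reports `input_audio_buffer.speech_started`. The sketch below shows that mapping as a pure function. The function name is hypothetical, and this is one possible way to handle it rather than the original project's code.

```javascript
// Map a Realtime API server event to the Twilio message that should follow it.
// Returns null when nothing needs to be sent to Twilio.
function twilioMessageFor(event, streamSid) {
  if (!streamSid) return null; // Twilio stream not started yet
  switch (event.type) {
    case "response.audio.delta":
      // Agent audio: forward the base64 payload unchanged.
      return { event: "media", streamSid, media: { payload: event.delta } };
    case "input_audio_buffer.speech_started":
      // Caller started talking: ask Twilio to drop audio it has queued.
      return { event: "clear", streamSid };
    default:
      return null;
  }
}

console.log(twilioMessageFor({ type: "input_audio_buffer.speech_started" }, "MZ123"));
// In the bridge this would replace the inline forwarding logic, roughly:
//   const out = twilioMessageFor(event, streamSid);
//   if (out) twilioWs.send(JSON.stringify(out));
```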

Deployment Details

Component        Details
Server           Node.js on a $10 / month VPS
Phone            Twilio (inbound + outbound)
AI               OpenAI Realtime API (gpt-4o-realtime-preview)
Process manager  PM2
Domain           realtime.byldr.co (pointed directly to VPS; no Cloudflare proxy because WebSockets don’t work well through it)
SSL              Let’s Encrypt certificate

Latency & Cost

  • End‑to‑end latency: ~200 ms from the end of the user’s utterance to the start of the agent’s reply—fast enough to feel conversational.
  • Infrastructure cost: ≈ $15 / month plus per‑minute API usage.
  • Pricing note: Audio tokens are significantly more expensive than text tokens; for high‑volume scenarios a traditional three‑step pipeline with a cheaper STT (e.g., Deepgram) may be more cost‑effective despite higher latency.
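
A figure like the latency number above can be approximated from inside the bridge by timing the gap between the API reporting that the caller stopped speaking and the first audio chunk of the reply. The sketch below assumes the preview-era event names `input_audio_buffer.speech_stopped` and `response.audio.delta`; the meter is a hypothetical helper, and it does not capture the network legs between the bridge, Twilio, and the caller's phone.

```javascript
// Measure the gap between the caller going quiet and the first reply audio.
// "now" is injectable so the meter can be exercised with a fake clock.
function createLatencyMeter(now = () => Date.now()) {
  let speechStoppedAt = null;
  const samplesMs = [];
  return {
    // Feed every parsed Realtime API server event here.
    handle(event) {
      if (event.type === "input_audio_buffer.speech_stopped") {
        speechStoppedAt = now();
      } else if (event.type === "response.audio.delta" && speechStoppedAt !== null) {
        samplesMs.push(now() - speechStoppedAt);
        speechStoppedAt = null; // only the first delta of each reply counts
      }
    },
    // Return a copy of the recorded gaps, one per measured turn.
    samples() {
      return [...samplesMs];
    },
  };
}

const meter = createLatencyMeter();
meter.handle({ type: "input_audio_buffer.speech_stopped" });
meter.handle({ type: "response.audio.delta", delta: "AAAA" });
console.log(meter.samples().length); // one sample recorded
```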

Reliability Considerations

WebSocket reconnection logic is essential. Twilio Media Streams can hiccup; if either socket drops, the bridge must restart gracefully to avoid dead air for the caller.
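
As a rough illustration of the reconnection side, here is a small exponential-backoff helper that could wrap the code that reopens the OpenAI socket. Everything in it (names, delays, attempt counts) is an assumption chosen for the example, and the delays are kept short because a caller is waiting on the line.

```javascript
// Delay before reconnect attempt N (0-based): baseMs * 2^N, capped at maxMs.
function backoffDelayMs(attempt, { baseMs = 100, maxMs = 2000 } = {}) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call connect() until it resolves or the attempts run out.
// connect is any async function that opens a socket and rejects on failure.
async function connectWithRetry(
  connect,
  { maxAttempts = 5, sleep = defaultSleep, ...backoff } = {}
) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // Wait before the next try, but not after the final failure.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, backoff));
      }
    }
  }
  throw lastError;
}

console.log([0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n)));
// prints [ 100, 200, 400, 800, 1600, 2000 ]
```

Only the OpenAI side can be re-dialed this way. The Twilio Media Stream is opened by Twilio, so if that socket drops the bridge has to clean up and wait for Twilio to connect again.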

Conclusion

You don’t need a sophisticated MLOps platform to build voice AI that feels real. A modest VPS, two WebSocket connections, and careful audio piping deliver conversational latency of roughly 200 ms and a natural‑sounding agent.
