Sub-200ms Voice AI: Bridging Twilio and OpenAI Realtime API
Source: Dev.to
The Problem with Traditional Voice AI
The classic pipeline is:
- Speech‑to‑Text (STT) – 500 ms – 1 s
- LLM inference – 500 ms – 2 s
- Text‑to‑Speech (TTS) – ~500 ms
That adds up to 1.5 – 3.5 seconds of dead air before the agent replies. Humans notice pauses longer than ~300 ms.
OpenAI Realtime API
OpenAI’s Realtime API replaces the three‑step pipeline with a single WebSocket that accepts raw audio and streams audio back. The model “hears” the audio directly and “speaks” back, eliminating the transcription round‑trip.
Bridging Twilio Media Streams to OpenAI
When a call arrives on a Twilio number, Twilio opens a Media Stream – a WebSocket that sends raw audio packets (µ‑law, 8 kHz). Our Node.js server acts as a thin bridge:
Phone Call → Twilio → Media Stream WS → Our Server → OpenAI Realtime WSThe server’s responsibilities:
- Forward incoming audio chunks from Twilio to OpenAI.
- Pipe OpenAI’s audio responses back to Twilio.
- Perform minimal processing to keep latency low.
Core Bridge Code (JavaScript)
// Twilio → OpenAI
twilioWs.on("message", (data) => {
const msg = JSON.parse(data);
if (msg.event === "media") {
openaiWs.send(JSON.stringify({
type: "input_audio_buffer.append",
audio: msg.media.payload // already base64 µ-law
}));
}
});
// OpenAI → Twilio
openaiWs.on("message", (data) => {
const event = JSON.parse(data);
if (event.type === "response.audio.delta") {
twilioWs.send(JSON.stringify({
event: "media",
streamSid: streamSid,
media: { payload: event.delta }
}));
}
});That loop moves audio in both directions with virtually no overhead.
Additional Features
Transcription
{
"input_audio_transcription": { "model": "whisper-1" }
}Enables asynchronous transcription of both sides, providing a full call transcript without adding to response latency.
Voice Selection
OpenAI offers several voices; the author chose ash for a deeper, more natural male‑presenting agent. The Realtime API’s voice quality surpasses traditional TTS.
Interruption Handling (Barge‑in)
The Realtime API natively detects when the caller talks over the agent, stopping playback instantly—no custom VAD required.
Deployment Details
| Component | Details |
|---|---|
| Server | Node.js on a $10 / month VPS |
| Phone | Twilio (inbound + outbound) |
| AI | OpenAI Realtime API (gpt-4o-realtime-preview) |
| Process manager | PM2 |
| Domain | realtime.byldr.co (pointed directly to VPS; no Cloudflare proxy because WebSockets don’t work well through it) |
| SSL | Let’s Encrypt certificate |
Latency & Cost
- End‑to‑end latency: ~200 ms from the end of the user’s utterance to the start of the agent’s reply—fast enough to feel conversational.
- Infrastructure cost: ≈ $15 / month plus per‑minute API usage.
- Pricing note: Audio tokens are significantly more expensive than text tokens; for high‑volume scenarios a traditional three‑step pipeline with a cheaper STT (e.g., Deepgram) may be more cost‑effective despite higher latency.
Reliability Considerations
WebSocket reconnection logic is essential. Twilio Media Streams can hiccup; if either socket drops, the bridge must restart gracefully to avoid dead air for the caller.
Conclusion
You don’t need a sophisticated MLOps platform to build voice AI that feels real. A modest VPS, two WebSocket connections, and careful audio piping deliver sub‑200 ms conversational latency and a natural‑sounding agent.