How to Implement Voice AI with Twilio and VAPI: A Step-by-Step Guide
Source: Dev.to
TL;DR
Most Twilio + VAPI integrations break because developers try to merge incompatible audio streams.
Fix: Use Twilio for telephony transport (PSTN → WebSocket) and VAPI for AI processing (STT → LLM → TTS). Build a proxy server that bridges Twilio’s Media Streams to VAPI’s WebSocket protocol, handling µ‑law ↔ PCM conversion and bidirectional audio flow. The result is a production‑grade voice AI that handles real phone calls without audio glitches or dropped connections.
Prerequisites
API Access & Authentication
- VAPI API key (dashboard.vapi.ai)
- Twilio Account SID and Auth Token (console.twilio.com)
- Twilio phone number with Voice capabilities enabled
- Node.js 18+ (for webhook server)
System Requirements
- Public HTTPS endpoint (e.g., `ngrok http 3000` for local dev)
- SSL certificate (Twilio rejects non‑HTTPS webhooks)
- Minimum 512 MB RAM for the Node.js process
- Port 3000 open for webhook traffic
Technical Knowledge
- Familiarity with REST APIs and webhook patterns
- Basic TwiML (Twilio Markup Language) knowledge
- Experience with `async/await` in JavaScript
- Understanding of WebSocket connections for real‑time streaming
Cost Awareness
- Twilio voice calls: $0.0085 /min
- VAPI (GPT‑4 model): ≈ $0.03 /min
- Expected combined cost: $0.04–$0.05 /min for production traffic
VAPI: Get started → Get VAPI
Step‑By‑Step Tutorial
Configuration & Setup
Most Twilio + VAPI integrations fail because developers try to merge two incompatible call flows.
Reality: Twilio handles telephony (SIP, PSTN routing); VAPI handles voice AI (STT, LLM, TTS). They don’t “integrate” directly—you bridge them.
Architecture decision: Choose inbound (Twilio receives → forwards to VAPI) or outbound (VAPI initiates → uses Twilio as carrier). This guide covers inbound only.
Install dependencies
npm install @vapi-ai/web express twilio ws
Critical config:
- VAPI needs a public webhook endpoint.
- Twilio needs TwiML instructions.
These are separate responsibilities.
Architecture & Flow
flowchart LR
A[Caller] -->|PSTN| B[Twilio Number]
B -->|TwiML Stream| C[Your Server]
C -->|WebSocket| D[VAPI Assistant]
D -->|AI Response| C
C -->|Audio Stream| B
B -->|PSTN| A
Inbound flow:
- Twilio receives the call and executes your TwiML webhook.
- Audio streams to your server via Twilio Media Streams (example messages below).
- Your server forwards the audio to VAPI over a WebSocket.
- VAPI processes the audio (STT → LLM → TTS).
- The generated audio streams back through the same chain to the caller.
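For reference, the JSON messages Twilio sends over the Media Streams WebSocket look roughly like this (abridged: the payload is base64‑encoded µ‑law audio, and fields such as sequence numbers and timestamps are omitted):

// Abridged Twilio Media Streams messages (shapes only; "MZ…"/"CA…" are placeholder SIDs)
{ "event": "start", "streamSid": "MZ…", "start": { "callSid": "CA…", "streamSid": "MZ…", "mediaFormat": { "encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1 } } }
{ "event": "media", "streamSid": "MZ…", "media": { "payload": "<base64 µ-law audio>" } }
{ "event": "stop", "streamSid": "MZ…" }

The start message is where the bridge in Step 3 picks up callSid and streamSid; media messages then arrive in roughly 20 ms chunks for the duration of the call.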
Step‑By‑Step Implementation
1. Create VAPI Assistant
Create an assistant via the VAPI dashboard (vapi.ai → Assistants → Create) or the API (a code sketch follows the settings below). Recommended settings for low latency:
- Model: GPT‑4 (lower latency than GPT‑4‑turbo for voice)
- Voice: ElevenLabs (≈ 150 ms)
- Transcriber: Deepgram Nova‑2 with `endpointing = 300 ms` (silence threshold)
Production warning: The default 200 ms endpointing can cause false interruptions on mobile networks. Increase to 300–400 ms.
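If you prefer to create the assistant from code instead of the dashboard, a minimal sketch against VAPI's Create Assistant endpoint (https://api.vapi.ai/assistant at the time of writing) is below. The field names (model, voice, transcriber, endpointing) and the voiceId shown mirror the settings above but are illustrative; verify them against the current VAPI API reference.

// create-assistant.js — sketch only; verify field names against the VAPI API docs
(async () => {
  const response = await fetch('https://api.vapi.ai/assistant', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      name: 'Inbound Support Bot',                      // hypothetical name
      model: { provider: 'openai', model: 'gpt-4' },    // per the settings above
      voice: { provider: '11labs', voiceId: 'rachel' }, // voiceId is illustrative
      transcriber: {
        provider: 'deepgram',
        model: 'nova-2',
        endpointing: 300                                // ms, per the production warning
      }
    })
  });
  const assistant = await response.json();
  console.log('Assistant ID:', assistant.id);           // set this as VAPI_ASSISTANT_ID
})();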
2. Set Up Twilio TwiML Webhook
Create an Express endpoint that returns TwiML with a <Connect> element. Twilio will stream µ‑law audio to the URL you provide.
// server.js
const express = require('express');
const app = express();

app.post('/twilio/voice', (req, res) => {
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://yourdomain.com/media-stream"/>
  </Connect>
</Response>`;
  res.type('text/xml');
  res.send(twiml);
});

app.listen(3000, () => console.log('Server listening on port 3000'));
Note: `wss://yourdomain.com/media-stream` is your WebSocket server (implemented in the next step), not a VAPI endpoint. Twilio streams µ‑law audio here; in this guide the bridge listens on port 8080, so route this URL to it (for example via a reverse proxy).
3. Bridge Twilio Stream to VAPI
A simple WebSocket bridge that forwards audio between Twilio and VAPI, handling the start event and bidirectional media flow.
// bridge.js
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (twilioWs) => {
  let vapiWs = null;
  let streamSid = null;        // required on every media message sent back to Twilio
  const pendingAudio = [];

  twilioWs.on('message', (msg) => {
    const data = JSON.parse(msg);

    if (data.event === 'start') {
      streamSid = data.start.streamSid;
      // Initialise VAPI connection
      vapiWs = new WebSocket('wss://api.vapi.ai/ws');
      vapiWs.on('open', () => {
        vapiWs.send(JSON.stringify({
          type: 'assistant-request',
          assistantId: process.env.VAPI_ASSISTANT_ID,
          metadata: { callSid: data.start.callSid }
        }));
        // Flush any buffered audio
        while (pendingAudio.length) {
          vapiWs.send(JSON.stringify({ type: 'audio', data: pendingAudio.shift() }));
        }
      });
      // Forward VAPI audio back to Twilio
      vapiWs.on('message', (vapiMsg) => {
        const audio = JSON.parse(vapiMsg);
        if (audio.type === 'audio') {
          twilioWs.send(JSON.stringify({
            event: 'media',
            streamSid,                     // Twilio rejects outbound media without it
            media: { payload: audio.data }
          }));
        }
      });
    }

    if (data.event === 'media' && vapiWs && vapiWs.readyState === WebSocket.OPEN) {
      // Forward Twilio audio to VAPI
      vapiWs.send(JSON.stringify({ type: 'audio', data: data.media.payload }));
    } else if (data.event === 'media') {
      // Buffer until VAPI connection is ready
      pendingAudio.push(data.media.payload);
    }
  });
});
Race‑condition warning: If Twilio sends audio before the VAPI WebSocket is open, buffer the packets (as shown) to avoid loss.
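The bridge also needs teardown when the call ends, otherwise VAPI connections leak. The fragments below assume the twilioWs / vapiWs variables from bridge.js and note in comments where each piece belongs:

// bridge.js additions — call teardown (fragments, not a standalone file)
// Inside the existing twilioWs.on('message') handler:
if (data.event === 'stop' && vapiWs) {
  vapiWs.close(); // Twilio ended the stream; close the VAPI leg
}

// Alongside the other twilioWs handlers:
twilioWs.on('close', () => {
  if (vapiWs && vapiWs.readyState === WebSocket.OPEN) vapiWs.close();
});

// Inside the 'start' handler, after creating vapiWs:
vapiWs.on('close', () => {
  if (twilioWs.readyState === WebSocket.OPEN) twilioWs.close();
});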
4. Configure Twilio Phone Number
In the Twilio Console:
- Phone Numbers → Active Numbers → [your number] → Voice Configuration
- Set the A Call Comes In webhook URL to `https://yourdomain.com/twilio/voice` (HTTP POST); this can also be done programmatically, as sketched below.
- If testing locally, expose the server with `ngrok http 3000` and use the generated HTTPS URL.
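The same configuration can be applied from code with the Twilio Node helper library (already in the npm install above). The phone‑number SID below is a placeholder; look yours up in the Console or via the API.

// configure-number.js — set the voice webhook programmatically
const twilio = require('twilio');
const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

async function configureNumber() {
  const number = await client
    .incomingPhoneNumbers('PNxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') // placeholder SID
    .update({
      voiceUrl: 'https://yourdomain.com/twilio/voice',
      voiceMethod: 'POST'
    });
  console.log('Voice webhook set for', number.phoneNumber);
}

configureNumber().catch(console.error);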
Error Handling & Edge Cases
- Twilio timeout (15 s): If VAPI doesn’t respond, Twilio hangs up. Send a keep‑alive ping to VAPI every 10 s.
- Audio format mismatch: Twilio streams µ‑law 8 kHz; VAPI expects PCM 16 kHz. Either transcode on the bridge (see the sketch after this list) or configure VAPI's transcriber to accept µ‑law (if supported).
- Barge‑in: When the user interrupts, send `{ type: 'cancel' }` to VAPI and flush Twilio's audio buffer to stop the current TTS playback.
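If you do transcode on the bridge, a minimal sketch of the inbound leg (Twilio µ‑law 8 kHz → 16‑bit PCM 16 kHz) is shown below. It uses the standard G.711 µ‑law expansion plus naive sample‑duplication upsampling, which is adequate for a prototype. Whether VAPI accepts base64 PCM in exactly this shape is an assumption to check against its docs, and the outbound leg (PCM → µ‑law, downsampling) follows the same pattern in reverse.

// transcode.js — µ-law 8 kHz (Twilio) → 16-bit PCM 16 kHz
function muLawToLinear(muLawByte) {
  // Standard G.711 µ-law expansion
  const BIAS = 0x84;
  const u = ~muLawByte & 0xff;
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return sign ? -magnitude : magnitude;
}

function transcodeMulaw8kToPcm16k(base64MuLaw) {
  const muLaw = Buffer.from(base64MuLaw, 'base64');
  // 2 bytes per 16-bit sample, and twice the samples for 8 kHz → 16 kHz
  const pcm = Buffer.alloc(muLaw.length * 4);
  for (let i = 0; i < muLaw.length; i++) {
    const sample = muLawToLinear(muLaw[i]);
    pcm.writeInt16LE(sample, i * 4);     // original sample
    pcm.writeInt16LE(sample, i * 4 + 2); // duplicated (nearest-neighbour upsampling)
  }
  return pcm.toString('base64');
}

module.exports = { muLawToLinear, transcodeMulaw8kToPcm16k };

In bridge.js this would wrap the forwarded payload, e.g. transcodeMulaw8kToPcm16k(data.media.payload), before sending to VAPI.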
Testing & Validation
- Call the Twilio number.
- Verify in logs:
- TwiML webhook hit (200 response)
- WebSocket connection established
- VAPI assistant initialized
- Bidirectional audio packets flowing
- Latency benchmark: Measure the time from the end of user speech to the start of the bot response. Aim for roughly 1200 ms or less; anything much higher feels broken to callers. A rough way to capture this from the bridge is sketched below.
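One rough way to get that number without extra tooling is a voice‑activity heuristic in the bridge: track the last time the inbound Twilio audio was above a speech‑energy threshold, then log the delta when the first VAPI audio frame of the reply arrives. The threshold below is a guess to tune per environment, and the function names are hypothetical hooks you would call from the 'media' branch and the VAPI audio handler in bridge.js.

// latency-probe.js — rough end-of-speech → bot-audio latency at the bridge
// Reuses muLawToLinear() from the transcoding sketch above.
const { muLawToLinear } = require('./transcode');

const SPEECH_RMS_THRESHOLD = 1000; // linear-PCM RMS; heuristic, tune per environment
let lastSpeechAt = 0;
let awaitingBotAudio = false;

// Call from the 'media' branch with data.media.payload
function onTwilioMediaFrame(base64MuLaw) {
  const bytes = Buffer.from(base64MuLaw, 'base64');
  let sumSquares = 0;
  for (const b of bytes) {
    const s = muLawToLinear(b);
    sumSquares += s * s;
  }
  const rms = Math.sqrt(sumSquares / bytes.length);
  if (rms > SPEECH_RMS_THRESHOLD) {
    lastSpeechAt = Date.now(); // caller is (probably) speaking
    awaitingBotAudio = true;   // the next bot audio frame closes this turn
  }
}

// Call whenever a VAPI audio frame arrives
function onVapiAudioFrame() {
  if (awaitingBotAudio && lastSpeechAt) {
    console.log(`turn latency ≈ ${Date.now() - lastSpeechAt} ms`);
    awaitingBotAudio = false;
  }
}

module.exports = { onTwilioMediaFrame, onVapiAudioFrame };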
Common Issues & Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| No audio from bot | VAPI sends PCM while Twilio expects µ‑law | Add a transcoding layer or use VAPI’s own telephony provider (bypasses Twilio). |
| Bot cuts off mid‑sentence | VAD endpointing too low | Increase transcriber.endpointing to 400 ms. |
| Webhook fails | Twilio requires HTTPS | Use ngrok for local testing or deploy with a valid SSL certificate. |
System Diagram
graph LR
Phone[Phone Call]
Gateway[Call Gateway]
IVR[Interactive Voice Response]
STT[Speech‑to‑Text]
NLU[Intent Detection]
LLM[Response Generation]
TTS[Text‑to‑Speech]
Error[Error Handling]
Output[Call Output]
Phone --> Gateway
Gateway --> IVR
IVR --> STT
STT --> NLU
NLU --> LLM
LLM --> TTS
TTS --> Output
Gateway -->|Call Drop/Error| Error