How to Implement Voice AI with Twilio and VAPI: A Step-by-Step Guide
Source: Dev.to
TL;DR
Most Twilio + VAPI integrations break because developers try to merge incompatible audio streams.
Fix: Use Twilio for telephony transport (PSTN → WebSocket) and VAPI for AI processing (STT → LLM → TTS). Build a proxy server that bridges Twilio’s Media Streams to VAPI’s WebSocket protocol, handling µ‑law ↔ PCM conversion and bidirectional audio flow. The result is a production‑grade voice AI that handles real phone calls without audio glitches or dropped connections.
Prerequisites
API Access & Authentication
- VAPI API key (dashboard.vapi.ai)
- Twilio Account SID and Auth Token (console.twilio.com)
- Twilio phone number with Voice capabilities enabled
- Node.js 18+ (for webhook server)
System Requirements
- Public HTTPS endpoint (e.g., `ngrok http 3000` for local dev)
- SSL certificate (Twilio rejects non‑HTTPS webhooks)
- Minimum 512 MB RAM for the Node.js process
- Port 3000 open for webhook traffic
Technical Knowledge
- Familiarity with REST APIs and webhook patterns
- Basic TwiML (Twilio Markup Language) knowledge
- Experience with `async/await` in JavaScript
- Understanding of WebSocket connections for real‑time streaming
Cost Awareness
- Twilio voice calls: $0.0085 /min
- VAPI (GPT‑4 model): ≈ $0.03 /min
- Expected combined cost: $0.04–$0.05 /min for production traffic
VAPI: Get started → Get VAPI
Step‑By‑Step Tutorial
Configuration & Setup
Most Twilio + VAPI integrations fail because developers try to merge two incompatible call flows.
Reality: Twilio handles telephony (SIP, PSTN routing); VAPI handles voice AI (STT, LLM, TTS). They don’t “integrate” directly—you bridge them.
Architecture decision: Choose inbound (Twilio receives → forwards to VAPI) or outbound (VAPI initiates → uses Twilio as carrier). This guide covers inbound only.
Install dependencies
npm install @vapi-ai/web express twilio ws
Critical config:
- VAPI needs a public webhook endpoint.
- Twilio needs TwiML instructions.
These are separate responsibilities.
Architecture & Flow
flowchart LR
A[Caller] -->|PSTN| B[Twilio Number]
B -->|TwiML Stream| C[Your Server]
C -->|WebSocket| D[VAPI Assistant]
D -->|AI Response| C
C -->|Audio Stream| B
B -->|PSTN| A
Inbound flow:
- Twilio receives the call and executes your TwiML webhook.
- Audio streams to your server via Twilio Media Streams (example messages below).
- Your server forwards the audio to VAPI over a WebSocket.
- VAPI processes the audio (STT → LLM → TTS).
- The generated audio streams back through the same chain to the caller.
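For reference, the JSON messages Twilio sends over the Media Streams WebSocket look roughly like this (abridged: the payload is base64‑encoded µ‑law audio, and fields such as sequence numbers and timestamps are omitted):

// Abridged Twilio Media Streams messages (shapes only; "MZ…"/"CA…" are placeholder SIDs)
{ "event": "start", "streamSid": "MZ…", "start": { "callSid": "CA…", "streamSid": "MZ…", "mediaFormat": { "encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1 } } }
{ "event": "media", "streamSid": "MZ…", "media": { "payload": "<base64 µ-law audio>" } }
{ "event": "stop", "streamSid": "MZ…" }

The start message is where the bridge in Step 3 picks up callSid and streamSid; media messages then arrive in roughly 20 ms chunks for the duration of the call.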
Step‑By‑Step Implementation
1. Create VAPI Assistant
Create an assistant via the VAPI dashboard (vapi.ai → Assistants → Create) or the API (a code sketch follows the settings below). Recommended settings for low latency:
- Model: GPT‑4 (lower latency than GPT‑4‑turbo for voice)
- Voice: ElevenLabs (≈ 150 ms)
- Transcriber: Deepgram Nova‑2 with `endpointing = 300 ms` (silence threshold)
Production warning: The default 200 ms endpointing can cause false interruptions on mobile networks. Increase to 300–400 ms.
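If you prefer to create the assistant from code instead of the dashboard, a minimal sketch against VAPI's Create Assistant endpoint (https://api.vapi.ai/assistant at the time of writing) is below. The field names (model, voice, transcriber, endpointing) and the voiceId shown mirror the settings above but are illustrative; verify them against the current VAPI API reference.

// create-assistant.js — sketch only; verify field names against the VAPI API docs
(async () => {
  const response = await fetch('https://api.vapi.ai/assistant', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      name: 'Inbound Support Bot',                      // hypothetical name
      model: { provider: 'openai', model: 'gpt-4' },    // per the settings above
      voice: { provider: '11labs', voiceId: 'rachel' }, // voiceId is illustrative
      transcriber: {
        provider: 'deepgram',
        model: 'nova-2',
        endpointing: 300                                // ms, per the production warning
      }
    })
  });
  const assistant = await response.json();
  console.log('Assistant ID:', assistant.id);           // set this as VAPI_ASSISTANT_ID
})();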
2. Set Up Twilio TwiML Webhook
Create an Express endpoint that returns TwiML with a <Connect> element. Twilio will stream µ‑law audio to the URL you provide.
// server.js
const express = require('express');
const app = express();

app.post('/twilio/voice', (req, res) => {
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://yourdomain.com/media-stream"/>
  </Connect>
</Response>`;
  res.type('text/xml');
  res.send(twiml);
});

app.listen(3000, () => console.log('Server listening on port 3000'));
Note: `wss://yourdomain.com/media-stream` is your WebSocket server (implemented in the next step), not a VAPI endpoint. Twilio streams µ‑law audio here; in this guide the bridge listens on port 8080, so route this URL to it (for example via a reverse proxy).
3. Bridge Twilio Stream to VAPI
A simple WebSocket bridge that forwards audio between Twilio and VAPI, handling the start event and bidirectional media flow.
// bridge.js
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (twilioWs) => {
  let vapiWs = null;
  let streamSid = null;        // required on every media message sent back to Twilio
  const pendingAudio = [];

  twilioWs.on('message', (msg) => {
    const data = JSON.parse(msg);

    if (data.event === 'start') {
      streamSid = data.start.streamSid;
      // Initialise VAPI connection
      vapiWs = new WebSocket('wss://api.vapi.ai/ws');
      vapiWs.on('open', () => {
        vapiWs.send(JSON.stringify({
          type: 'assistant-request',
          assistantId: process.env.VAPI_ASSISTANT_ID,
          metadata: { callSid: data.start.callSid }
        }));
        // Flush any buffered audio
        while (pendingAudio.length) {
          vapiWs.send(JSON.stringify({ type: 'audio', data: pendingAudio.shift() }));
        }
      });
      // Forward VAPI audio back to Twilio
      vapiWs.on('message', (vapiMsg) => {
        const audio = JSON.parse(vapiMsg);
        if (audio.type === 'audio') {
          twilioWs.send(JSON.stringify({
            event: 'media',
            streamSid,                     // Twilio rejects outbound media without it
            media: { payload: audio.data }
          }));
        }
      });
    }

    if (data.event === 'media' && vapiWs && vapiWs.readyState === WebSocket.OPEN) {
      // Forward Twilio audio to VAPI
      vapiWs.send(JSON.stringify({ type: 'audio', data: data.media.payload }));
    } else if (data.event === 'media') {
      // Buffer until VAPI connection is ready
      pendingAudio.push(data.media.payload);
    }
  });
});
Race‑condition warning: If Twilio sends audio before the VAPI WebSocket is open, buffer the packets (as shown) to avoid loss.
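The bridge also needs teardown when the call ends, otherwise VAPI connections leak. The fragments below assume the twilioWs / vapiWs variables from bridge.js and note in comments where each piece belongs:

// bridge.js additions — call teardown (fragments, not a standalone file)
// Inside the existing twilioWs.on('message') handler:
if (data.event === 'stop' && vapiWs) {
  vapiWs.close(); // Twilio ended the stream; close the VAPI leg
}

// Alongside the other twilioWs handlers:
twilioWs.on('close', () => {
  if (vapiWs && vapiWs.readyState === WebSocket.OPEN) vapiWs.close();
});

// Inside the 'start' handler, after creating vapiWs:
vapiWs.on('close', () => {
  if (twilioWs.readyState === WebSocket.OPEN) twilioWs.close();
});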
4. Configure Twilio Phone Number
In the Twilio Console:
- Phone Numbers → Active Numbers → [your number] → Voice Configuration
- Set the A Call Comes In webhook URL to `https://yourdomain.com/twilio/voice` (HTTP POST); this can also be done programmatically, as sketched below.
- If testing locally, expose the server with `ngrok http 3000` and use the generated HTTPS URL.
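The same configuration can be applied from code with the Twilio Node helper library (already in the npm install above). The phone‑number SID below is a placeholder; look yours up in the Console or via the API.

// configure-number.js — set the voice webhook programmatically
const twilio = require('twilio');
const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

async function configureNumber() {
  const number = await client
    .incomingPhoneNumbers('PNxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') // placeholder SID
    .update({
      voiceUrl: 'https://yourdomain.com/twilio/voice',
      voiceMethod: 'POST'
    });
  console.log('Voice webhook set for', number.phoneNumber);
}

configureNumber().catch(console.error);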
Error Handling & Edge Cases
- Twilio timeout (15 s): If VAPI doesn’t respond, Twilio hangs up. Send a keep‑alive ping to VAPI every 10 s.
- Audio format mismatch: Twilio streams µ‑law 8 kHz; VAPI expects PCM 16 kHz. Either transcode on the bridge (see the sketch after this list) or configure VAPI's transcriber to accept µ‑law (if supported).
- Barge‑in: When the user interrupts, send `{ type: 'cancel' }` to VAPI and flush Twilio's audio buffer to stop the current TTS playback.
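If you do transcode on the bridge, a minimal sketch of the inbound leg (Twilio µ‑law 8 kHz → 16‑bit PCM 16 kHz) is shown below. It uses the standard G.711 µ‑law expansion plus naive sample‑duplication upsampling, which is adequate for a prototype. Whether VAPI accepts base64 PCM in exactly this shape is an assumption to check against its docs, and the outbound leg (PCM → µ‑law, downsampling) follows the same pattern in reverse.

// transcode.js — µ-law 8 kHz (Twilio) → 16-bit PCM 16 kHz
function muLawToLinear(muLawByte) {
  // Standard G.711 µ-law expansion
  const BIAS = 0x84;
  const u = ~muLawByte & 0xff;
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return sign ? -magnitude : magnitude;
}

function transcodeMulaw8kToPcm16k(base64MuLaw) {
  const muLaw = Buffer.from(base64MuLaw, 'base64');
  // 2 bytes per 16-bit sample, and twice the samples for 8 kHz → 16 kHz
  const pcm = Buffer.alloc(muLaw.length * 4);
  for (let i = 0; i < muLaw.length; i++) {
    const sample = muLawToLinear(muLaw[i]);
    pcm.writeInt16LE(sample, i * 4);     // original sample
    pcm.writeInt16LE(sample, i * 4 + 2); // duplicated (nearest-neighbour upsampling)
  }
  return pcm.toString('base64');
}

module.exports = { muLawToLinear, transcodeMulaw8kToPcm16k };

In bridge.js this would wrap the forwarded payload, e.g. transcodeMulaw8kToPcm16k(data.media.payload), before sending to VAPI.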
Testing & Validation
- Call the Twilio number.
- Verify in logs:
- TwiML webhook hit (200 response)
- WebSocket connection established
- VAPI assistant initialized
- Bidirectional audio packets flowing
- Latency benchmark: Measure the time from the end of user speech to the start of the bot response. Aim for roughly 1200 ms or less; anything much higher feels broken to callers. A rough way to capture this from the bridge is sketched below.
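One rough way to get that number without extra tooling is a voice‑activity heuristic in the bridge: track the last time the inbound Twilio audio was above a speech‑energy threshold, then log the delta when the first VAPI audio frame of the reply arrives. The threshold below is a guess to tune per environment, and the function names are hypothetical hooks you would call from the 'media' branch and the VAPI audio handler in bridge.js.

// latency-probe.js — rough end-of-speech → bot-audio latency at the bridge
// Reuses muLawToLinear() from the transcoding sketch above.
const { muLawToLinear } = require('./transcode');

const SPEECH_RMS_THRESHOLD = 1000; // linear-PCM RMS; heuristic, tune per environment
let lastSpeechAt = 0;
let awaitingBotAudio = false;

// Call from the 'media' branch with data.media.payload
function onTwilioMediaFrame(base64MuLaw) {
  const bytes = Buffer.from(base64MuLaw, 'base64');
  let sumSquares = 0;
  for (const b of bytes) {
    const s = muLawToLinear(b);
    sumSquares += s * s;
  }
  const rms = Math.sqrt(sumSquares / bytes.length);
  if (rms > SPEECH_RMS_THRESHOLD) {
    lastSpeechAt = Date.now(); // caller is (probably) speaking
    awaitingBotAudio = true;   // the next bot audio frame closes this turn
  }
}

// Call whenever a VAPI audio frame arrives
function onVapiAudioFrame() {
  if (awaitingBotAudio && lastSpeechAt) {
    console.log(`turn latency ≈ ${Date.now() - lastSpeechAt} ms`);
    awaitingBotAudio = false;
  }
}

module.exports = { onTwilioMediaFrame, onVapiAudioFrame };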
Common Issues & Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| No audio from bot | VAPI sends PCM while Twilio expects µ‑law | Add a transcoding layer or use VAPI’s own telephony provider (bypasses Twilio). |
| Bot cuts off mid‑sentence | VAD endpointing too low | Increase transcriber.endpointing to 400 ms. |
| Webhook fails | Twilio requires HTTPS | Use ngrok for local testing or deploy with a valid SSL certificate. |
System Diagram
graph LR
Phone[Phone Call]
Gateway[Call Gateway]
IVR[Interactive Voice Response]
STT[Speech‑to‑Text]
NLU[Intent Detection]
LLM[Response Generation]
TTS[Text‑to‑Speech]
Error[Error Handling]
Output[Call Output]
Phone --> Gateway
Gateway --> IVR
IVR --> STT
STT --> NLU
NLU --> LLM
LLM --> TTS
TTS --> Output
Gateway -->|Call Drop/Error| Error