How to Implement Voice AI with Twilio and VAPI: A Step-by-Step Guide

Published: December 10, 2025 at 09:19 PM EST
5 min read
Source: Dev.to

TL;DR

Most Twilio + VAPI integrations break because developers try to merge incompatible audio streams.
Fix: Use Twilio for telephony transport (PSTN → WebSocket) and VAPI for AI processing (STT → LLM → TTS). Build a proxy server that bridges Twilio’s Media Streams to VAPI’s WebSocket protocol, handling µ‑law ↔ PCM conversion and bidirectional audio flow. The result is a production‑grade voice AI that handles real phone calls without audio glitches or dropped connections.

Prerequisites

API Access & Authentication

  • VAPI API key (dashboard.vapi.ai)
  • Twilio Account SID and Auth Token (console.twilio.com)
  • Twilio phone number with Voice capabilities enabled
  • Node.js 18+ (for webhook server)

System Requirements

  • Public HTTPS endpoint (e.g., ngrok http 3000 for local dev)
  • SSL certificate (Twilio rejects non‑HTTPS webhooks)
  • Minimum 512 MB RAM for the Node.js process
  • Port 3000 open for webhook traffic

Technical Knowledge

  • Familiarity with REST APIs and webhook patterns
  • Basic TwiML (Twilio Markup Language) knowledge
  • Experience with async/await in JavaScript
  • Understanding of WebSocket connections for real‑time streaming

Cost Awareness

  • Twilio voice calls: $0.0085 /min
  • VAPI (GPT‑4 model): ≈ $0.03 /min
  • Expected combined cost: $0.04–$0.05 /min for production traffic
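
For example, at $0.04–$0.05 /min, 10,000 call‑minutes per month works out to roughly $400–$500.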

Step‑By‑Step Tutorial

Configuration & Setup

Most Twilio + VAPI integrations fail because developers try to merge two incompatible call flows.
Reality: Twilio handles telephony (SIP, PSTN routing); VAPI handles voice AI (STT, LLM, TTS). They don’t “integrate” directly—you bridge them.

Architecture decision: Choose inbound (Twilio receives → forwards to VAPI) or outbound (VAPI initiates → uses Twilio as carrier). This guide covers inbound only.

Install dependencies

npm install @vapi-ai/web express twilio ws

Critical config:

  • VAPI needs a public webhook endpoint.
  • Twilio needs TwiML instructions.
    These are separate responsibilities.

Architecture & Flow

flowchart LR
    A[Caller] -->|PSTN| B[Twilio Number]
    B -->|TwiML Stream| C[Your Server]
    C -->|WebSocket| D[VAPI Assistant]
    D -->|AI Response| C
    C -->|Audio Stream| B
    B -->|PSTN| A

Inbound flow:

  1. Twilio receives the call and executes your TwiML webhook.
  2. Audio streams to your server via Twilio Media Streams.
  3. Your server forwards the audio to VAPI over a WebSocket.
  4. VAPI processes the audio (STT → LLM → TTS).
  5. The generated audio streams back through the same chain to the caller.

Step‑By‑Step Implementation

1. Create VAPI Assistant

Create an assistant via the VAPI dashboard (vapi.ai → Assistants → Create) or the API. Recommended settings for low latency:

  • Model: GPT‑4 (lower latency than GPT‑4‑turbo for voice)
  • Voice: ElevenLabs (≈ 150 ms)
  • Transcriber: Deepgram Nova‑2 with endpointing = 300 ms silence threshold

Production warning: The default 200 ms endpointing can cause false interruptions on mobile networks. Increase to 300–400 ms.
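
If you prefer to script this instead of clicking through the dashboard, the assistant can be created with a POST to the VAPI REST API. The sketch below mirrors the settings above; the payload field names (model, voice, transcriber, endpointing) and the voiceId are indicative assumptions, so verify them against the VAPI API reference before relying on them.

// create-assistant.js — sketch of creating the assistant via the VAPI REST API (Node 18+ fetch)
async function createAssistant() {
  const res = await fetch('https://api.vapi.ai/assistant', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      name: 'Phone Support Bot',
      model: { provider: 'openai', model: 'gpt-4' },
      voice: { provider: '11labs', voiceId: 'rachel' },                      // assumed voice id
      transcriber: { provider: 'deepgram', model: 'nova-2', endpointing: 300 } // 300 ms silence threshold
    })
  });
  if (!res.ok) throw new Error(`VAPI returned ${res.status}`);

  const assistant = await res.json();
  console.log('Assistant ID:', assistant.id); // store this as VAPI_ASSISTANT_ID for the bridge
}

createAssistant().catch(console.error);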

2. Set Up Twilio TwiML Webhook

Create an Express endpoint that returns TwiML with a <Connect> element. Twilio will stream µ‑law audio to the URL you provide.

// server.js
const express = require('express');
const app = express();

app.post('/twilio/voice', (req, res) => {
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://yourdomain.com/media-stream"/>
  </Connect>
</Response>`;

  res.type('text/xml');
  res.send(twiml);
});

app.listen(3000, () => console.log('Server listening on port 3000'));

Note: wss://yourdomain.com/media-stream is your WebSocket server (implemented in the next step), not a VAPI endpoint. Twilio streams µ‑law audio here.
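
The bridge shown in the next step listens on a separate port (8080) at the root path. If you want the wss://yourdomain.com/media-stream URL in the TwiML to resolve to this same Node process, you can attach the WebSocket server to the Express HTTP server instead. A minimal sketch, assuming the ws package; it replaces the app.listen(3000) call above:

// Alternative wiring (sketch): share one HTTP server between Express and ws,
// so Twilio's <Stream url="wss://yourdomain.com/media-stream"/> reaches this process.
const http = require('http');
const WebSocket = require('ws');

const server = http.createServer(app); // `app` from server.js above
const wss = new WebSocket.Server({ server, path: '/media-stream' });

wss.on('connection', (twilioWs) => {
  // same bridging logic as bridge.js below
});

server.listen(3000, () => console.log('HTTP + WebSocket listening on port 3000'));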

3. Bridge Twilio Stream to VAPI

A simple WebSocket bridge that forwards audio between Twilio and VAPI, handling the start event and bidirectional media flow.

// bridge.js
const WebSocket = require('ws');
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (twilioWs) => {
  let vapiWs = null;
  let streamSid = null; // required on every media message sent back to Twilio
  const pendingAudio = [];

  twilioWs.on('message', (msg) => {
    const data = JSON.parse(msg);

    if (data.event === 'start') {
      streamSid = data.start.streamSid;

      // Initialise VAPI connection
      vapiWs = new WebSocket('wss://api.vapi.ai/ws');

      vapiWs.on('open', () => {
        vapiWs.send(JSON.stringify({
          type: 'assistant-request',
          assistantId: process.env.VAPI_ASSISTANT_ID,
          metadata: { callSid: data.start.callSid }
        }));

        // Flush any buffered audio
        while (pendingAudio.length) {
          vapiWs.send(JSON.stringify({ type: 'audio', data: pendingAudio.shift() }));
        }
      });

      // Forward VAPI audio back to Twilio
      vapiWs.on('message', (vapiMsg) => {
        const audio = JSON.parse(vapiMsg);
        if (audio.type === 'audio') {
          twilioWs.send(JSON.stringify({
            event: 'media',
            streamSid, // Twilio drops outbound media messages that omit the streamSid
            media: { payload: audio.data }
          }));
        }
      });
    }

    if (data.event === 'media' && vapiWs && vapiWs.readyState === WebSocket.OPEN) {
      // Forward Twilio audio to VAPI
      vapiWs.send(JSON.stringify({ type: 'audio', data: data.media.payload }));
    } else if (data.event === 'media') {
      // Buffer until VAPI connection is ready
      pendingAudio.push(data.media.payload);
    }

    if (data.event === 'stop' && vapiWs) {
      // Caller hung up — tear down the VAPI leg
      vapiWs.close();
    }
  });

  twilioWs.on('close', () => {
    if (vapiWs) vapiWs.close();
  });
});

Race‑condition warning: If Twilio sends audio before the VAPI WebSocket is open, buffer the packets (as shown) to avoid loss.

4. Configure Twilio Phone Number

In the Twilio Console:

  1. Phone Numbers → Active Numbers → [your number] → Voice Configuration
  2. Set A Call Comes In webhook URL to https://yourdomain.com/twilio/voice (HTTP POST).
  3. If testing locally, expose the server with ngrok http 3000 and use the generated HTTPS URL. You can also set the webhook programmatically with the Twilio SDK (see the sketch below).
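
The same configuration can be done from code with the twilio package installed earlier. A sketch, assuming a hypothetical TWILIO_PHONE_NUMBER_SID environment variable holding your number's PN… SID:

// set-webhook.js — point a Twilio number at the voice webhook via the SDK (sketch)
const twilio = require('twilio');
const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

// TWILIO_PHONE_NUMBER_SID is assumed here; look it up in the console or via
// client.incomingPhoneNumbers.list().
client.incomingPhoneNumbers(process.env.TWILIO_PHONE_NUMBER_SID)
  .update({
    voiceUrl: 'https://yourdomain.com/twilio/voice',
    voiceMethod: 'POST'
  })
  .then((number) => console.log('Webhook set for', number.phoneNumber))
  .catch(console.error);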

Error Handling & Edge Cases

  • Twilio timeout (15 s): If VAPI doesn’t respond, Twilio hangs up. Send a keep‑alive ping to VAPI every 10 s.
  • Audio format mismatch: Twilio streams µ‑law 8 kHz; VAPI expects PCM 16 kHz. Either transcode on the bridge (a minimal decoding sketch follows this list) or configure VAPI’s transcriber to accept µ‑law (if supported).
  • Barge‑in: When the user interrupts, send { type: 'cancel' } to VAPI and flush Twilio’s audio buffer to stop the current TTS playback.
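
If you do transcode on the bridge, the inbound half is standard G.711 µ‑law decoding. A minimal sketch (decode only, and not VAPI‑specific): you would still need to Base64‑decode Twilio’s payload first and upsample 8 kHz → 16 kHz, e.g. by linear interpolation, before sending the result to VAPI.

// G.711 µ‑law → 16‑bit linear PCM (one sample). Standard decode table math.
function mulawToPcm16(mulawByte) {
  const BIAS = 0x84;
  const u = ~mulawByte & 0xff;
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
  return sign ? -magnitude : magnitude;
}

// Decode a whole Twilio media payload (Base64 µ‑law) into a PCM16 buffer at 8 kHz.
function decodePayload(base64Payload) {
  const mulaw = Buffer.from(base64Payload, 'base64');
  const pcm = Buffer.alloc(mulaw.length * 2);
  for (let i = 0; i < mulaw.length; i++) {
    pcm.writeInt16LE(mulawToPcm16(mulaw[i]), i * 2);
  }
  return pcm; // still 8 kHz — upsample to 16 kHz before forwarding to VAPI
}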

Testing & Validation

  1. Call the Twilio number.
  2. Verify in logs:
    • TwiML webhook hit (200 response)
    • WebSocket connection established
    • VAPI assistant initialized
    • Bidirectional audio packets flowing
  3. Latency benchmark: measure the time from the end of user speech to the start of the bot’s response. Aim for roughly 1200 ms or less; anything noticeably slower feels broken to callers.

Common Issues & Fixes

Symptom | Likely Cause | Fix
No audio from bot | VAPI sends PCM while Twilio expects µ‑law | Add a transcoding layer or use VAPI’s own telephony provider (bypasses Twilio).
Bot cuts off mid‑sentence | VAD endpointing too low | Increase transcriber.endpointing to 400 ms.
Webhook fails | Twilio requires HTTPS | Use ngrok for local testing or deploy with a valid SSL certificate.

System Diagram

graph LR
    Phone[Phone Call]
    Gateway[Call Gateway]
    IVR[Interactive Voice Response]
    STT[Speech‑to‑Text]
    NLU[Intent Detection]
    LLM[Response Generation]
    TTS[Text‑to‑Speech]
    Error[Error Handling]
    Output[Call Output]

    Phone --> Gateway
    Gateway --> IVR
    IVR --> STT
    STT --> NLU
    NLU --> LLM
    LLM --> TTS
    TTS --> Output
    Gateway -->|Call Drop/Error| Error
