Real-Time Streaming with VAPI: Building Voice Applications
Source: Dev.to
TL;DR
Most voice applications break when network jitter exceeds 200 ms or a user interrupts mid-sentence. This article shows how to build a production-grade streaming voice application using VAPI's WebRTC voice integration and Twilio call routing. You will learn real-time audio processing, proper barge-in (interruption) detection, and session-state management without race conditions. The result: sub-500 ms response latency and graceful handling of interruptions.
API Access & Authentication
- VAPI API key – obtain from dashboard.vapi.ai
- Twilio Account SID and Auth Token – available in the Twilio Console
- Twilio phone number – a voice-enabled number
Development Environment
- Node.js 18+ (native fetch support is required for the streaming API calls)
- A public HTTPS endpoint for webhooks (e.g., ngrok, Railway, or a production domain)
- A valid SSL certificate (mandatory for WebRTC voice integration)
Network Requirements
- Outbound HTTPS (port 443) for VAPI/Twilio API calls
- Inbound webhook receivers must respond within the 5 s timeout
- WebSocket support for real-time voice streaming connections
Technical Knowledge
- Async/await patterns (streaming audio processing is non-blocking)
- Webhook signature validation (security is non-negotiable)
- Basic PCM audio format (16 kHz, 16-bit) for voice applications – see the quick arithmetic below
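To make that format concrete, here is the raw bandwidth involved (a quick illustration, not VAPI-specific):

// 16 kHz, 16-bit mono PCM: 2 bytes per sample
const bytesPerSecond = 16000 * 2;                    // 32,000 B/s per caller
const bytesPerFrame = (bytesPerSecond * 20) / 1000;  // 640 B per 20 ms frame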
Cost Awareness
- VAPI bills per minute of voice streaming
- Twilio bills per call plus per-minute usage for interactive voice response (IVR) systems – a rough estimator follows below
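A back-of-the-envelope estimator makes the two billing axes concrete. The rates below are placeholders – substitute current figures from the VAPI and Twilio pricing pages:

// Back-of-the-envelope cost estimate. Rates are PLACEHOLDERS –
// replace with current numbers from the VAPI and Twilio pricing pages.
function estimateMonthlyCost({ calls, avgMinutes, vapiPerMin, twilioPerMin, twilioPerCall }) {
  const minutes = calls * avgMinutes;
  return {
    vapi: minutes * vapiPerMin,
    twilio: calls * twilioPerCall + minutes * twilioPerMin
  };
}

console.log(estimateMonthlyCost({
  calls: 1000, avgMinutes: 3,
  vapiPerMin: 0.05, twilioPerMin: 0.014, twilioPerCall: 0.01 // placeholders
}));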
Streaming Implementation Details
Most streaming implementations fail because they treat VAPI like a traditional REST API. VAPI requires a stateful WebSocket carrying a bidirectional audio stream.
// Server‑side assistant configuration – production grade
const assistantConfig = {
  model: {
    provider: "openai",
    model: "gpt-4",
    temperature: 0.7,
    messages: [
      {
        role: "system",
        content: "You are a voice assistant. Keep responses under 2 sentences."
      }
    ]
  },
  voice: {
    provider: "11labs",
    voiceId: "21m00Tcm4TlvDq8ikWAM",
    stability: 0.5,
    similarityBoost: 0.75
  },
  transcriber: {
    provider: "deepgram",
    model: "nova-2",
    language: "en-US"
  },
  firstMessage: "How can I help you today?",
  endCallMessage: "Thanks for calling. Goodbye.",
  recordingEnabled: true
};
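If you prefer a persistent assistant over passing the config inline every time, you can create one through VAPI's REST API – a minimal sketch using Node 18's native fetch (verify the endpoint shape against the current API reference):

// Create a persistent assistant from the config above (Node 18+, native fetch)
const response = await fetch('https://api.vapi.ai/assistant', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.VAPI_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify(assistantConfig)
});

const assistant = await response.json();
console.log('Assistant ID:', assistant.id); // reuse this ID instead of the inline config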
Note: The transcriber config is critical. Default models add 200‑400 ms of latency; Deepgram's nova-2 reduces this to 80‑120 ms at a higher cost.
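To see why the transcriber choice matters for the sub-500 ms target, here is a rough per-turn latency budget. Only the STT range comes from the note above; the LLM, TTS, and transport figures are illustrative assumptions:

// Rough per-turn latency budget (illustrative figures, except STT from above)
const latencyBudgetMs = {
  stt: 100,           // Deepgram nova-2: 80–120 ms
  llmFirstToken: 250, // assumed time-to-first-token for a short GPT-4 reply
  ttsFirstByte: 120,  // assumed ElevenLabs streaming time-to-first-audio
  transport: 30       // assumed WebRTC network overhead
};

const total = Object.values(latencyBudgetMs).reduce((a, b) => a + b, 0);
console.log(`Estimated time to first audio: ${total} ms`); // ≈ 500 ms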
Architecture diagram
flowchart LR
A[User Browser] -->|WebSocket| B[VAPI SDK]
B -->|Audio Stream| C[VAPI Platform]
C -->|STT| D[Deepgram]
C -->|LLM| E[OpenAI]
C -->|TTS| F[ElevenLabs]
C -->|Events| G[Your Webhook Server]
G -->|Function Results| C
Audio flows inside the VAPI platform and never passes through your backend. Proxying the audio yourself adds 500 ms+ of latency and breaks streaming.
Client‑Side Setup
import Vapi from "@vapi-ai/web";

const vapi = new Vapi(process.env.VAPI_PUBLIC_KEY);

// Guard flag shared by the handlers below to avoid double-processing transcripts
let isProcessing = false;

// Set up event handlers **before** starting the stream
vapi.on("call-start", () => {
  console.log("Stream active");
  isProcessing = false; // reset race-condition guard
});

vapi.on("speech-start", () => {
  console.log("User speaking – cancel any queued TTS");
});

vapi.on("message", (message) => {
  if (message.type === "transcript" && message.transcriptType === "partial") {
    // Show live transcription – do NOT act on it yet
    updateUI(message.transcript); // updateUI: your own rendering helper
  }
});

vapi.on("error", (error) => {
  console.error("Stream error:", error);
  // Implement retry logic for mobile network drops
});

// Start the streaming call
await vapi.start(assistantConfig);
Race-condition warning: only act on transcriptType === "final" to avoid duplicate LLM requests.
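A minimal sketch of that guard, reusing the isProcessing flag declared above; respondToUser is a hypothetical app-level handler:

vapi.on("message", async (message) => {
  if (message.type !== "transcript" || message.transcriptType !== "final") return;

  if (isProcessing) return; // drop overlapping finals instead of double-firing
  isProcessing = true;
  try {
    await respondToUser(message.transcript); // hypothetical: your app-level handler
  } finally {
    isProcessing = false;   // always release, even if the handler throws
  }
});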
Server‑Side Webhook
const express = require('express');
const crypto = require('crypto');

const app = express();

// Capture the raw request bytes – the HMAC must be computed over the exact
// payload received, not a re-serialized JSON.stringify(req.body)
app.use(express.json({
  verify: (req, res, buf) => { req.rawBody = buf; }
}));

// Validate webhook signature – mandatory
function validateSignature(req) {
  const signature = req.headers['x-vapi-signature'] || '';
  const hash = crypto
    .createHmac('sha256', process.env.VAPI_SERVER_SECRET)
    .update(req.rawBody)
    .digest('hex');
  // Timing-safe comparison prevents signature guessing via response timing
  return signature.length === hash.length &&
    crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(hash));
}

app.post('/webhook/vapi', async (req, res) => {
  if (!validateSignature(req)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const { message } = req.body;

  // Handle function calls from the assistant
  if (message.type === 'function-call') {
    const { functionCall } = message;
    try {
      // **Timeout trap:** VAPI expects a response within 5 seconds.
      // If a function needs more time, return immediately and use a callback
      // mechanism (see the sketch below); otherwise the call drops.
      const result = await handleFunctionCall(functionCall); // your own dispatcher (not shown)
      return res.status(200).json({ result });
    } catch (err) {
      console.error('Function call error:', err);
      return res.status(200).json({ error: 'Function execution failed' });
    }
  }

  res.status(200).json({ received: true });
});
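The "return immediately, finish later" pattern looks roughly like this. Both enqueueJob and deliverResult are hypothetical helpers – wire them to your own job queue and to whatever callback transport your assistant configuration defines:

// Sketch: acknowledge within the 5 s window, finish the slow work afterwards.
// enqueueJob, slowLookup, and deliverResult are HYPOTHETICAL helpers.
// (Signature validation omitted for brevity.)
app.post('/webhook/vapi-slow', async (req, res) => {
  const { functionCall } = req.body.message;

  // 1. Acknowledge immediately so the call doesn't drop
  res.status(200).json({ result: 'Working on it – one moment.' });

  // 2. Complete the slow work out-of-band
  enqueueJob(async () => {
    const result = await slowLookup(functionCall.parameters); // may exceed 5 s here
    await deliverResult(functionCall.id, result);             // your callback transport
  });
});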
Audio Processing Pipeline
graph LR
Mic[Microphone Input] --> AudioBuf[Audio Buffering]
AudioBuf --> VAD[Voice Activity Detection]
VAD -->|Detected| STT[Speech‑to‑Text]
VAD -->|Not Detected| Error[Error Handling]
STT --> NLU[Intent Recognition]
NLU --> API[API Integration]
API --> LLM[Response Generation]
LLM --> TTS[Text‑to‑Speech]
TTS --> Speaker[Speaker Output]
Error -->|Retry| AudioBuf
Error -->|Fail| Speaker
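As a concrete example of the VAD stage, here is a minimal energy-based detector over 16 kHz / 16-bit PCM frames (the format from the prerequisites). This is purely illustrative – VAPI performs voice activity detection on its platform, so you would only need something like this when pre-filtering audio yourself:

// Minimal energy-based voice activity detection over 16-bit PCM frames.
// Illustrative only – VAPI runs its own VAD within the platform.
const SAMPLE_RATE = 16000;
const FRAME_MS = 20;
const SAMPLES_PER_FRAME = (SAMPLE_RATE * FRAME_MS) / 1000; // 320 samples

function isSpeech(frame, threshold = 500) {
  // frame: Int16Array of SAMPLES_PER_FRAME samples
  let sumSquares = 0;
  for (const sample of frame) sumSquares += sample * sample;
  const rms = Math.sqrt(sumSquares / frame.length);
  return rms > threshold; // tune per microphone and environment
}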
Local Testing
Run your server and expose it
# Terminal 1 – start the webhook server
node server.js
# Terminal 2 – expose via ngrok
ngrok http 3000
# Terminal 3 – forward VAPI webhooks to the public URL
vapi webhooks forward https://<your-subdomain>.ngrok.io/webhook/vapi
Add debug logging to the webhook
app.post('/webhook', (req, res) => {
  const { message } = req.body;

  console.log('Event received:', {
    type: message.type,
    timestamp: new Date().toISOString(),
    callId: message.call?.id,
    payload: JSON.stringify(message, null, 2)
  });

  // Validate signature before processing
  const isValid = validateSignature(req);
  if (!isValid) {
    console.error('Invalid signature – potential security issue');
    return res.status(401).json({ error: 'Invalid signature' });
  }

  res.status(200).json({ received: true });
});
Verify signature validation with curl
# Expected to fail with 401 Unauthorized
curl -X POST http://localhost:3000/webhook \
-H "Content-Type: application/json" \
-H "x-vapi-signature: invalid_signature" \
-d '{"message":{"type":"status-update"}}'
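To exercise the happy path as well, sign a payload yourself with the same secret and expect a 200 – a sketch using Node 18's native fetch (the request body must match the signed bytes exactly):

// Happy-path test: sign the exact payload bytes with the shared secret
const crypto = require('crypto');

const body = JSON.stringify({ message: { type: 'status-update' } });
const signature = crypto
  .createHmac('sha256', process.env.VAPI_SERVER_SECRET)
  .update(body)
  .digest('hex');

fetch('http://localhost:3000/webhook', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', 'x-vapi-signature': signature },
  body
}).then((res) => console.log('Status:', res.status)); // expect 200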
Monitor response times to stay within the 5 s webhook timeout, and log every signature-validation failure – they usually indicate a configuration mismatch or a replay attack. A minimal monitoring middleware is sketched below.
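A minimal sketch – the 4000 ms warning threshold is an arbitrary safety margin, not a VAPI constant:

// Warn when a webhook response creeps toward the 5 s limit.
// Register this BEFORE your webhook routes so it wraps them.
app.use('/webhook', (req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const elapsed = Date.now() - start;
    if (elapsed > 4000) {
      console.warn(`Slow webhook response: ${elapsed} ms (limit is 5000 ms)`);
    }
  });
  next();
});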
With these patterns in place, you can deploy a robust, low-latency streaming voice application that handles interruptions and network fluctuations gracefully.