Real-Time Streaming with VAPI: Building Voice Applications

Published: December 11, 2025, 06:26 GMT+8
6 min read

Source: Dev.to

TL;DR

Most voice applications break when network jitter exceeds 200 ms or a user interrupts mid-sentence. This guide shows how to build a production-grade streaming voice application with VAPI's WebRTC voice integration and Twilio call routing. You'll learn real-time audio processing, proper barge-in detection, and race-free session state management. The result: sub-500 ms response latency with graceful interruption handling.

API Access & Authentication

  • VAPI API key – obtain from dashboard.vapi.ai
  • Twilio Account SID & Auth Token – available in the Twilio Console
  • Twilio phone number – a voice-enabled number

Development Environment

  • Node.js 18+ (native fetch is required for the streaming API calls)
  • A public HTTPS endpoint for webhooks (e.g., ngrok, Railway, or a production domain)
  • A valid SSL certificate (mandatory for WebRTC voice integration)

Network Requirements

  • Outbound HTTPS (port 443) for VAPI/Twilio API calls
  • An inbound webhook receiver that responds within the 5 s timeout
  • WebSocket support for real-time voice streaming connections

Technical Knowledge

  • Async/await patterns (streaming audio processing is non-blocking)
  • Webhook signature validation (security is non-negotiable)
  • Basic PCM audio format (16 kHz, 16-bit) used by voice applications – frame math sketched below
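
For reference, a quick frame-math sketch for that PCM format (assuming mono audio and a typical 20 ms frame):

// 16 kHz, 16-bit mono PCM frame math
const sampleRate = 16000;                                // samples per second
const bytesPerSample = 2;                                // 16-bit = 2 bytes
const frameMs = 20;                                      // common streaming frame duration
const samplesPerFrame = (sampleRate * frameMs) / 1000;   // 320 samples
const bytesPerFrame = samplesPerFrame * bytesPerSample;  // 640 bytes per frame
console.log(`One ${frameMs} ms frame = ${bytesPerFrame} bytes`); // ~32 KB/s sustained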

Cost Awareness

  • VAPI bills per minute of voice streaming
  • Twilio bills per call plus per-minute usage for interactive voice response (IVR) systems

Streaming Implementation Details

Most streaming implementations fail because they treat VAPI like a traditional REST API. VAPI requires a stateful WebSocket connection carrying a bidirectional audio stream.

// Server‑side assistant configuration – production grade
const assistantConfig = {
  model: {
    provider: "openai",
    model: "gpt-4",
    temperature: 0.7,
    messages: [
      {
        role: "system",
        content: "You are a voice assistant. Keep responses under 2 sentences."
      }
    ]
  },
  voice: {
    provider: "11labs",
    voiceId: "21m00Tcm4TlvDq8ikWAM",
    stability: 0.5,
    similarityBoost: 0.75
  },
  transcriber: {
    provider: "deepgram",
    model: "nova-2",
    language: "en-US"
  },
  firstMessage: "How can I help you today?",
  endCallMessage: "Thanks for calling. Goodbye.",
  recordingEnabled: true
};

Note: The transcriber config is critical. Default models add 200‑400 ms latency; Deepgram’s nova-2 reduces this to 80‑120 ms at a higher cost.
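
Once the config is defined, it can be registered server-side. A minimal sketch, assuming VAPI's POST /assistant endpoint with Bearer authentication (verify the endpoint and payload shape against the current VAPI docs):

const response = await fetch("https://api.vapi.ai/assistant", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.VAPI_API_KEY}`, // private server key, not the public key
    "Content-Type": "application/json"
  },
  body: JSON.stringify(assistantConfig)
});

const assistant = await response.json();
console.log("Assistant created:", assistant.id); // persist this ID for later calls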

Architecture diagram

flowchart LR
    A[User Browser] -->|WebSocket| B[VAPI SDK]
    B -->|Audio Stream| C[VAPI Platform]
    C -->|STT| D[Deepgram]
    C -->|LLM| E[OpenAI]
    C -->|TTS| F[ElevenLabs]
    C -->|Events| G[Your Webhook Server]
    G -->|Function Results| C

Audio flows entirely inside the VAPI platform and never passes through your backend; only events reach your webhook server. Proxying audio through your own server adds 500 ms+ of latency and breaks streaming.

Client‑Side Setup

import Vapi from "@vapi-ai/web";

const vapi = new Vapi(process.env.VAPI_PUBLIC_KEY);

// Shared guard flag used by the handlers below to prevent duplicate LLM requests
let isProcessing = false;

// Set up event handlers **before** starting the stream
vapi.on("call-start", () => {
  console.log("Stream active");
  isProcessing = false; // reset race‑condition guard
});

vapi.on("speech-start", () => {
  console.log("User speaking – cancel any queued TTS");
});

vapi.on("message", (message) => {
  if (message.type === "transcript" && message.transcriptType === "partial") {
    // Show live transcription – do NOT act on it yet
    updateUI(message.transcript); // updateUI: your app-specific render function
  }
});

vapi.on("error", (error) => {
  console.error("Stream error:", error);
  // Implement retry logic for mobile network drops
});

// Start the streaming call
await vapi.start(assistantConfig);

Race‑condition warning: act only on transcriptType === "final" to avoid duplicate LLM requests.
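
A minimal sketch of that guard, reusing the isProcessing flag reset in the call-start handler (handleFinalTranscript is a hypothetical, app-specific function):

vapi.on("message", async (message) => {
  if (message.type !== "transcript" || message.transcriptType !== "final") return;
  if (isProcessing) return;            // a request is already in flight – drop the duplicate
  isProcessing = true;
  try {
    await handleFinalTranscript(message.transcript); // hypothetical app logic
  } finally {
    isProcessing = false;              // always release the guard
  }
});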

Server‑Side Webhook

const express = require('express');
const crypto = require('crypto');
const app = express();

app.use(express.json());

// Validate webhook signature – mandatory.
// Caveat: hashing a re-serialized body only works if serialization is
// byte-exact; in production, HMAC the raw request body (e.g. via express.raw).
function validateSignature(req) {
  const signature = req.headers['x-vapi-signature'];
  const payload = JSON.stringify(req.body);
  const hash = crypto
    .createHmac('sha256', process.env.VAPI_SERVER_SECRET)
    .update(payload)
    .digest('hex');
  return signature === hash;
}

app.post('/webhook/vapi', async (req, res) => {
  if (!validateSignature(req)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const { message } = req.body;

  // Handle function calls from the assistant
  if (message.type === 'function-call') {
    const { functionCall } = message;
    try {
      // **Timeout trap:** VAPI expects a response within 5 seconds.
      // If a function needs more time, return immediately and use a callback
      // mechanism (see the sketch after this block); otherwise the call drops.
      const result = await handleFunctionCall(functionCall); // your app-specific dispatcher
      return res.status(200).json({ result });
    } catch (err) {
      console.error('Function call error:', err);
      return res.status(500).json({ error: 'Function call failed' });
    }
  }

  res.status(200).json({ received: true });
});
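
For the slow-function case flagged above, a generic "acknowledge fast, finish later" sketch; slowLookup and deliverResultToCaller are hypothetical helpers, and the follow-up delivery channel is application-specific rather than a documented VAPI API:

// Inside the function-call branch, when the work cannot finish in 5 seconds:
if (message.type === 'function-call') {
  // Acknowledge inside the 5 s window with a holding response...
  res.status(200).json({ result: 'Working on it – one moment.' });

  // ...then run the slow work outside the request/response cycle.
  setImmediate(async () => {
    try {
      const data = await slowLookup(message.functionCall.parameters); // hypothetical
      await deliverResultToCaller(message.call.id, data);             // hypothetical
    } catch (err) {
      console.error('Deferred function work failed:', err);
    }
  });
  return;
}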

Audio Processing Pipeline

graph LR
    Mic[Microphone Input] --> AudioBuf[Audio Buffering]
    AudioBuf --> VAD[Voice Activity Detection]
    VAD -->|Detected| STT[Speech‑to‑Text]
    VAD -->|Not Detected| Error[Error Handling]
    STT --> NLU[Intent Recognition]
    NLU --> API[API Integration]
    API --> LLM[Response Generation]
    LLM --> TTS[Text‑to‑Speech]
    TTS --> Speaker[Speaker Output]
    Error -->|Retry| AudioBuf
    Error -->|Fail| Speaker

Local Testing

Run your server and expose it

# Terminal 1 – start the webhook server
node server.js

# Terminal 2 – expose via ngrok
ngrok http 3000

# Terminal 3 – forward VAPI webhooks to the public URL
vapi webhooks forward https://<your-subdomain>.ngrok.io/webhook/vapi

Add debug logging to the webhook

app.post('/webhook', (req, res) => {
  const { message } = req.body;

  console.log('Event received:', {
    type: message.type,
    timestamp: new Date().toISOString(),
    callId: message.call?.id,
    payload: JSON.stringify(message, null, 2)
  });

  // Signature check happens after logging so rejected events are still visible
  const isValid = validateSignature(req);
  if (!isValid) {
    console.error('Invalid signature – potential security issue');
    return res.status(401).json({ error: 'Invalid signature' });
  }

  res.status(200).json({ received: true });
});

Verify signature validation with curl

# Expected to fail with 401 Unauthorized
curl -X POST http://localhost:3000/webhook \
  -H "Content-Type: application/json" \
  -H "x-vapi-signature: invalid_signature" \
  -d '{"message":{"type":"status-update"}}'

Monitor response times to stay within the 5 s webhook timeout, and log every signature-validation failure; these usually indicate a configuration mismatch or a replay attack.
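
A minimal Express sketch of that response-time monitoring (the 4 s warning threshold is an arbitrary margin under the 5 s timeout):

// Register before the webhook routes so every request is timed
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const elapsed = Date.now() - start;
    if (elapsed > 4000) {
      console.warn(`Slow webhook response: ${elapsed} ms on ${req.path}`);
    }
  });
  next();
});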

With these patterns in place, you can deploy a robust, low-latency streaming voice application that handles interruptions and network jitter gracefully.
