How to Prioritize Naturalness in Voice AI: Implementing VAD

Published: December 12, 2025, 14:39 GMT+8

4 min read

Source: Dev.to

TL;DR

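Most voice AI breaks down when users interrupt mid-sentence or pause to think: the bot either talks over them or cuts them off. Voice Activity Detection (VAD) addresses this by detecting speech boundaries in real time, which enables natural turn-taking and barge-in handling. Configure VAPI's VAD thresholds, add backchannels (for example "mm-hmm"), and flush the audio buffer on interruption to avoid overlap. The result is a conversation that feels human rather than robotic.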

API Access & Authentication

VAPI API key: get it from dashboard.vapi.ai
Twilio Account SID and Auth Token: used for phone number configuration

Technical Requirements

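A public HTTPS endpoint for webhook handling (ngrok works for local development)
Node.js 18+ with npm or yarn
Basic knowledge of WebSocket connections and event-driven architecture
Familiarity with async/await in JavaScript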

Voice AI Fundamentals

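VAD thresholds and their effect on latency
Turn-taking mechanics (detecting when the user has stopped speaking)
Barge-in behavior (the user interrupting while the bot is speaking)
Real-time audio streaming constraints (16 kHz PCM, μ-law encoding)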
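
For readers who have not worked with the audio formats listed above, here is a minimal sketch of decoding G.711 μ-law bytes into 16-bit linear PCM, the form an energy-based VAD typically operates on. The function names are illustrative and do not come from any SDK.

```javascript
// Decode one G.711 mu-law byte into a signed 16-bit linear PCM sample.
function muLawDecodeSample(byte) {
  const u = ~byte & 0xff;            // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;             // top bit carries the sign
  const exponent = (u >> 4) & 0x07;  // 3-bit segment number
  const mantissa = u & 0x0f;         // 4-bit step within the segment
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decode a whole buffer of mu-law bytes into an Int16Array of PCM samples.
function muLawDecode(bytes) {
  const pcm = new Int16Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    pcm[i] = muLawDecodeSample(bytes[i]);
  }
  return pcm;
}
```

Telephony streams usually arrive as μ-law, so a decode step like this would sit between the incoming audio and any energy calculation.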

Production Considerations

Budget: roughly $0.02–$0.05 per minute, including STT + TTS

Audio Processing Pipeline

graph TD
    AudioCapture[Audio Capture] --> VAD[Voice Activity Detection]
    VAD --> STT[Speech‑to‑Text]
    STT --> LLM[Large Language Model]
    LLM --> TTS[Text‑to‑Speech]
    TTS --> AudioOutput[Audio Output]

    STT -->|Error| ErrorHandling[Error Handling]
    LLM -->|Error| ErrorHandling
    TTS -->|Error| ErrorHandling
    ErrorHandling -->|Retry| AudioCapture

The pipeline processes audio in 20 ms frames:

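The user speaks → audio is buffered in 20 ms frames
VAD analyzes the energy level
If silence ≥ the endpointing duration → the buffer is flushed to STT
The transcript is sent to the LLM → a response is synthesized → audio is streamed back in real time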
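
The steps above can be sketched as a small energy-based endpointer. This is a simplified illustration, not how VAPI implements VAD internally: the class name, the RMS energy measure, and the default threshold of 0.02 are all assumptions chosen for the example.

```javascript
const FRAME_MS = 20; // frame size used by the pipeline described above

// Root-mean-square energy of one frame of 16-bit PCM samples, normalised to 0..1.
function frameEnergy(samples) {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (const s of samples) {
    const x = s / 32768;
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Buffers frames of an utterance and hands them back once trailing silence
// has lasted at least `endpointingMs`. Default values are illustrative.
class EnergyEndpointer {
  constructor({ energyThreshold = 0.02, endpointingMs = 200 } = {}) {
    this.energyThreshold = energyThreshold;
    this.endpointingMs = endpointingMs;
    this.buffer = [];      // frames of the utterance in progress
    this.silenceMs = 0;    // trailing silence observed so far
    this.inSpeech = false; // true once a speech frame has been seen
  }

  // Feed one 20 ms frame. Returns the buffered frames when the utterance ends, else null.
  push(frame) {
    const isSpeech = frameEnergy(frame) >= this.energyThreshold;
    if (isSpeech) {
      this.inSpeech = true;
      this.silenceMs = 0;
      this.buffer.push(frame);
      return null;
    }
    if (!this.inSpeech) return null; // leading silence: nothing to flush yet
    this.silenceMs += FRAME_MS;
    this.buffer.push(frame);
    if (this.silenceMs < this.endpointingMs) return null;
    const utterance = this.buffer;
    this.buffer = [];
    this.silenceMs = 0;
    this.inSpeech = false;
    return utterance; // the caller would forward this to STT
  }
}
```

At 16 kHz, one 20 ms frame is 320 samples. A lower `endpointingMs` makes the bot respond sooner but raises the chance of cutting in during a thinking pause, which is the trade-off the rest of this article tunes.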

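A race condition can occur when VAD fires while STT is still processing the previous chunk, which can produce duplicate responses. Prevent it with explicit turn-state tracking.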
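
One way to track turn state is to stamp each utterance with a turn id and let only the latest turn claim the right to respond. The sketch below is a minimal version of that idea; `TurnTracker`, `handleUtterance`, and the `transcribe`/`respond`/`speak` callbacks are hypothetical names, and a production system might prefer to merge the transcripts of overlapping turns rather than drop the older one.

```javascript
// Tracks which user turn is current so that late STT/LLM results from an
// older turn are dropped instead of producing a second reply.
class TurnTracker {
  constructor() {
    this.currentTurn = 0;  // id of the most recent turn
    this.lastAnswered = 0; // id of the most recent turn that got a reply
  }

  // Call when VAD flushes a new utterance to STT. Returns the new turn id.
  beginTurn() {
    this.currentTurn += 1;
    return this.currentTurn;
  }

  // Call just before speaking. Returns true at most once, and only for the latest turn.
  claimResponse(turnId) {
    if (turnId !== this.currentTurn) return false; // stale: a newer turn has started
    if (turnId <= this.lastAnswered) return false; // duplicate: already answered
    this.lastAnswered = turnId;
    return true;
  }
}

// Example pipeline step: the reply is spoken only if this turn is still current.
async function handleUtterance(tracker, audio, { transcribe, respond, speak }) {
  const turnId = tracker.beginTurn();
  const text = await transcribe(audio);
  const reply = await respond(text);
  if (tracker.claimResponse(turnId)) {
    await speak(reply);
  }
}
```

If a second utterance arrives while the first is still in STT, the first turn's `claimResponse` returns false and only one reply is spoken.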

Step 1: Configure Twilio for Inbound Calls

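const express = require('express');
const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded bodies

// Twilio calls this webhook when an inbound call arrives
app.post('/voice/inbound', (req, res) => {
  // The XML declaration must be the very first bytes of the response,
  // so the template literal starts without leading whitespace.
  const twiml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="speech" action="/voice/handle" method="POST">
    <Say>Welcome, please tell me how I can help.</Say>
  </Gather>
</Response>`;
  res.type('text/xml');
  res.send(twiml);
});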
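
The `<Gather>` above posts its result to `/voice/handle`, which the article does not show. The following is a minimal sketch of what that handler could return. `escapeXml` and `buildReplyTwiml` are hypothetical helpers; the only Twilio-specific detail relied on is that the speech transcript arrives in the `SpeechResult` form field.

```javascript
// Escape text so it is safe to place inside a TwiML element.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Build TwiML that speaks a reply and then listens for the next utterance.
function buildReplyTwiml(replyText) {
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather input="speech" action="/voice/handle" method="POST">
    <Say>${escapeXml(replyText)}</Say>
  </Gather>
</Response>`;
}

// Possible wiring, assuming an Express `app` with express.urlencoded() parsing:
// app.post('/voice/handle', (req, res) => {
//   const heard = req.body.SpeechResult || '';
//   res.type('text/xml').send(buildReplyTwiml(`You said: ${heard}`));
// });
```

Escaping matters here because transcripts and LLM output can contain characters such as `&` or `<` that would otherwise produce invalid XML.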

Step 2: Implement Backchanneling via Prompt Engineering

const systemPrompt = `You are a natural conversationalist. Rules:
1. Use backchannels ("mm-hmm", "I see", "go on") when user pauses mid‑thought.
2. Detect incomplete sentences (trailing "and...", "so...") and wait.
3. Keep responses under 15 words unless the user asks for detail.
4. Never say "How can I help you?" – jump straight to the topic.`;

Backchannels are generated by the LLM, not by VAD.

Step 3: Handle Barge‑in at the Audio Buffer Level

const callConfig = {
  assistant: assistantConfig,
  backgroundSound: "office", // Enables barge‑in detection
  recordingEnabled: true
};

When VAD detects new speech during TTS playback, the audio buffer must be flushed immediately. With backgroundSound set, VAPI does this automatically.
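
For pipelines where you manage playback yourself, the flush can be sketched as a queue that discards pending audio and rejects late chunks from the interrupted reply. `PlaybackQueue` and its generation counter are hypothetical constructs for illustration, not part of the VAPI API.

```javascript
// Hypothetical outbound audio queue: TTS pushes chunks in, a sender drains them.
class PlaybackQueue {
  constructor() {
    this.chunks = [];
    this.generation = 0; // bumped on every flush so late TTS chunks can be rejected
  }

  // TTS calls this with the generation that was current when its reply started.
  enqueue(chunk, generation) {
    if (generation !== this.generation) return false; // chunk belongs to an interrupted reply
    this.chunks.push(chunk);
    return true;
  }

  // The sender calls this to get the next chunk to play, or undefined when idle.
  next() {
    return this.chunks.shift();
  }

  // Call as soon as VAD reports user speech during playback.
  // Returns how many queued chunks were discarded.
  flush() {
    const dropped = this.chunks.length;
    this.chunks = [];
    this.generation += 1;
    return dropped;
  }
}
```

Clearing the array alone is not enough, because a streaming TTS request may keep delivering chunks after the interruption. The generation check is what keeps that stale audio from being played.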

Testing Guidelines

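Pause test: call your Twilio number, say a sentence, pause for 300 ms, then continue. The bot should not interrupt. If it does, increase endpointing in 50 ms steps.
Barge-in test: start speaking while the bot is talking. The audio should be cut off within about 200 ms. Confirm that backgroundSound is enabled.
Noise robustness: test in noisy environments (coffee shop, car). If false triggers occur, raise endpointing to 300 ms or higher.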
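
The 50 ms stepping from the pause test can be written down as a small loop. This is only a sketch of the procedure: `botInterruptsAt` is a hypothetical probe that would run the pause test at a given endpointing value (in practice often a manual call) and report whether the bot cut in. The 1000 ms ceiling is an arbitrary assumption.

```javascript
// Raise endpointing in fixed steps until the pause test passes.
// Returns the first passing value in ms, or null if none passes up to maxMs.
function tuneEndpointing(botInterruptsAt, { startMs = 200, stepMs = 50, maxMs = 1000 } = {}) {
  for (let ms = startMs; ms <= maxMs; ms += stepMs) {
    if (!botInterruptsAt(ms)) return ms;
  }
  return null;
}
```

Stopping at the first passing value keeps response latency as low as the pause test allows.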

Example VAD Threshold Test

const testVADConfig = {
  transcriber: {
    provider: "deepgram",
    model: "nova-2",
    language: "en",
    endpointing: 200 // aggressive start
  }
};

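// Measures how long it takes for TTS audio to stop after the user starts speaking.
// `triggerBargeIn` is a placeholder: it should resolve once playback has actually stopped.
async function testBargeIn(triggerBargeIn) {
  console.log('Testing barge-in at 1.2 s into TTS playback...');
  const startTime = Date.now();
  await triggerBargeIn();
  const latencyMs = Date.now() - startTime;
  if (latencyMs > 300) {
    console.error(`VAD latency was ${latencyMs} ms (limit 300 ms): adjust endpointing`);
  }
  return latencyMs;
}

// Stand-in for a real call: pretends the audio is cut after 150 ms.
testBargeIn(() => new Promise((resolve) => setTimeout(resolve, 150)));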

Webhook Signature Verification

// Verify VAPI webhook signature
app.post('/webhook/vapi', (req, res) => {
  const signature = req.headers['x-vapi-signature'];
  const secret = process.env.VAPI_SECRET;
  // (Insert HMAC verification logic here)
  if (!isValidSignature(signature, secret, req.body)) {
    return res.status(401).send('Invalid signature');
  }
  // Process webhook payload...
  res.sendStatus(200);
});
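
The handler above leaves `isValidSignature` undefined. A possible implementation is sketched below using Node's built-in crypto module. The signing scheme shown (HMAC-SHA256 over the request body, hex-encoded) is an assumption made for illustration and is not confirmed by the article, so check the VAPI documentation for the scheme your account actually uses.

```javascript
const nodeCrypto = require('node:crypto');

// Hypothetical HMAC check. Assumes the sender signs the body with
// HMAC-SHA256 and sends the digest as a hex string.
function isValidSignature(signature, secret, body) {
  if (typeof signature !== 'string' || !secret) return false;
  // Prefer the raw request bytes: re-serialising a parsed object may not
  // reproduce the exact bytes the sender signed.
  const payload =
    typeof body === 'string' || Buffer.isBuffer(body) ? body : JSON.stringify(body);
  const expected = nodeCrypto.createHmac('sha256', secret).update(payload).digest('hex');
  const given = Buffer.from(signature, 'utf8');
  const wanted = Buffer.from(expected, 'utf8');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return given.length === wanted.length && nodeCrypto.timingSafeEqual(given, wanted);
}
```

The constant-time comparison avoids leaking how many leading characters of a forged signature were correct.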