Implementing Real-Time Streaming with VAPI: Building a Voice App

Published: 1 week ago (December 11, 2025, 07:26 AM GMT+9)

6 min read

Source: Dev.to

TL;DR

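Most voice apps break when network jitter exceeds 200 ms or when a user interrupts mid-sentence. This guide shows how to build a production-grade streaming voice application using VAPI's WebRTC voice integration and Twilio for call routing. It covers real-time audio processing, correct barge‑in detection, and managing session state without race conditions. Outcome: response latency under 500 ms and smooth interruption handling.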

API Access & Authentication

VAPI API key – issued at dashboard.vapi.ai
Twilio Account SID and Auth Token – found in the Twilio console
A Twilio phone number with voice capability enabled

Development Environment

Node.js 18+ (native fetch support, required for the streaming APIs)
A public HTTPS endpoint for webhooks (e.g., ngrok, Railway, or a real domain)
A valid SSL certificate (required for the WebRTC voice integration)

Network Requirements

Outbound HTTPS (port 443) for VAPI/Twilio API calls
Inbound webhook requests must be answered within 5 s
WebSocket support for real-time voice streaming connections

Technical Knowledge

Async/await patterns (streaming audio processing is non-blocking)
Webhook signature validation (security is not optional)
Basic PCM audio format (16 kHz, 16‑bit) – for voice applications
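
To make the PCM numbers above concrete, here is a small arithmetic sketch of how much data a 16 kHz, 16‑bit stream produces. The mono channel count and the 20 ms frame size are assumptions chosen for illustration; the guide does not state them.

```javascript
// Size arithmetic for raw PCM audio (no container, no compression).
const SAMPLE_RATE_HZ = 16000; // from the prerequisites: 16 kHz
const BITS_PER_SAMPLE = 16;   // from the prerequisites: 16-bit
const CHANNELS = 1;           // assumption: mono audio

// Bytes produced per second of audio.
function pcmBytesPerSecond(sampleRate = SAMPLE_RATE_HZ, bits = BITS_PER_SAMPLE, channels = CHANNELS) {
  return sampleRate * (bits / 8) * channels;
}

// Bytes in one frame of the given duration in milliseconds.
function pcmFrameBytes(frameMs, sampleRate = SAMPLE_RATE_HZ, bits = BITS_PER_SAMPLE, channels = CHANNELS) {
  return (pcmBytesPerSecond(sampleRate, bits, channels) * frameMs) / 1000;
}

console.log(pcmBytesPerSecond()); // 32000
console.log(pcmFrameBytes(20));   // 640
```

At these settings one second of audio is 32,000 bytes, so a 20 ms frame is 640 bytes.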

Cost Awareness

VAPI charges per minute of voice streaming
Twilio bills per call plus per-minute usage (IVR systems)

Streaming Implementation Details

Most streaming implementations fail because they treat VAPI like a conventional REST API. VAPI requires a stateful WebSocket that carries a bidirectional audio stream.

// Server‑side assistant configuration – production grade
const assistantConfig = {
  model: {
    provider: "openai",
    model: "gpt-4",
    temperature: 0.7,
    messages: [
      {
        role: "system",
        content: "You are a voice assistant. Keep responses under 2 sentences."
      }
    ]
  },
  voice: {
    provider: "11labs",
    voiceId: "21m00Tcm4TlvDq8ikWAM",
    stability: 0.5,
    similarityBoost: 0.75
  },
  transcriber: {
    provider: "deepgram",
    model: "nova-2",
    language: "en-US"
  },
  firstMessage: "How can I help you today?",
  endCallMessage: "Thanks for calling. Goodbye.",
  recordingEnabled: true
};

Note: The transcriber config is critical. Default models add 200‑400 ms latency; Deepgram’s nova-2 reduces this to 80‑120 ms at a higher cost.

Architecture diagram

flowchart LR
    A[User Browser] -->|WebSocket| B[VAPI SDK]
    B -->|Audio Stream| C[VAPI Platform]
    C -->|STT| D[Deepgram]
    C -->|LLM| E[OpenAI]
    C -->|TTS| F[ElevenLabs]
    C -->|Events| G[Your Webhook Server]
    G -->|Function Results| C

Audio flows through the VAPI platform, not your backend. Proxying the audio adds 500 ms or more of latency and breaks streaming.

Client‑Side Setup

import Vapi from "@vapi-ai/web";

const vapi = new Vapi(process.env.VAPI_PUBLIC_KEY);

// Race-condition guard; declared here so the handlers below can reset it
let isProcessing = false;

// Set up event handlers before starting the stream
vapi.on("call-start", () => {
  console.log("Stream active");
  isProcessing = false; // reset race‑condition guard
});

vapi.on("speech-start", () => {
  console.log("User speaking – cancel any queued TTS");
});

vapi.on("message", (message) => {
  if (message.type === "transcript" && message.transcriptType === "partial") {
    // Show live transcription – do NOT act on it yet
    updateUI(message.transcript);
  }
});

vapi.on("error", (error) => {
  console.error("Stream error:", error);
  // Implement retry logic for mobile network drops
});

// Start the streaming call
await vapi.start(assistantConfig);
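
The error handler above leaves retry logic as a comment. One common way to fill it in is exponential backoff, sketched below. The helper names, the attempt count, and the delay values are all hypothetical, and the sketch assumes the start function returns a promise that rejects on failure, which should be checked against the SDK you are using.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Call `startFn` until it resolves or `maxAttempts` is reached.
// `startFn` is any function returning a promise, e.g. () => vapi.start(assistantConfig).
// `sleep` can be injected for testing; by default it waits with setTimeout.
async function startWithRetry(startFn, { maxAttempts = 4, baseMs = 500, maxMs = 8000, sleep } = {}) {
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await startFn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final failure.
      if (attempt < maxAttempts - 1) {
        await wait(backoffDelayMs(attempt, baseMs, maxMs));
      }
    }
  }
  throw lastError;
}
```

With the defaults, the waits between attempts are 500 ms, 1 s, and 2 s, so a client on a flaky mobile network gives up after roughly 3.5 s of waiting rather than retrying forever.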

Race‑condition warning: process a transcript only when transcriptType === "final" to avoid duplicate LLM requests.
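
A minimal sketch of that guard follows. `TurnGuard` is a hypothetical helper, not part of the VAPI SDK; it only relies on the `type`, `transcriptType`, and `transcript` fields used in the handler above. Partial transcripts update a preview string, and a final transcript is handled only if no other one is still in flight.

```javascript
// Hypothetical helper: lets at most one final transcript be processed at a time.
class TurnGuard {
  constructor(onFinalTranscript) {
    this.onFinalTranscript = onFinalTranscript; // async callback that does the real work
    this.isProcessing = false;                  // true while a final transcript is being handled
    this.livePreview = '';                      // latest partial transcript, for display only
  }

  // Feed every "message" event here. Resolves to true if the message was processed.
  async handleMessage(message) {
    if (message.type !== 'transcript') return false;
    if (message.transcriptType === 'partial') {
      this.livePreview = message.transcript; // show it, never act on it
      return false;
    }
    if (message.transcriptType !== 'final' || this.isProcessing) return false;
    this.isProcessing = true;
    try {
      await this.onFinalTranscript(message.transcript);
    } finally {
      this.isProcessing = false; // always release the guard, even on errors
    }
    return true;
  }
}
```

This version drops a final transcript that arrives while another is being processed. Queueing it instead is an equally reasonable choice if you would rather not lose user input.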

Server‑Side Webhook

const express = require('express');
const crypto = require('crypto');
const app = express();

app.use(express.json());

// Validate webhook signature – mandatory
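function validateSignature(req) {
  const signature = req.headers['x-vapi-signature'];
  if (typeof signature !== 'string') return false; // header missing or malformed
  // Caveat: re-serializing the parsed body may not match the raw bytes that were signed
  const payload = JSON.stringify(req.body);
  const hash = crypto
    .createHmac('sha256', process.env.VAPI_SERVER_SECRET)
    .update(payload)
    .digest('hex');
  const expected = Buffer.from(hash);
  const received = Buffer.from(signature);
  // Constant-time comparison; timingSafeEqual throws on unequal lengths, so check first
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}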

app.post('/webhook/vapi', async (req, res) => {
  if (!validateSignature(req)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  const { message } = req.body;

  // Handle function calls from the assistant
  if (message.type === 'function-call') {
    const { functionCall } = message;
    try {
      // Timeout trap: VAPI expects a response within 5 seconds.
      // If a function needs more time, return immediately and use a callback mechanism;
      // otherwise the call drops.
    } catch (err) {
      console.error('Function call error:', err);
    }
  }

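  res.status(200).json({ received: true });
});

// Listen on port 3000, the port exposed through ngrok in Local Testing
app.listen(3000, () => console.log('Webhook server listening on port 3000'));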
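
The timeout trap in the handler can be guarded against with a deadline wrapper. The sketch below races the real work against a timer and falls back to a placeholder value if the work is too slow. `withDeadline`, the 4000 ms margin, and the fallback payload are hypothetical; the response shape VAPI actually expects for function calls should be taken from its documentation.

```javascript
// Resolve with the result of `work`, or with `fallback` if `work` takes longer than `limitMs`.
function withDeadline(work, limitMs, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), limitMs);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical usage inside the function-call branch of the webhook handler:
//   const result = await withDeadline(runFunction(functionCall), 4000, { pending: true });
//   return res.status(200).json(result);
```

Setting the limit somewhat below 5 s leaves room for network transit and JSON serialization before the platform gives up on the request.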

Audio Processing Pipeline

graph LR
    Mic[Microphone Input] --> AudioBuf[Audio Buffering]
    AudioBuf --> VAD[Voice Activity Detection]
    VAD -->|Detected| STT[Speech‑to‑Text]
    VAD -->|Not Detected| Error[Error Handling]
    STT --> NLU[Intent Recognition]
    NLU --> API[API Integration]
    API --> LLM[Response Generation]
    LLM --> TTS[Text‑to‑Speech]
    TTS --> Speaker[Speaker Output]
    Error -->|Retry| AudioBuf
    Error -->|Fail| Speaker
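
To illustrate the Voice Activity Detection stage in the diagram, here is a deliberately simple energy-based detector. This is not how VAPI detects speech internally (the guide does not say); it is a basic stand-in technique that compares the RMS level of a 16‑bit PCM frame against a threshold. The threshold value is an arbitrary illustration, not a tuned setting.

```javascript
// Root-mean-square level of a frame of 16-bit little-endian PCM, normalized to 0..1.
function frameRms(pcmBuffer) {
  const sampleCount = Math.floor(pcmBuffer.length / 2);
  if (sampleCount === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = pcmBuffer.readInt16LE(i * 2) / 32768; // scale to -1..1
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / sampleCount);
}

// Treat a frame as speech when its RMS level exceeds the threshold.
// The default of 0.02 is an illustrative value only.
function isSpeech(pcmBuffer, threshold = 0.02) {
  return frameRms(pcmBuffer) > threshold;
}
```

Production detectors usually add smoothing across several frames so that a single loud click or a short pause does not flip the decision.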

Local Testing

Run your server and expose it

# Terminal 1 – start the webhook server
node server.js

# Terminal 2 – expose via ngrok
ngrok http 3000

# Terminal 3 – forward VAPI webhooks to the public URL
vapi webhooks forward https://.ngrok.io/webhook/vapi

Add debug logging to the webhook

app.post('/webhook', (req, res) => {
  const { message } = req.body;

  console.log('Event received:', {
    type: message.type,
    timestamp: new Date().toISOString(),
    callId: message.call?.id,
    payload: JSON.stringify(message, null, 2)
  });

  // Validate signature before processing
  const isValid = validateSignature(req);
  if (!isValid) {
    console.error('Invalid signature – potential security issue');
    return res.status(401).json({ error: 'Invalid signature' });
  }

  res.status(200).json({ received: true });
});

Verify signature validation with `curl`

# Expected to fail with 401 Unauthorized
curl -X POST http://localhost:3000/webhook \
  -H "Content-Type: application/json" \
  -H "x-vapi-signature: invalid_signature" \
  -d '{"message":{"type":"status-update"}}'
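
To test the success path as well, you need a signature the server will accept. The sketch below computes one using the same scheme as validateSignature in this guide: a hex HMAC‑SHA256 of the JSON body. `signBody` and `signatureMatches` are hypothetical helper names, and the scheme itself is taken from this article rather than confirmed against VAPI's documentation.

```javascript
const nodeCrypto = require('crypto');

// Hex HMAC-SHA256 of a request body, mirroring validateSignature in this guide.
function signBody(body, secret) {
  const payload = typeof body === 'string' ? body : JSON.stringify(body);
  return nodeCrypto.createHmac('sha256', secret).update(payload).digest('hex');
}

// Constant-time check of a received signature against the expected one.
function signatureMatches(body, signature, secret) {
  const expected = Buffer.from(signBody(body, secret));
  const received = Buffer.from(String(signature));
  return received.length === expected.length && nodeCrypto.timingSafeEqual(received, expected);
}

// Print a signature that can be pasted into the x-vapi-signature header of the curl call.
console.log(signBody({ message: { type: 'status-update' } }, process.env.VAPI_SERVER_SECRET || 'test-secret'));
```

Because the server re-serializes the parsed body, the JSON passed to `-d` must serialize to exactly the same string that was signed, including key order and the absence of extra whitespace.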

Monitor that response times stay under the 5 s webhook timeout, and check the logs for validation failures. Such failures can indicate a configuration mismatch or a replay attack.
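
One way to monitor response times is a small Express-style middleware that measures how long each request takes to finish. In the sketch below, `responseTimer`, the `report` callback, and the 80% warning margin are hypothetical choices; only the 5 s limit comes from this guide.

```javascript
const WEBHOOK_TIMEOUT_MS = 5000; // the 5 s limit stated in this guide

// Logs the timing; warns when the response is close to or over the timeout.
function defaultReport({ path, elapsedMs, overBudget }) {
  const log = overBudget ? console.warn : console.log;
  log(`${path} responded in ${elapsedMs.toFixed(1)} ms`);
}

// Express-style middleware: measures time from request arrival to response finish.
// `report` receives { path, elapsedMs, overBudget } for each completed response.
function responseTimer(report = defaultReport) {
  return (req, res, next) => {
    const startedAt = process.hrtime.bigint();
    res.on('finish', () => {
      const elapsedMs = Number(process.hrtime.bigint() - startedAt) / 1e6;
      // Flag responses that use more than 80% of the budget (arbitrary margin).
      report({ path: req.path, elapsedMs, overBudget: elapsedMs > WEBHOOK_TIMEOUT_MS * 0.8 });
    });
    next();
  };
}
```

It would be registered with `app.use(responseTimer())` before the webhook routes so that every handler is timed.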

Applying these patterns lets you ship a robust streaming voice app that handles interruptions and network variability gracefully while keeping latency low.
