I Built a Real-Time Voice AI in 50 Minutes. Here's How (and Why)
The Challenge
I wanted to build something ambitious: a system where you could talk to AI clones of anyone and get responses in their actual voice.
Requirements
- Real‑time voice processing
- Sub‑second latency
- No app installation needed (just scan a QR code)
- Ethical voice cloning
- Works with free/cheap API tiers
Traditionally, this would take days of debugging WebSocket issues, fighting API rate limits, and wiring up voice synthesis. I wanted to see how fast it could actually be done.
Architecture Overview
Your Phone → Whisper (STT) → OpenRouter (LLM) → VoiSpark (TTS) → Your Ears
- Speech‑to‑Text: OpenAI Whisper
- LLM: OpenRouter (access to Claude, GPT‑4, Llama, and more)
- Text‑to‑Speech + Voice Cloning: VoiSpark
- Transport: WebSocket (low latency, bidirectional)
- Infrastructure: Node.js + Express + ngrok
- Frontend: Next.js + TailwindCSS
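To make the transport layer concrete, here is a minimal sketch of the backend entry point. Assumptions: the express and ws npm packages, and a handleAudioChunk function that is hypothetical here and sketched under Flow below; the actual repo may wire things differently.

```typescript
// server.ts: minimal sketch of the transport layer, not the repo's actual entry point.
// Assumes the express and ws packages; handleAudioChunk is hypothetical (sketched below).
import express from "express";
import { createServer } from "http";
import { WebSocketServer, WebSocket } from "ws";
import { handleAudioChunk } from "./pipeline"; // hypothetical module, sketched under Flow

const app = express();
app.use(express.static("public")); // serves the QR-code landing page

const server = createServer(app);
const wss = new WebSocketServer({ server, path: "/audio" });

wss.on("connection", (socket: WebSocket) => {
  socket.on("message", async (chunk: Buffer) => {
    // Each binary frame is a slice of microphone audio from the phone.
    const replyAudio = await handleAudioChunk(chunk); // STT -> LLM -> TTS
    socket.send(replyAudio); // stream the synthesized answer straight back
  });
});

server.listen(3000, () => console.log("Listening on http://localhost:3000"));
```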
Flow
- You speak into your phone (no app install — just scan a QR code).
- Audio is streamed to the backend via WebSocket.
- Whisper transcribes your speech to text.
- OpenRouter sends the text to the chosen LLM with a persona prompt.
- The LLM response is synthesized by VoiSpark in the cloned voice.
- Audio is streamed back — you hear the answer in their voice.
Total round‑trip latency: sub‑second.
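Here is how that hop-by-hop flow might look in code. The Whisper and OpenRouter calls use the openai npm SDK (OpenRouter exposes an OpenAI-compatible endpoint); the VoiSpark request is a hypothetical placeholder sketched from the architecture, not from VoiSpark's actual docs, so check their API reference for the real route and payload.

```typescript
// pipeline.ts: one pass through STT -> LLM -> TTS. Sketch only.
// Assumptions: the openai npm SDK for Whisper; the same SDK pointed at
// OpenRouter's OpenAI-compatible endpoint; a HYPOTHETICAL VoiSpark REST
// call (consult VoiSpark's docs for the real route and payload).
import OpenAI, { toFile } from "openai";

const whisper = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const router = new OpenAI({
  apiKey: process.env.OPENROUTER_API_KEY,
  baseURL: "https://openrouter.ai/api/v1", // OpenRouter speaks the OpenAI API
});

const PERSONA = "You are <persona>. Answer in their voice, in 1-2 sentences.";

export async function handleAudioChunk(chunk: Buffer): Promise<Buffer> {
  // 1. Speech-to-text with Whisper.
  const stt = await whisper.audio.transcriptions.create({
    model: "whisper-1",
    file: await toFile(chunk, "speech.webm"),
  });

  // 2. Persona-prompted reply from whichever model OpenRouter routes to.
  const chat = await router.chat.completions.create({
    model: "anthropic/claude-3.5-sonnet", // any OpenRouter model id works here
    messages: [
      { role: "system", content: PERSONA },
      { role: "user", content: stt.text },
    ],
  });
  const reply = chat.choices[0].message.content ?? "";

  // 3. Text-to-speech in the cloned voice. HYPOTHETICAL endpoint and payload:
  const tts = await fetch("https://api.voispark.example/v1/tts", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.VOISPARK_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text: reply, voice_id: process.env.VOICE_ID }),
  });
  return Buffer.from(await tts.arrayBuffer());
}
```

Each await above is one network hop, which is where the latency budget lives; the streaming variants of these APIs would tighten the round trip further.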
Quick Start
```bash
git clone https://github.com/MatheusSimonaci/clone-talking
cd clone-talking
npm install
# Set your API keys in .env
npm start
# Open http://localhost:3000
# Scan the QR code from your phone
# Start talking
```
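The QR code is nothing magic: it just encodes the public tunnel URL so your phone's browser can open the page. A small sketch using the qrcode npm package (my package choice for illustration; the repo may render it differently):

```typescript
// qr.ts: print a scannable QR code for the public tunnel URL.
// Assumes the qrcode npm package; NGROK_URL is whatever `ngrok http 3000` reports.
import QRCode from "qrcode";

const url = process.env.NGROK_URL ?? "http://localhost:3000";
QRCode.toString(url, { type: "terminal", small: true }).then((qr) => {
  console.log(qr);
  console.log("Scan to open:", url);
});
```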
You need free-tier credentials from four services:
- OpenAI (Whisper)
- OpenRouter
- VoiSpark
- ngrok
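A minimal .env might look like the following; the variable names are assumptions, so match whatever the code actually reads:

```
# .env (placeholder values; variable names are assumptions)
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...
VOISPARK_API_KEY=...
NGROK_AUTHTOKEN=...
```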
Ethical Considerations
Voice cloning is powerful — and risky. I deliberately chose a TTS provider that explicitly allows synthetic voice generation within their terms of service. Building cool tech should not ignore ethics.
Features
- Custom voice training (upload your own voice sample)
- Multi‑language support
- Conversation memory across sessions (sketched below)
- Integration with external knowledge bases
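Conversation memory can be as simple as a per-session message log replayed into each LLM call. A sketch, assuming an in-memory Map of my own construction (true persistence across sessions would need a store such as Redis or a database):

```typescript
// memory.ts: naive per-session conversation memory (assumed design, not the repo's).
type Msg = { role: "system" | "user" | "assistant"; content: string };

const sessions = new Map<string, Msg[]>();

export function remember(sessionId: string, msg: Msg): Msg[] {
  const history = sessions.get(sessionId) ?? [];
  history.push(msg);
  // Bound the context window: keep only the 20 most recent messages.
  const trimmed = history.slice(-20);
  sessions.set(sessionId, trimmed);
  return trimmed;
}
```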
Contributing
Contributions are welcome; the project is released under the MIT License.
GitHub repository: https://github.com/MatheusSimonaci/clone-talking
Demo video: