I Built a Real-Time Voice AI in 50 Minutes. Here's How (and Why)
The Challenge
I wanted to build something ambitious: a system where you could talk to AI clones of anyone and get responses in their actual voice.
Requirements
- Real‑time voice processing
- Sub‑second latency
- No app installation needed (just scan a QR code)
- Ethical voice cloning
- Works with free/cheap API tiers
Traditionally, this would take days of debugging WebSocket issues, fighting API rate limits, and wiring up voice synthesis. I wanted to see how fast it could actually be done.
Architecture Overview
Your Phone → Whisper (STT) → OpenRouter (LLM) → VoiSpark (TTS) → Your Ears
- Speech‑to‑Text: OpenAI Whisper
- LLM: OpenRouter (access to Claude, GPT‑4, Llama, and more)
- Text‑to‑Speech + Voice Cloning: VoiSpark
- Transport: WebSocket (low latency, bidirectional)
- Infrastructure: Node.js + Express + ngrok
- Frontend: Next.js + TailwindCSS
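To make the transport layer concrete, here is a minimal sketch of the backend entry point. Assumptions: the express and ws npm packages, and a handleAudioChunk function that is hypothetical here and sketched under Flow below; the actual repo may wire things differently.

```typescript
// server.ts: minimal sketch of the transport layer, not the repo's actual entry point.
// Assumes the express and ws packages; handleAudioChunk is hypothetical (sketched below).
import express from "express";
import { createServer } from "http";
import { WebSocketServer, WebSocket } from "ws";
import { handleAudioChunk } from "./pipeline"; // hypothetical module, sketched under Flow

const app = express();
app.use(express.static("public")); // serves the QR-code landing page

const server = createServer(app);
const wss = new WebSocketServer({ server, path: "/audio" });

wss.on("connection", (socket: WebSocket) => {
  socket.on("message", async (chunk: Buffer) => {
    // Each binary frame is a slice of microphone audio from the phone.
    const replyAudio = await handleAudioChunk(chunk); // STT -> LLM -> TTS
    socket.send(replyAudio); // stream the synthesized answer straight back
  });
});

server.listen(3000, () => console.log("Listening on http://localhost:3000"));
```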
Flow
- You speak into your phone (no app install — just scan a QR code).
- Audio is streamed to the backend via WebSocket.
- Whisper transcribes your speech to text.
- OpenRouter sends the text to the chosen LLM with a persona prompt.
- The LLM response is synthesized by VoiSpark in the cloned voice.
- Audio is streamed back — you hear the answer in their voice.
Total round‑trip latency: sub‑second.
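Here is how that hop-by-hop flow might look in code. The Whisper and OpenRouter calls use the openai npm SDK (OpenRouter exposes an OpenAI-compatible endpoint); the VoiSpark request is a hypothetical placeholder sketched from the architecture, not from VoiSpark's actual docs, so check their API reference for the real route and payload.

```typescript
// pipeline.ts: one pass through STT -> LLM -> TTS. Sketch only.
// Assumptions: the openai npm SDK for Whisper; the same SDK pointed at
// OpenRouter's OpenAI-compatible endpoint; a HYPOTHETICAL VoiSpark REST
// call (consult VoiSpark's docs for the real route and payload).
import OpenAI, { toFile } from "openai";

const whisper = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const router = new OpenAI({
  apiKey: process.env.OPENROUTER_API_KEY,
  baseURL: "https://openrouter.ai/api/v1", // OpenRouter speaks the OpenAI API
});

const PERSONA = "You are <persona>. Answer in their voice, in 1-2 sentences.";

export async function handleAudioChunk(chunk: Buffer): Promise<Buffer> {
  // 1. Speech-to-text with Whisper.
  const stt = await whisper.audio.transcriptions.create({
    model: "whisper-1",
    file: await toFile(chunk, "speech.webm"),
  });

  // 2. Persona-prompted reply from whichever model OpenRouter routes to.
  const chat = await router.chat.completions.create({
    model: "anthropic/claude-3.5-sonnet", // any OpenRouter model id works here
    messages: [
      { role: "system", content: PERSONA },
      { role: "user", content: stt.text },
    ],
  });
  const reply = chat.choices[0].message.content ?? "";

  // 3. Text-to-speech in the cloned voice. HYPOTHETICAL endpoint and payload:
  const tts = await fetch("https://api.voispark.example/v1/tts", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.VOISPARK_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text: reply, voice_id: process.env.VOICE_ID }),
  });
  return Buffer.from(await tts.arrayBuffer());
}
```

Each await above is one network hop, which is where the latency budget lives; the streaming variants of these APIs would tighten the round trip further.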
Quick Start
```bash
git clone https://github.com/MatheusSimonaci/clone-talking
cd clone-talking
npm install
# Set your API keys in .env
npm start
# Open http://localhost:3000
# Scan the QR code from your phone
# Start talking
```
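The QR code is nothing magic: it just encodes the public tunnel URL so your phone's browser can open the page. A small sketch using the qrcode npm package (my package choice for illustration; the repo may render it differently):

```typescript
// qr.ts: print a scannable QR code for the public tunnel URL.
// Assumes the qrcode npm package; NGROK_URL is whatever `ngrok http 3000` reports.
import QRCode from "qrcode";

const url = process.env.NGROK_URL ?? "http://localhost:3000";
QRCode.toString(url, { type: "terminal", small: true }).then((qr) => {
  console.log(qr);
  console.log("Scan to open:", url);
});
```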
You need free-tier credentials from four services:
- OpenAI (Whisper)
- OpenRouter
- VoiSpark
- ngrok
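A minimal .env might look like the following; the variable names are assumptions, so match whatever the code actually reads:

```
# .env (placeholder values; variable names are assumptions)
OPENAI_API_KEY=sk-...
OPENROUTER_API_KEY=sk-or-...
VOISPARK_API_KEY=...
NGROK_AUTHTOKEN=...
```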
Ethical Considerations
Voice cloning is powerful — and risky. I deliberately chose a TTS provider that explicitly allows synthetic voice generation within their terms of service. Building cool tech should not ignore ethics.
Features
- Custom voice training (upload your own voice sample)
- Multi‑language support
- Conversation memory across sessions (sketched below)
- Integration with external knowledge bases
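Conversation memory can be as simple as a per-session message log replayed into each LLM call. A sketch, assuming an in-memory Map of my own construction (true persistence across sessions would need a store such as Redis or a database):

```typescript
// memory.ts: naive per-session conversation memory (assumed design, not the repo's).
type Msg = { role: "system" | "user" | "assistant"; content: string };

const sessions = new Map<string, Msg[]>();

export function remember(sessionId: string, msg: Msg): Msg[] {
  const history = sessions.get(sessionId) ?? [];
  history.push(msg);
  // Bound the context window: keep only the 20 most recent messages.
  const trimmed = history.slice(-20);
  sessions.set(sessionId, trimmed);
  return trimmed;
}
```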
Contributing
Contributions are welcome; the project is released under the MIT License.
GitHub repository: https://github.com/MatheusSimonaci/clone-talking
Demo video: