I Built a Real-Time Voice AI in 50 Minutes. Here's How (and Why)

Published: May 4, 2026 at 01:53 PM EDT
2 min read
Source: Dev.to

The Challenge

I wanted to build something ambitious: a system where you could talk to AI clones of anyone and get responses in their actual voice.

Requirements

  • Real‑time voice processing
  • Sub‑second latency
  • No app installation needed (just scan a QR code)
  • Ethical voice cloning
  • Works with free/cheap API tiers

Traditionally, this would take days of wrestling with WebSocket issues, API rate limits, and voice‑synthesis integration. I wanted to see how fast it could actually be done.

Architecture Overview

Your Phone → Whisper (STT) → OpenRouter (LLM) → VoiSpark (TTS) → Your Ears
  • Speech‑to‑Text: OpenAI Whisper
  • LLM: OpenRouter (access to Claude, GPT‑4, Llama, and more)
  • Text‑to‑Speech + Voice Cloning: VoiSpark
  • Transport: WebSocket (low latency, bidirectional)
  • Infrastructure: Node.js + Express + ngrok
  • Frontend: Next.js + TailwindCSS
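The whole architecture boils down to three async stages chained together. A minimal sketch of that chain, with the stage implementations injected as stand-ins (the function names here are illustrative, not the repo's actual API):

```javascript
// Hypothetical sketch of the STT -> LLM -> TTS chain.
// `stt`, `llm`, and `tts` are injected so each stage (Whisper, OpenRouter,
// VoiSpark) can be swapped or mocked independently.
async function pipeline(audioChunk, { stt, llm, tts }) {
  const text = await stt(audioChunk); // speech-to-text (Whisper)
  const reply = await llm(text);      // persona response (OpenRouter)
  return tts(reply);                  // cloned-voice audio (VoiSpark)
}
```

Keeping the stages decoupled like this also makes the sub‑second target easier to chase: each hop can be measured and optimized on its own.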

Flow

  1. You speak into your phone (no app install — just scan a QR code).
  2. Audio is streamed to the backend via WebSocket.
  3. Whisper transcribes your speech to text.
  4. OpenRouter sends the text to the chosen LLM with a persona prompt.
  5. The LLM response is synthesized by VoiSpark in the cloned voice.
  6. Audio is streamed back — you hear the answer in their voice.

Total round‑trip latency: sub‑second.
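To see why sub‑second is plausible, it helps to give each hop a budget. These numbers are illustrative assumptions, not measurements from the project:

```javascript
// Hypothetical per-stage latency budget (ms) — illustrative, not measured.
const budgetMs = {
  sttWhisper: 300,    // transcription of the utterance
  llmFirstToken: 350, // OpenRouter time-to-first-token
  ttsVoiSpark: 250,   // start of synthesized audio
  transport: 80,      // WebSocket hops both ways
};
const totalMs = Object.values(budgetMs).reduce((a, b) => a + b, 0);
console.log(`estimated round trip: ${totalMs} ms`); // 980 ms, under the 1 s target
```

Streaming is what makes the budget work: each stage hands off partial output instead of waiting for the previous one to fully finish.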

Quick Start

git clone https://github.com/MatheusSimonaci/clone-talking
cd clone-talking
npm install
# Set your API keys in .env
npm start
# Open http://localhost:3000
# Scan the QR code from your phone
# Start talking

You need four free‑tier API keys:

  • OpenAI (Whisper)
  • OpenRouter
  • VoiSpark
  • ngrok
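The `.env` referenced in the quick start would hold those four keys. A hypothetical layout, where the variable names are my assumptions (check the repo's README or `.env.example` for the real ones):

```shell
# Hypothetical .env layout — variable names are assumptions
OPENAI_API_KEY=sk-...        # Whisper speech-to-text
OPENROUTER_API_KEY=...       # LLM routing
VOISPARK_API_KEY=...         # TTS + voice cloning
NGROK_AUTHTOKEN=...          # public tunnel for the QR-code URL
```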

Ethical Considerations

Voice cloning is powerful — and risky. I deliberately chose a TTS provider that explicitly allows synthetic voice generation within their terms of service. Building cool tech should not ignore ethics.

Features

  • Custom voice training (upload your own voice sample)
  • Multi‑language support
  • Conversation memory across sessions
  • Integration with external knowledge bases

Contributing

Contributions are welcome under the MIT License.
