I Built a Full Voice Pipeline on a €399 Edge AI Box (Whisper + Kokoro on Tensor Cores)
Source: Dev.to
We just shipped a feature I’ve been wanting for months: full bidirectional voice on local hardware. No cloud STT or TTS. No API keys for the voice stack. No network round trips for your audio. Just a €399 box on your desk that listens and talks back.
The Setup
Hardware: NVIDIA Jetson Orin Nano (67 TOPS, 1024 CUDA cores, tensor cores)
Speech-to-Text: OpenAI Whisper (runs locally on GPU)
Text-to-Speech: Kokoro TTS (82M parameters, natural human voice)
AI Brain: OpenClaw (connects to Claude/GPT via API)
Power draw: 15 W total
How It Works
You speak → Whisper (tensor cores) → text → AI thinks → Kokoro (tensor cores) → natural voice response
The entire voice loop runs on the NVIDIA tensor cores. Whisper transcribes your speech in real‑time across 90+ languages. Kokoro generates natural human speech with multiple voice options. The AI brain (OpenClaw) handles the thinking.
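The control flow above is simple enough to sketch in a few lines. This is an illustrative stdlib-only sketch, not the actual OpenClaw code: the three stages are injected as callables so the wiring is clear, and in a real deployment they would wrap Whisper, the LLM API, and Kokoro respectively.

```python
# Minimal sketch of the voice loop's control flow. Stage names and
# signatures are illustrative, not OpenClaw's real interfaces.

def voice_loop(audio_in, transcribe, think, speak):
    """Audio bytes in -> audio bytes out, via STT -> LLM -> TTS."""
    text = transcribe(audio_in)   # Whisper on the tensor cores
    reply = think(text)           # OpenClaw -> Claude/GPT (the only cloud hop)
    return speak(reply)           # Kokoro on the tensor cores

# Stub stages so the sketch runs anywhere, even without a GPU:
if __name__ == "__main__":
    out = voice_loop(
        b"\x00" * 16,
        transcribe=lambda audio: "what time is it",
        think=lambda text: f"You asked: {text}",
        speak=lambda reply: reply.encode("utf-8"),
    )
    print(out)
```

On the Jetson, the `transcribe` and `speak` slots are where the GPU-accelerated models plug in; everything else is plain glue code.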
Why Tensor Cores Matter
The Jetson Orin Nano’s Ampere GPU has dedicated tensor cores — specialized hardware for matrix operations that AI models depend on. This means:
- Whisper runs in real‑time (not 10× slower like on a Raspberry Pi)
- Kokoro generates speech faster than real‑time
- Both can run simultaneously with CUDA cores to spare
- All at 15 W — less than a light bulb
Real‑World Usage
I send a voice message on Telegram. My AI assistant:
- Transcribes it locally (Whisper)
- Understands the request (Claude API)
- Takes action (browser automation, email, calendar)
- Responds in natural speech (Kokoro)
- Sends back a voice message on Telegram
The whole loop takes a few seconds. No audio ever leaves the device for transcription.
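The Telegram leg of that loop is just plumbing around the voice pipeline. Here is a stdlib-only sketch of the round trip: the dict shapes mirror the Telegram Bot API (`message.voice.file_id`, `message.chat.id`, and the `sendVoice` payload), while the actual HTTP download/upload and the OpenClaw wiring are elided and stubbed.

```python
# Illustrative sketch of the Telegram voice round trip. Field names follow
# the Telegram Bot API; fetch_audio and run_pipeline are injected stubs
# standing in for the real HTTP client and the local STT->LLM->TTS chain.

def handle_voice_update(update, fetch_audio, run_pipeline):
    msg = update["message"]
    audio_in = fetch_audio(msg["voice"]["file_id"])  # download the OGG voice note
    audio_out = run_pipeline(audio_in)               # all local: Whisper -> LLM -> Kokoro
    return {                                         # payload for the sendVoice method
        "chat_id": msg["chat"]["id"],
        "voice": audio_out,
    }

if __name__ == "__main__":
    reply = handle_voice_update(
        {"message": {"chat": {"id": 42}, "voice": {"file_id": "abc"}}},
        fetch_audio=lambda file_id: b"ogg-bytes",
        run_pipeline=lambda audio: b"reply-audio",
    )
    print(reply["chat_id"])
```

Note that only `fetch_audio` and the final `sendVoice` touch Telegram’s servers; the audio content itself is processed entirely on the box.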
Privacy Angle
Every word you say is processed on your hardware. Your voice data never hits a cloud server for STT/TTS. The only cloud call is to the LLM API — and even that’s optional if you run a local model.
Compare this to Alexa, Siri, or Google Assistant, where utterances are typically uploaded to cloud servers for processing, and often stored and analyzed.
The Numbers
| Component | Model | Size | Speed |
|---|---|---|---|
| STT | Whisper Small | 461 MB | Real‑time |
| TTS | Kokoro‑82M | ~200 MB | Faster than real‑time |
| Total VRAM | – | ~1 GB | Leaves 7 GB free |
| Power | – | – | 15 W total system |
Try It
The hardware is called ClawBox — a pre‑configured Jetson Orin Nano with OpenClaw, Whisper, and Kokoro pre‑installed. Plug in, connect to Telegram, start talking.
Or build your own: grab a Jetson Orin Nano, install OpenClaw, and follow the setup guide.
What’s your local voice setup? Running Piper, XTTS, or something else? Would love to hear what’s working for people.