I Built a Full Voice Pipeline on a €399 Edge AI Box (Whisper + Kokoro on Tensor Cores)
Source: Dev.to
We just shipped a feature I’ve been wanting for months: full bidirectional voice on local hardware. No cloud STT or TTS. No API keys for the voice stack. No network round trips for your audio. Just a €399 box on your desk that listens and talks back.
The Setup
Hardware: NVIDIA Jetson Orin Nano (67 TOPS, 1024 CUDA cores, tensor cores)
Speech-to-Text: OpenAI Whisper (runs locally on GPU)
Text-to-Speech: Kokoro TTS (82M parameters, natural human voice)
AI Brain: OpenClaw (connects to Claude/GPT via API)
Power draw: 15 W total
How It Works
You speak → Whisper (tensor cores) → text → AI thinks → Kokoro (tensor cores) → natural voice response
The entire voice loop runs on the NVIDIA tensor cores. Whisper transcribes your speech in real‑time across 90+ languages. Kokoro generates natural human speech with multiple voice options. The AI brain (OpenClaw) handles the thinking.
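The control flow above is simple enough to sketch in a few lines. This is an illustrative stdlib-only sketch, not the actual OpenClaw code: the three stages are injected as callables so the wiring is clear, and in a real deployment they would wrap Whisper, the LLM API, and Kokoro respectively.

```python
# Minimal sketch of the voice loop's control flow. Stage names and
# signatures are illustrative, not OpenClaw's real interfaces.

def voice_loop(audio_in, transcribe, think, speak):
    """Audio bytes in -> audio bytes out, via STT -> LLM -> TTS."""
    text = transcribe(audio_in)   # Whisper on the tensor cores
    reply = think(text)           # OpenClaw -> Claude/GPT (the only cloud hop)
    return speak(reply)           # Kokoro on the tensor cores

# Stub stages so the sketch runs anywhere, even without a GPU:
if __name__ == "__main__":
    out = voice_loop(
        b"\x00" * 16,
        transcribe=lambda audio: "what time is it",
        think=lambda text: f"You asked: {text}",
        speak=lambda reply: reply.encode("utf-8"),
    )
    print(out)
```

On the Jetson, the `transcribe` and `speak` slots are where the GPU-accelerated models plug in; everything else is plain glue code.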
Why Tensor Cores Matter
The Jetson Orin Nano’s Ampere GPU has dedicated tensor cores — specialized hardware for matrix operations that AI models depend on. This means:
- Whisper runs in real‑time (not 10× slower like on a Raspberry Pi)
- Kokoro generates speech faster than real‑time
- Both can run simultaneously with CUDA cores to spare
- All at 15 W — less than a light bulb
Real‑World Usage
I send a voice message on Telegram. My AI assistant:
- Transcribes it locally (Whisper)
- Understands the request (Claude API)
- Takes action (browser automation, email, calendar)
- Responds in natural speech (Kokoro)
- Sends back a voice message on Telegram
The whole loop takes a few seconds. No audio ever leaves the device for transcription.
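The Telegram leg of that loop is just plumbing around the voice pipeline. Here is a stdlib-only sketch of the round trip: the dict shapes mirror the Telegram Bot API (`message.voice.file_id`, `message.chat.id`, and the `sendVoice` payload), while the actual HTTP download/upload and the OpenClaw wiring are elided and stubbed.

```python
# Illustrative sketch of the Telegram voice round trip. Field names follow
# the Telegram Bot API; fetch_audio and run_pipeline are injected stubs
# standing in for the real HTTP client and the local STT->LLM->TTS chain.

def handle_voice_update(update, fetch_audio, run_pipeline):
    msg = update["message"]
    audio_in = fetch_audio(msg["voice"]["file_id"])  # download the OGG voice note
    audio_out = run_pipeline(audio_in)               # all local: Whisper -> LLM -> Kokoro
    return {                                         # payload for the sendVoice method
        "chat_id": msg["chat"]["id"],
        "voice": audio_out,
    }

if __name__ == "__main__":
    reply = handle_voice_update(
        {"message": {"chat": {"id": 42}, "voice": {"file_id": "abc"}}},
        fetch_audio=lambda file_id: b"ogg-bytes",
        run_pipeline=lambda audio: b"reply-audio",
    )
    print(reply["chat_id"])
```

Note that only `fetch_audio` and the final `sendVoice` touch Telegram’s servers; the audio content itself is processed entirely on the box.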
Privacy Angle
Every word you say is processed on your hardware. Your voice data never hits a cloud server for STT/TTS. The only cloud call is to the LLM API — and even that’s optional if you run a local model.
Compare this to Alexa, Siri, or Google Assistant, where utterances are typically uploaded to cloud servers for processing, and often stored and analyzed.
The Numbers
| Component | Model | Size | Speed |
|---|---|---|---|
| STT | Whisper Small | 461 MB | Real‑time |
| TTS | Kokoro‑82M | ~200 MB | Faster than real‑time |
| Total VRAM | – | ~1 GB | Leaves 7 GB free |
| Power | – | – | 15 W total system |
Try It
The hardware is called ClawBox — a pre‑configured Jetson Orin Nano with OpenClaw, Whisper, and Kokoro pre‑installed. Plug in, connect to Telegram, start talking.
Or build your own: grab a Jetson Orin Nano, install OpenClaw, and follow the setup guide.
What’s your local voice setup? Running Piper, XTTS, or something else? Would love to hear what’s working for people.