I Built a Full Voice Pipeline on a €399 Edge AI Box (Whisper + Kokoro on Tensor Cores)

Published: February 10, 2026 at 03:36 PM EST
3 min read
Source: Dev.to

We just shipped a feature I’ve been wanting for months: full bidirectional voice on local hardware. No cloud for the audio. No API keys for speech. No network round trips for STT or TTS. Just a €399 box on your desk that listens and talks back.

The Setup

Hardware: NVIDIA Jetson Orin Nano (67 TOPS, 1024 CUDA cores, tensor cores)
Speech-to-Text: OpenAI Whisper (runs locally on GPU)
Text-to-Speech: Kokoro TTS (82M params, natural human voice)
AI Brain: OpenClaw (connects to Claude/GPT via API)
Power draw: 15 W total

How It Works

You speak → Whisper (tensor cores) → text → AI thinks → Kokoro (tensor cores) → natural voice response

The entire voice loop runs on the NVIDIA tensor cores. Whisper transcribes your speech in real‑time across 90+ languages. Kokoro generates natural human speech with multiple voice options. The AI brain (OpenClaw) handles the thinking.
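The loop above is easiest to see as three composable stages. Here is a minimal sketch of that shape; the function name `voice_loop` and the stub callables are mine, not from the project — on the real box, `stt` would wrap Whisper, `think` the LLM call, and `tts` Kokoro:

```python
import time

def voice_loop(audio_path, stt, think, tts):
    """One turn of the pipeline: audio in -> audio out.

    stt, think, and tts are injectable callables so the same loop can be
    wired to Whisper, an LLM API, and Kokoro, or to stubs for testing.
    """
    t0 = time.perf_counter()
    text = stt(audio_path)    # e.g. whisper.load_model("small").transcribe(...)["text"]
    reply = think(text)       # e.g. a Claude/GPT API call
    audio = tts(reply)        # e.g. Kokoro synthesising a waveform
    elapsed = time.perf_counter() - t0
    return audio, elapsed

# Wiring it to stubs shows the data flow without needing the models:
audio, elapsed = voice_loop(
    "question.wav",
    stt=lambda path: "what time is it",
    think=lambda text: f"You asked: {text}",
    tts=lambda reply: b"<pcm bytes>",
)
```

Keeping the three stages behind plain callables is also what makes the later privacy point possible: you can swap the `think` stage between a cloud API and a local model without touching the audio path.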

Why Tensor Cores Matter

The Jetson Orin Nano’s Ampere GPU has dedicated tensor cores — specialized hardware for matrix operations that AI models depend on. This means:

  • Whisper runs in real‑time (not 10× slower like on a Raspberry Pi)
  • Kokoro generates speech faster than real‑time
  • Both can run simultaneously with CUDA cores to spare
  • All at 15 W — less than a light bulb
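If you want to confirm your own board benefits from this, tensor cores first appeared in NVIDIA's Volta generation (compute capability 7.0), and the Orin Nano's Ampere GPU reports 8.7. A small helper (mine, for illustration) captures the check:

```python
def has_tensor_cores(compute_capability):
    """Tensor cores debuted in Volta (compute capability 7.0);
    the Orin Nano's Ampere GPU reports (8, 7)."""
    major, minor = compute_capability
    return (major, minor) >= (7, 0)

# On a machine with PyTorch + CUDA you would query the real value:
#   import torch
#   cc = torch.cuda.get_device_capability()   # (8, 7) on Orin Nano
# and run Whisper with fp16=True when has_tensor_cores(cc) is True,
# which is what lets the matrix math land on the tensor cores.
```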

Real‑World Usage

I send a voice message on Telegram. My AI assistant:

  • Transcribes it locally (Whisper)
  • Understands the request (Claude API)
  • Takes action (browser automation, email, calendar)
  • Responds in natural speech (Kokoro)
  • Sends back a voice message on Telegram

The whole loop takes a few seconds. No audio ever leaves the device for transcription.

Privacy Angle

Every word you say is processed on your hardware. Your voice data never hits a cloud server for STT/TTS. The only cloud call is to the LLM API — and even that’s optional if you run a local model.
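To make that LLM call local too, Ollama is one option: it exposes a generate endpoint on `localhost:11434` by default. A sketch of the request payload (the model name `llama3` is just an example, and `build_local_llm_request` is my own helper name):

```python
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_local_llm_request(prompt, model="llama3"):
    """Payload for Ollama's /api/generate endpoint -- a drop-in
    replacement for the cloud LLM call, keeping the whole loop on-device."""
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_local_llm_request("Summarise my unread email.")
# POST it with any HTTP client, e.g.:
#   requests.post(OLLAMA_URL, json=payload, timeout=120).json()["response"]
```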

Compare this to Alexa, Siri, or Google Assistant where every utterance is uploaded, stored, and analyzed.

The Numbers

| Component  | Model         | Size    | Speed                 |
| ---------- | ------------- | ------- | --------------------- |
| STT        | Whisper Small | 461 MB  | Real-time             |
| TTS        | Kokoro-82M    | ~200 MB | Faster than real-time |
| Total VRAM |               | ~1 GB   | Leaves 7 GB free      |
| Power      |               |         | 15 W total system     |

Try It

The hardware is called ClawBox — a pre‑configured Jetson Orin Nano with OpenClaw, Whisper, and Kokoro pre‑installed. Plug in, connect to Telegram, start talking.

Or build your own: grab a Jetson Orin Nano, install OpenClaw, and follow the setup guide.

What’s your local voice setup? Running Piper, XTTS, or something else? Would love to hear what’s working for people.
