I Cloned a Family Voice for My Google Home. Here's the Real Story.

Published: March 23, 2026 at 01:19 PM EDT
Source: Dev.to

The Problem with Cloud TTS for Family Announcements

I was using Sarvam.AI’s Bulbul v3 for Kannada TTS — good quality, but it required a cloud API call for every announcement. For a “wake up, school in 20 minutes” message this added latency and created a dependency on an external service. More importantly, the voice sounded like a stranger. I wanted the house to speak with a familiar voice, so I turned to LuxTTS, an open‑source voice‑cloning model that can generate speech from a 3‑second audio sample.

Attempt 1: Raspberry Pi

I cloned the LuxTTS repository, set up a virtual environment, and installed the dependencies (PyTorch, LinaCodec, piper_phonemize, etc.). The first inference attempt crashed:

Illegal instruction (core dumped)

The pre‑built PyTorch wheels use NEON/SIMD instructions that my Pi’s ARM processor does not support, so the process dies with SIGILL (illegal instruction) as soon as one of them executes. Recompiling PyTorch from source on the Pi would take many hours, which I wasn’t willing to do.
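One way to check this before installing anything is to list the instruction-set features the kernel reports for the CPU. The sketch below is a minimal example, assuming a Linux system that exposes /proc/cpuinfo; the helper name parse_cpu_features is hypothetical. It only shows what the CPU supports and does not know which instructions a given PyTorch wheel was built with.

```python
import platform
from pathlib import Path


def parse_cpu_features(cpuinfo_text: str) -> set[str]:
    """Collect feature flags from /proc/cpuinfo-style text.

    x86 kernels label the line "flags"; ARM kernels label it "Features".
    """
    features: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in {"flags", "features"}:
            features.update(value.split())
    return features


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    print("machine:", platform.machine())
    if cpuinfo.exists():
        found = parse_cpu_features(cpuinfo.read_text())
        print("features:", " ".join(sorted(found)))
```

Comparing that list against the target architecture a wheel was built for can narrow down whether a SIGILL comes from the binary rather than from the model code.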

Conclusion: Keep cloud TTS on the Pi and move on.

Attempt 2: A New x86 Machine

I migrated to a home server (HP EliteDesk 800 G3, Intel i5, 8 GB RAM) with no NVIDIA GPU. LuxTTS supports a CPU‑only inference path, so I tried it there. The installation succeeded and the inference ran without SIGILL:

Generation time: 4.9s
Audio duration:  6.7s

That’s faster than real time (roughly 0.73 seconds of compute per second of audio) on a modest mini‑PC, and acceptable for home announcements.
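The "faster than real time" claim can be checked with the real-time factor (RTF), which is generation time divided by audio duration; any value below 1.0 means synthesis finishes before the clip would finish playing. A small sketch using the two numbers above (the function name is just for illustration):

```python
def real_time_factor(generation_s: float, audio_s: float) -> float:
    """Return generation time divided by audio duration."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


rtf = real_time_factor(4.9, 6.7)
print(f"RTF = {rtf:.2f}")  # prints "RTF = 0.73"
```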

Recording Reference Audio

LuxTTS requires a clean reference clip of at least 3 seconds. I recorded two samples:

  • A natural English sentence captured on a phone microphone.
  • A casual conversation snippet.
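Before handing a clip to the model, it is cheap to confirm that it meets the 3-second minimum. The sketch below uses only Python's standard wave module and assumes the reference has already been converted to an uncompressed PCM WAV file (phone recordings often are not); the function names are hypothetical.

```python
import wave

MIN_REFERENCE_SECONDS = 3.0  # minimum clip length quoted above


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REFERENCE_SECONDS) -> bool:
    """True when the clip is at least `minimum` seconds long."""
    return wav_duration_seconds(path) >= minimum
```

This checks length only; whether the clip is "clean" still has to be judged by ear.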

After experimenting, the configuration that sounded most natural was:

duration = 8     # target duration — affects pacing
rms = 0.01       # amplitude normalization
steps = 6        # diffusion steps — more = better quality, slower
speed = 0.9      # slightly slower than default sounds more natural
t_shift = 0.9    # sampling time shift (as I understand it, not a pitch control)

The default settings produced robotic output; the tuned values above came out of roughly 20 trial runs.
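Trial-and-error tuning like this is easier to keep track of when the candidate settings are enumerated up front. The sketch below is one way to do that, not the procedure actually used for this post: the parameter names come from the configuration above, but the candidate values in the grid are made up for illustration.

```python
from itertools import product


def candidate_configs(grid):
    """Yield one dict per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Hypothetical candidate values; only the parameter names come from the post.
grid = {"steps": [4, 6, 8], "speed": [0.9, 1.0], "t_shift": [0.7, 0.9]}
configs = list(candidate_configs(grid))
print(len(configs))  # prints 12 (3 * 2 * 2 combinations)
```

Each dict can then be fed to one synthesis run and the output saved under a name that encodes the settings, so the clips can be compared by ear afterwards.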

Integration with Google Home

The existing announce script used a fallback chain: cloud TTS → Piper (a local neural TTS engine). I swapped the primary engine for LuxTTS and kept Piper as the fallback:

# Before: cloud_tts() → piper_fallback()
# After:  luxtts(voice_ref) → piper_fallback()

LuxTTS runs locally, generates a WAV file, and the script streams it to the Google Home speaker via catt. Total latency from trigger to playback is about 6–8 seconds, which is fine for family reminders.
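The chain described above can be sketched as a list of engines tried in order. This is an illustration under assumptions, not the actual announce script: each engine is modeled as a hypothetical callable that takes text and returns the path of a WAV file, and the device name passed to catt is whatever the speaker is called on the local network.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Hypothetical engine interface: takes the text, returns a path to a WAV file.
Engine = Callable[[str], str]


def synthesize_with_fallback(text: str, engines: Sequence[Engine]) -> Optional[str]:
    """Try each engine in order and return the first WAV path produced."""
    for engine in engines:
        try:
            return engine(text)
        except Exception as exc:  # any failure moves on to the next engine
            print(f"{getattr(engine, '__name__', engine)} failed: {exc}")
    return None


def cast_to_speaker(wav_path: str, device: str) -> None:
    """Play a local file on a named Chromecast device via the catt CLI."""
    subprocess.run(["catt", "-d", device, "cast", wav_path], check=True)
```

A call would then look like synthesize_with_fallback(message, [luxtts_engine, piper_engine]), where both engine functions are wrappers you write around the respective tools.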

What Actually Works

  • Morning wake‑up calls in the voice of the person who would normally deliver them.
  • Gentle apology messages when a previous wake‑up was too aggressive.
  • Bedtime reminders.

The cloned voice isn’t perfect—there’s a subtle uncanny‑valley effect on unfamiliar sentences—but for short, predictable phrases (“wake up, breakfast is ready”) it’s convincing enough to improve how the announcement lands.

What Doesn’t Work

  • Long sentences (quality degrades past ~15 words).
  • Non‑English phrases; the model wasn’t trained on code‑mixed speech, so Kannada‑English mixes become garbled.
  • Cold starts—the LuxTTS model loading takes ~8 seconds the first time. I keep it warm by running a silent inference at startup.
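The warm-up trick in the last item can be sketched as a small wrapper that loads the model once and immediately runs a throwaway inference. The loader here is a hypothetical function returning a callable model; the real LuxTTS loading and inference calls are not shown. Note that this only helps inside a long-running process, since a script started fresh for every announcement pays the load time each run.

```python
import threading


class WarmModel:
    """Load a model once and run one throwaway inference to warm it up."""

    def __init__(self, loader, warmup_text="ok"):
        self._loader = loader            # hypothetical: returns a callable model
        self._warmup_text = warmup_text
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        """Return the warmed model, loading it on first use only."""
        with self._lock:
            if self._model is None:
                model = self._loader()
                model(self._warmup_text)  # result discarded; pays the one-time cost
                self._model = model
            return self._model
```

Calling get() once at service startup moves the delay away from the first real announcement.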

For Kannada‑specific messages, Sarvam Bulbul v3 remains the better choice; LuxTTS is English‑only at this point.
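These limits suggest routing each message to an engine before synthesis. The sketch below is a hypothetical heuristic, not part of the setup described here: it sends text containing Kannada script (Unicode block U+0C80 to U+0CFF) to the cloud engine, sends messages over the rough 15-word limit to Piper, and gives everything else to LuxTTS. It will not catch Kannada written in Latin letters.

```python
MAX_CLONED_WORDS = 15  # rough quality limit quoted above


def contains_kannada(text: str) -> bool:
    """True if any character falls in the Kannada Unicode block."""
    return any("\u0c80" <= ch <= "\u0cff" for ch in text)


def pick_engine(text: str) -> str:
    """Return a label for the engine to use; the labels are illustrative."""
    if contains_kannada(text):
        return "cloud"
    if len(text.split()) > MAX_CLONED_WORDS:
        return "piper"
    return "luxtts"
```

Sending long messages to Piper is an assumption made for the example; splitting them into shorter sentences for LuxTTS would be another option.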

Architecture Overview

Cron trigger
    │
    ▼
announce.py
    ├── luxtts (local, voice‑cloned, English) ─────┐
    │   └── voices/reference.wav                    │
    └── piper (local, neural, fallback) ────────────┤
                                                    ▼
                                          catt → Google Home

Takeaways

  • SIGILL is a PyTorch wheel problem, not a model problem. On ARM devices, verify that the wheel matches your instruction set before assuming the model is broken.
  • CPU‑only inference is viable for short audio. A 4.9 s generation time for a 6.7 s clip is perfectly acceptable for home automation; a GPU isn’t required.
  • Voice‑cloning configuration matters more than raw model quality. Default settings yield mediocre results; tweaking speed, duration, steps, etc., can dramatically improve output.
  • Build a fallback. LuxTTS can produce artifacts on unusual phoneme combinations. Having Piper as a fallback ensures the speaker always says something, even if quality varies.

The Google Home now sounds like home. That’s the win.
