What Building Voxitale for the Gemini Live Contest Taught Me About Working With Multiple AI Tools

Published: (March 13, 2026 at 12:41 AM EDT)
5 min read
Source: Dev.to

Voxitale – A Voice‑First Storytelling App

Built for the Gemini Live contest

What is Voxitale?

  • Audience: Young children
  • Interaction: A child talks directly to a character named Amelia in the browser.
  • Experience:
    • The child guides the adventure out loud.
    • Illustrated scenes appear as the story unfolds.
    • At the end, the system produces a short story‑book‑style movie based on the session.

The “Strange” Part

My favorite moment during the entire project was fixing the Wi‑Fi on my Raspberry Pi.

Let me explain.

My Philosophy

I hate consultant‑style talk—polished language that sounds impressive but says nothing.
So I won’t pretend this was an elegant engineering journey. It was:

  • Messy
  • Fast
  • Tool‑heavy (a pile of AI tools)

Some parts were genuinely exciting; others felt like moving logs between terminals for hours.
Nevertheless, I actually learned something useful.

The Experience (User‑Facing)

  1. Voice Interaction – A child speaks to Amelia; the system generates illustrated scenes and narration in real time.
  2. Story Progression – Pages are generated on‑the‑fly and later assembled into a story‑book‑style experience.
  3. Parental Controls – Mood, pacing, narrator voice, optional smart‑lighting effects.

Goal: Make storytelling interactive instead of passive.

Why Gemini Live?

I entered the Gemini Live contest to have an excuse to build something around live interaction.

  • Previous work: A Gemini Live‑powered customer‑service prototype (RAG lookups, site navigation, video control).
  • Result: It worked, but it didn’t excite me.
  • New focus: Interactive storytelling, which demands presence, quick responses, interruption handling, and sustained illusion.

A contract fell through, giving me the one thing most side projects never get: uninterrupted time. I thought I could manage because I’d already touched Gemini Live—I was wrong. Real‑time storytelling is much harder than it looks.

System Overview

Voxitale became a system with two very different tempos running simultaneously:

| Tempo | Description |
|-------|-------------|
| Live conversation loop | Child speaks → audio streamed via WebSocket → FastAPI backend → Google ADK live agent (Gemini native audio) → Amelia responds in real time. |
| Creative generation pipeline | As the story evolves, the system generates illustrated scenes, captions, optional ElevenLabs narration, and Home Assistant lighting effects. At session end, everything is assembled into a short story‑book movie. |

Key Requirements

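  • Low‑latency voice interaction for the live conversation loop
  • Slower media generation for the creative pipeline, running alongside the voice loop rather than blocking it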
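
One way to picture the two tempos is a fast loop that hands work to a slow worker through a queue. The sketch below is only an illustration under stated assumptions: the names `conversation_loop`, `media_worker`, and `run_session` are hypothetical, the string reply stands in for the live agent, and the `asyncio.sleep` call stands in for image generation. It is not Voxitale's actual code.

```python
import asyncio


async def conversation_loop(utterances, scene_queue):
    """Fast path: answer each utterance right away, then hand off scene work."""
    replies = []
    for turn, text in enumerate(utterances):
        # Hypothetical stand-in for the live agent's spoken reply.
        replies.append(f"Amelia replies to: {text}")
        # put_nowait on an unbounded queue returns immediately,
        # so the voice path never waits on media generation.
        scene_queue.put_nowait((turn, text))
    scene_queue.put_nowait(None)  # sentinel: the session has ended
    return replies


async def media_worker(scene_queue, delay=0.01):
    """Slow path: build one illustrated scene per queued turn."""
    scenes = []
    while True:
        job = await scene_queue.get()
        if job is None:
            return scenes
        turn, text = job
        await asyncio.sleep(delay)  # hypothetical stand-in for image generation
        scenes.append({"turn": turn, "caption": text})


async def run_session(utterances):
    """Run both tempos concurrently and collect their results."""
    scene_queue = asyncio.Queue()
    replies, scenes = await asyncio.gather(
        conversation_loop(utterances, scene_queue),
        media_worker(scene_queue),
    )
    return replies, scenes
```

Calling `asyncio.run(run_session(["a dragon", "a castle"]))` returns the replies and the scenes as two separate lists. The point of the design is that the conversation side only enqueues work, so a slow scene delays later pages but not Amelia's next answer.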

Architecture Diagram (textual)

+-------------------+        WebSocket        +-----------------------+
| Browser (React /  | <---------------------> | FastAPI (Cloud Run)   |
| Next.js)          |                         | - WS handling         |
| - Audio worklet   |                         | - Orchestration       |
| - UI              |                         |                       |
+-------------------+                         +-----------------------+
          |                                               |
          |                                               |
          v                                               v
+-------------------+       Gemini Live       +-----------------------+
| Google ADK Agent  | <---------------------> | Gemini Live /         |
| (Vertex models)   |                         | Vertex AI             |
+-------------------+                         +-----------------------+
          |                                               |
          |   Media Generation (ElevenLabs,               |
          |   Image models, Home Assistant)               |
          v                                               v
+-------------------+         Storage         +-----------------------+
| Cloud Storage     | <---------------------> | Firestore (metadata)  |
+-------------------+                         +-----------------------+
          |
          v
+-------------------+
| Cloud Run Job     |
| (MP4 assembly)    |
+-------------------+

  • Frontend: React/Next.js captures microphone audio using Audio Worklets and streams it over WebSockets.
  • Backend: Google Cloud Run hosts FastAPI, managing WS connections, API routing, and session orchestration.
  • Agent Layer: Google ADK runs Gemini Live + Vertex models, handling storytelling logic, prompt rules, and tool execution.
  • Assets: Generated scenes & assets → Google Cloud Storage; session metadata & feedback → Firestore.
  • Final Output: Cloud Run job assembles everything into an MP4 story‑book video.

Development Workflow & AI Tools

| Tool | Role |
|------|------|
| Google Antigravity (Gemini Pro / Flash) | Front‑end UI ideas, feature brainstorming |
| OpenAI Codex (GPT‑5.4) | Majority of backend work, debugging |
| Anthropic Opus & Sonnet | Early‑stage development, prototyping |
| Gemini Live | Powered the product experience (not the code) |

I basically vibe‑coded large parts of the system. AI coding tools are only as good as the context you give them.

Context Management

  • Pulled documentation (WebSockets, Gemini Live, Google ADK, reconnect logic, streaming pipelines) into a docs/ folder so models could reference it.
  • Logging became critical.

Debugging Loop

  1. Explain the issue.
  2. Provide backend logs.
  3. Provide frontend logs.
  4. Let the model analyze the failure.
  5. Test the fix.

AI made debugging faster, but it was still debugging.
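
Steps 2 and 3 of the loop are easier on the model when both logs arrive as one timeline. Here is a minimal sketch of that idea, assuming each log line starts with an ISO-8601 timestamp followed by a space. The function name `merge_logs` and that line format are assumptions for illustration, not the format Voxitale actually logs in.

```python
from datetime import datetime


def merge_logs(backend_lines, frontend_lines):
    """Interleave two logs by their leading ISO-8601 timestamp."""
    tagged = []
    for source, lines in (("backend", backend_lines), ("frontend", frontend_lines)):
        for line in lines:
            # Split "<timestamp> <message>" at the first space.
            stamp, _, message = line.partition(" ")
            tagged.append((datetime.fromisoformat(stamp), source, message))
    # Sort on the timestamp only; ties keep their original order.
    tagged.sort(key=lambda item: item[0])
    return [f"{ts.isoformat()} [{source}] {message}" for ts, source, message in tagged]
```

With made-up sample lines, a frontend entry stamped `2026-03-01T10:00:01` lands before a backend entry stamped `2026-03-01T10:00:02`, each tagged with its source. Pasting the merged list into the prompt shows the model the order of events across both sides of the WebSocket.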

The Hardest Part: System Stability

When people hear “interactive storyteller,” they imagine the fun bits:

  • Character voices
  • Illustrations
  • Kids guiding the plot

Reality: The real work is everything underneath. From an architecture perspective, two systems must stay synchronized:

  1. Real‑time conversational system
  2. Creative media generation pipeline

If they drift apart, the experience collapses.
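
One possible guard against that drift is to stamp every media job with the conversation turn that requested it, and to ignore results that arrive after the story has moved on. The sketch below assumes that policy purely for illustration. `StorySession`, `advance`, and `accept_scene` are hypothetical names, and a real system might prefer to keep late scenes for the final movie instead of dropping them.

```python
from dataclasses import dataclass, field


@dataclass
class StorySession:
    """Tracks the live turn counter and the pages shown so far."""
    current_turn: int = 0
    pages: list = field(default_factory=list)

    def advance(self):
        """Called when the child speaks again; returns the new turn number."""
        self.current_turn += 1
        return self.current_turn

    def accept_scene(self, turn, image_url):
        """Keep a finished scene only if the story has not moved past it."""
        if turn != self.current_turn:
            return False  # stale: generated for an earlier turn
        self.pages.append((turn, image_url))
        return True
```

A scene requested on turn 1 that finishes after turn 2 has started is rejected, while the turn 2 scene is added to `pages`. The check is cheap, and it keeps the slow pipeline from overwriting what the fast loop is currently talking about.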

The Raspberry Pi Wi‑Fi Saga

  • Needed an old Raspberry Pi for the Home Assistant integration.
  • After upgrading the Pi, Wi‑Fi stopped working.
  • Spent ~4 hours debugging.
  • Discovered the issue: running a 32‑bit OS instead of the required 64‑bit version.

Fixing that Wi‑Fi problem became my favorite moment—proof that even a tiny, messy detail can feel like a triumph in a chaotic project.

Takeaways

  • Real‑time interactive storytelling is a blend of low‑latency voice pipelines and heavyweight media generation.
  • AI coding assistants accelerate development, but solid documentation and logging are still essential.
  • System synchronization is the linchpin; any drift quickly breaks the user experience.