What Building Voxitale for the Gemini Live Contest Taught Me About Working With Multiple AI Tools

Published: March 13, 2026 at 12:41 AM EDT

5 min read

Source: Dev.to

Voxitale – A Voice‑First Storytelling App

Built for the Gemini Live contest

What is Voxitale?

Audience: Young children
Interaction: A child talks directly to a character named Amelia in the browser.
Experience:
- The child guides the adventure out loud.
- Illustrated scenes appear as the story unfolds.
- At the end, the system produces a short story‑book‑style movie based on the session.

The “Strange” Part

My favorite moment during the entire project was fixing the Wi‑Fi on my Raspberry Pi.

Let me explain.

My Philosophy

I hate consultant‑style talk—polished language that sounds impressive but says nothing.
So I won’t pretend this was an elegant engineering journey. It was:

Messy
Fast
Tool‑heavy (a pile of AI tools)

Some parts were genuinely exciting; others felt like moving logs between terminals for hours.
Nevertheless, I actually learned something useful.

The Experience (User‑Facing)

Voice Interaction – A child speaks to Amelia; the system generates illustrated scenes and narration in real time.
Story Progression – Pages are generated on‑the‑fly and later assembled into a story‑book‑style experience.
Parental Controls – Mood, pacing, narrator voice, optional smart‑lighting effects.

Goal: Make storytelling interactive instead of passive.

Why Gemini Live?

I entered the Gemini Live contest to have an excuse to build something around live interaction.

Previous work: A Gemini Live‑powered customer‑service prototype (RAG lookups, site navigation, video control).
Result: It worked, but it didn’t excite me.
New focus: Interactive storytelling, which demands presence, quick responses, interruption handling, and sustained illusion.

A contract fell through, giving me the one thing most side projects never get: uninterrupted time. Having already worked with Gemini Live, I assumed the build would be manageable. I was wrong: real‑time storytelling is much harder than it looks.

System Overview

Voxitale became a system with two very different tempos running simultaneously:

Tempo	Description
Live conversation loop	Child speaks → audio streamed via WebSocket → FastAPI backend → Google ADK live agent (Gemini native audio) → Amelia responds in real time.
Creative generation pipeline	As the story evolves, the system generates illustrated scenes, captions, optional ElevenLabs narration, and Home Assistant lighting effects. At session end, everything is assembled into a short story‑book movie.

Key Requirements

Low‑latency voice interaction for the live conversation loop
Slower media generation that runs alongside it without stalling the conversation
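
To make the two tempos concrete, here is a minimal sketch using Python's asyncio. It is not Voxitale's actual code: `generate_scene` and `run_session` are hypothetical names, and the short sleep stands in for a slow model call. The point is that each reply is produced immediately, while scene generation is scheduled in the background and only collected at session end.

```python
import asyncio


async def generate_scene(page: int, text: str, delay: float = 0.05) -> dict:
    """Hypothetical stand-in for slow media generation (image, caption, narration)."""
    await asyncio.sleep(delay)  # simulates a slow model call
    return {"page": page, "caption": text}


async def run_session(utterances: list[str]) -> tuple[list[str], list[dict]]:
    """Reply to each utterance right away; generate scenes in the background."""
    replies: list[str] = []
    pending: list[asyncio.Task] = []
    for page, text in enumerate(utterances, start=1):
        # Fast tempo: the reply never waits for media.
        replies.append(f"Amelia heard: {text}")
        # Slow tempo: schedule media generation and move on.
        pending.append(asyncio.create_task(generate_scene(page, text)))
    # Session end: collect every scene, in page order, for the storybook.
    scenes = await asyncio.gather(*pending)
    return replies, list(scenes)


if __name__ == "__main__":
    replies, scenes = asyncio.run(run_session(["a dragon", "a castle"]))
    print(replies)
    print(scenes)
```

In a real backend the fast path would be the streaming audio loop rather than a string reply, but the split is the same: nothing on the conversational path awaits the media pipeline.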

Architecture Diagram (textual)

+-------------------+        WebSocket        +----------------------+
| Browser (React    | <---------------------> | FastAPI (Cloud Run)  |
| / Next.js)        |                         | - WS handling        |
| - Audio worklet   |                         | - Orchestration      |
| - UI              |                         |                      |
+-------------------+                         +----------------------+
          |                                               |
          v                                               v
+-------------------+       Gemini Live       +----------------------+
| Google ADK Agent  | <---------------------> | Gemini Live /        |
| (Vertex models)   |                         | Vertex AI            |
+-------------------+                         +----------------------+
          |                                               |
          |      Media Generation (ElevenLabs,            |
          |      Image models, Home Assistant)            |
          v                                               v
+-------------------+         Storage         +----------------------+
| Cloud Storage     | <---------------------> | Firestore (metadata) |
+-------------------+                         +----------------------+
          |
          v
+-------------------+
| Cloud Run Job     |
| (MP4 assembly)    |
+-------------------+

Frontend: React/Next.js captures microphone audio using Audio Worklets and streams it over WebSockets.
Backend: Google Cloud Run hosts FastAPI, managing WS connections, API routing, and session orchestration.
Agent Layer: Google ADK runs Gemini Live + Vertex models, handling storytelling logic, prompt rules, and tool execution.
Assets: Generated scenes & assets → Google Cloud Storage; session metadata & feedback → Firestore.
Final Output: Cloud Run job assembles everything into an MP4 story‑book video.

Development Workflow & AI Tools

Tool	Role
Google Antigravity (Gemini Pro / Flash)	Front‑end UI ideas, feature brainstorming
OpenAI Codex (GPT‑5.4)	Majority of backend work, debugging
Anthropic Opus & Sonnet	Early‑stage development, prototyping
Gemini Live	Powered the product experience (not the code)

I basically vibe‑coded large parts of the system. AI coding tools are only as good as the context you give them.

Context Management

Pulled documentation (WebSockets, Gemini Live, Google ADK, reconnect logic, streaming pipelines) into a docs/ folder so models could reference it.
Logging became critical: backend and frontend logs were the evidence I handed to the models whenever something broke.
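
One way to make logs easy to hand to a model is to emit one JSON object per line and tag every line with a session ID. The sketch below uses only Python's standard logging module. The field names, the `backend.ws` logger name, and the `session_id` value are assumptions for illustration, not Voxitale's actual log format.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line so logs are easy to grep and paste."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "source": record.name,
            # session_id is attached per call via logging's `extra` argument.
            "session_id": getattr(record, "session_id", None),
            "message": record.getMessage(),
        })


def make_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("backend.ws")
    log.info("audio stream reconnected", extra={"session_id": "demo-123"})
```

If the frontend logs the same session ID, the two log streams can be lined up for a single failing session.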

Debugging Loop

Explain the issue.
Provide backend logs.
Provide frontend logs.
Let the model analyze the failure.
Test the fix.

AI made debugging faster, but it was still debugging.
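
The first three steps of that loop amount to assembling one block of text. As a rough sketch, a small helper could do it. `build_debug_prompt` and its section labels are hypothetical, and trimming each log to its last lines is my assumption about keeping the context small.

```python
def build_debug_prompt(issue: str, backend_log: str, frontend_log: str,
                       max_lines: int = 50) -> str:
    """Bundle the issue description and the tail of both logs into one text block."""

    def tail(text: str) -> str:
        # Keep only the most recent lines; older output is usually noise.
        return "\n".join(text.strip().splitlines()[-max_lines:])

    return "\n\n".join([
        f"Issue:\n{issue.strip()}",
        f"Backend log (last {max_lines} lines):\n{tail(backend_log)}",
        f"Frontend log (last {max_lines} lines):\n{tail(frontend_log)}",
    ])


if __name__ == "__main__":
    print(build_debug_prompt(
        "Audio cuts out after an interruption",
        "ws connected\nws closed unexpectedly",
        "mic started\nsocket error",
    ))
```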

The Hardest Part: System Stability

When people hear “interactive storyteller,” they imagine the fun bits:

Character voices
Illustrations
Kids guiding the plot

Reality: The real work is everything underneath. From an architecture perspective, two systems must stay synchronized:

Real‑time conversational system
Creative media generation pipeline

If they drift apart, the experience collapses.
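
One simple guard against drift is to stamp every media job with the story turn it was requested for, and drop results that arrive after the conversation has moved on. This is a sketch of that idea, not how Voxitale actually does it. `StoryState`, `accept_scene`, and the `max_lag` threshold are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoryState:
    """Tracks the live story's current turn and the scenes accepted so far."""
    turn: int = 0
    scenes: dict[int, str] = field(default_factory=dict)

    def advance(self) -> int:
        """Called by the live loop whenever the story moves to a new turn."""
        self.turn += 1
        return self.turn

    def accept_scene(self, for_turn: int, image_ref: str, max_lag: int = 1) -> bool:
        """Keep a finished scene only if the story has not moved too far past it."""
        if self.turn - for_turn > max_lag:
            return False  # the conversation drifted; drop the stale scene
        self.scenes[for_turn] = image_ref
        return True


if __name__ == "__main__":
    state = StoryState()
    first = state.advance()
    print(state.accept_scene(first, "scene-1.png"))  # arrives on time
    state.advance()
    state.advance()
    print(state.accept_scene(first, "late.png"))     # arrives two turns late
```

Whether a late scene should be dropped, queued, or shown anyway is a product decision. The useful part is that the turn number makes the drift measurable.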

The Raspberry Pi Wi‑Fi Saga

Needed an old Raspberry Pi for the Home Assistant integration.
After upgrading the Pi, Wi‑Fi stopped working.
Spent ~4 hours debugging.
Discovered the issue: running a 32‑bit OS instead of the required 64‑bit version.
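
In hindsight, a quick bitness check would have saved time. The sketch below is not the diagnosis I actually ran. It reports the pointer size of the running Python interpreter, which reflects the userland it was built for, alongside the kernel architecture. The function names are hypothetical. On a Raspberry Pi the two can disagree, since a 32‑bit userland can run on a 64‑bit kernel, so the kernel architecture alone can be misleading.

```python
import platform
import struct


def userland_bits() -> int:
    """Pointer size of the running Python interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def describe_system() -> str:
    """Summarize the kernel architecture next to the interpreter's bitness."""
    return f"kernel arch: {platform.machine()}, userland: {userland_bits()}-bit"


if __name__ == "__main__":
    print(describe_system())
```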

Fixing that Wi‑Fi problem became my favorite moment—proof that even a tiny, messy detail can feel like a triumph in a chaotic project.

Takeaways

Real‑time interactive storytelling is a blend of low‑latency voice pipelines and heavyweight media generation.
AI coding assistants accelerate development, but solid documentation and logging are still essential.
System synchronization is the linchpin; any drift quickly breaks the user experience.
