What Building Voxitale for the Gemini Live Contest Taught Me About Working With Multiple AI Tools
Source: Dev.to
Voxitale – A Voice‑First Storytelling App
Built for the Gemini Live contest
What is Voxitale?
- Audience: Young children
- Interaction: A child talks directly to a character named Amelia in the browser.
- Experience:
  - The child guides the adventure out loud.
  - Illustrated scenes appear as the story unfolds.
  - At the end, the system produces a short story-book-style movie based on the session.
The “Strange” Part
My favorite moment during the entire project was fixing the Wi‑Fi on my Raspberry Pi.
Let me explain.
My Philosophy
I hate consultant‑style talk—polished language that sounds impressive but says nothing.
So I won’t pretend this was an elegant engineering journey. It was:
- Messy
- Fast
- Tool‑heavy (a pile of AI tools)
Some parts were genuinely exciting; others felt like moving logs between terminals for hours.
Nevertheless, I actually learned something useful.
The Experience (User‑Facing)
- Voice Interaction – A child speaks to Amelia; the system generates illustrated scenes and narration in real time.
- Story Progression – Pages are generated on‑the‑fly and later assembled into a story‑book‑style experience.
- Parental Controls – Mood, pacing, narrator voice, optional smart‑lighting effects.
Goal: Make storytelling interactive instead of passive.
Why Gemini Live?
I entered the Gemini Live contest to have an excuse to build something around live interaction.
- Previous work: A Gemini Live‑powered customer‑service prototype (RAG lookups, site navigation, video control).
- Result: It worked, but it didn’t excite me.
- New focus: Interactive storytelling, which demands presence, quick responses, interruption handling, and sustained illusion.
A contract fell through, giving me the one thing most side projects never get: uninterrupted time. Because I had already worked with Gemini Live, I assumed I could manage. I was wrong: real-time storytelling is much harder than it looks.
System Overview
Voxitale became a system with two very different tempos running simultaneously:
| Tempo | Description |
|---|---|
| Live conversation loop | Child speaks → audio streamed via WebSocket → FastAPI backend → Google ADK live agent (Gemini native audio) → Amelia responds in real time. |
| Creative generation pipeline | As the story evolves, the system generates illustrated scenes, captions, optional ElevenLabs narration, and Home Assistant lighting effects. At session end, everything is assembled into a short story‑book movie. |
Key Requirements
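- Low-latency voice interaction: the conversation loop has to answer in real time.
- Slower media generation: scenes, narration, and lighting can take longer, as long as they never stall the conversation.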
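One way to serve both tempos is to make sure the voice loop never awaits media work directly. The following is a minimal sketch under that assumption, using only `asyncio`; it is not Voxitale's actual code. The names `handle_turn`, `media_worker`, and `run_session`, and the sleep standing in for image generation, are all hypothetical.

```python
import asyncio


async def media_worker(jobs: asyncio.Queue, finished: list) -> None:
    """Slow tempo: drain scene jobs in the background."""
    while True:
        scene = await jobs.get()
        await asyncio.sleep(0.05)  # stand-in for image/narration generation
        finished.append(scene)
        jobs.task_done()


async def handle_turn(utterance: str, jobs: asyncio.Queue) -> str:
    """Fast tempo: reply right away and only enqueue the slow work."""
    jobs.put_nowait(f"scene for: {utterance}")
    return f"Amelia replies to: {utterance}"


async def run_session(utterances: list) -> tuple:
    jobs: asyncio.Queue = asyncio.Queue()
    finished: list = []
    worker = asyncio.create_task(media_worker(jobs, finished))

    # Every reply is produced without waiting for any scene to render.
    replies = [await handle_turn(u, jobs) for u in utterances]

    await jobs.join()  # end of session: wait for the remaining media
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return replies, finished
```

Calling `asyncio.run(run_session(["a dragon", "a castle"]))` returns both replies plus the two finished scenes. The point of the design is that a slow image model delays the picture, never Amelia's answer.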
Architecture Diagram (textual)
+-------------------+   WebSocket   +----------------------+
| Browser (React    |               | FastAPI (Cloud Run)  |
| / Next.js)        |               | - WS handling        |
| - Audio worklet   |               | - Orchestration      |
| - UI              |               |                      |
+-------------------+               +----------------------+
          |                                     |
          |                                     |
          v                                     v
+-------------------+  Gemini Live  +----------------------+
| Google ADK Agent  |               | Gemini Live /        |
| (Vertex models)   |               | Vertex AI            |
+-------------------+               +----------------------+
          |                                     |
          |    Media Generation (ElevenLabs,    |
          |    Image models, Home Assistant)    |
          v                                     v
+-------------------+    Storage    +----------------------+
| Cloud Storage     |               | Firestore (metadata) |
+-------------------+               +----------------------+
          |
          v
+-------------------+
| Cloud Run Job     |
| (MP4 assembly)    |
+-------------------+

- Frontend: React/Next.js captures microphone audio using Audio Worklets and streams it over WebSockets.
- Backend: Google Cloud Run hosts FastAPI, managing WS connections, API routing, and session orchestration.
- Agent Layer: Google ADK runs Gemini Live + Vertex models, handling storytelling logic, prompt rules, and tool execution.
- Assets: Generated scenes & assets → Google Cloud Storage; session metadata & feedback → Firestore.
- Final Output: Cloud Run job assembles everything into an MP4 story‑book video.
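To make the hand-off between storage and the final assembly job concrete, here is a small sketch of the kind of page manifest such a job could consume. Everything in it is an assumption for illustration: the `Page` fields, the `build_manifest` helper, and the example URIs are hypothetical and do not describe Voxitale's real schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    index: int                    # position of the page in the story
    caption: str
    image_uri: Optional[str]      # hypothetical storage URI; None if no image exists
    narration_uri: Optional[str] = None


def build_manifest(pages: list) -> list:
    """Return pages in story order, keeping only those that have an image."""
    ready = [p for p in pages if p.image_uri is not None]
    return sorted(ready, key=lambda p: p.index)


# Pages may arrive out of order, and some may have failed to render.
pages = [
    Page(2, "The dragon waves goodbye", "gs://example-bucket/2.png"),
    Page(0, "Amelia finds a map", "gs://example-bucket/0.png"),
    Page(1, "A bridge appears", None),
]
manifest = build_manifest(pages)
```

Sorting by an explicit index rather than by arrival time matters here, because the media pipeline can finish scenes in a different order than the story produced them.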
Development Workflow & AI Tools
| Tool | Role |
|---|---|
| Google Antigravity (Gemini Pro / Flash) | Front-end UI ideas, feature brainstorming |
| OpenAI Codex (GPT‑5.4) | Majority of backend work, debugging |
| Anthropic Opus & Sonnet | Early‑stage development, prototyping |
| Gemini Live | Powered the product experience (not the code) |
I basically vibe‑coded large parts of the system. AI coding tools are only as good as the context you give them.
Context Management
- Pulled documentation (WebSockets, Gemini Live, Google ADK, reconnect logic, streaming pipelines) into a docs/ folder so models could reference it.
- Logging became critical.
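Logs are easiest to hand to a model when every line carries the session it belongs to. As a hedged illustration using only Python's standard `logging` module, the sketch below emits one JSON object per event. The logger name, the field names, and the `make_session_logger` helper are assumptions, not the project's real setup.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": record.created,
            "level": record.levelname,
            "session_id": getattr(record, "session_id", None),
            "event": record.getMessage(),
        })


def make_session_logger(session_id: str, stream) -> logging.LoggerAdapter:
    """Build a logger that stamps every record with the given session id."""
    logger = logging.getLogger(f"voxitale.{session_id}")  # hypothetical name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logging.LoggerAdapter(logger, {"session_id": session_id})


buffer = io.StringIO()  # stand-in for stdout or a log file
log = make_session_logger("demo-session", buffer)
log.info("ws_reconnect attempt=%d", 2)
```

With the session id on every line, filtering a failing session out of a busy log becomes a one-line search before pasting it into a prompt.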
Debugging Loop
- Explain the issue.
- Provide backend logs.
- Provide frontend logs.
- Let the model analyze the failure.
- Test the fix.
AI made debugging faster, but it was still debugging.
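The "provide backend logs, provide frontend logs" steps get easier if both streams are merged into one timeline first. The sketch below is an assumption about how that could be done, not a description of the tooling used: `merge_logs` is a hypothetical helper, and it expects each side to be a time-sorted list of `(timestamp, message)` tuples.

```python
import heapq


def merge_logs(backend: list, frontend: list) -> list:
    """Interleave two time-sorted lists of (timestamp, message) tuples."""
    tagged_backend = ((ts, "backend", msg) for ts, msg in backend)
    tagged_frontend = ((ts, "frontend", msg) for ts, msg in frontend)
    merged = heapq.merge(tagged_backend, tagged_frontend)
    return [f"{ts:.3f} [{source}] {msg}" for ts, source, msg in merged]


backend_log = [(1.0, "ws accepted"), (3.0, "agent reply sent")]
frontend_log = [(0.5, "mic started"), (2.0, "audio chunk sent")]
timeline = merge_logs(backend_log, frontend_log)
```

The resulting timeline shows cause and effect across the WebSocket boundary in a single block of text, which is usually what a model needs to spot an ordering bug.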
The Hardest Part: System Stability
When people hear “interactive storyteller,” they imagine the fun bits:
- Character voices
- Illustrations
- Kids guiding the plot
Reality: The real work is everything underneath. From an architecture perspective, two systems must stay synchronized:
- Real‑time conversational system
- Creative media generation pipeline
If they drift apart, the experience collapses.
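One simple guard against drift is to stamp each media job with the story turn that requested it, then check that stamp when the result comes back. This is a sketch of that idea under stated assumptions, not the synchronization logic Voxitale actually uses: `StoryState`, `max_lag`, and the image names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoryState:
    current_turn: int = 0
    max_lag: int = 1  # hypothetical tolerance, measured in story turns
    pages: list = field(default_factory=list)

    def advance(self) -> int:
        """Called by the live loop each time the story moves forward."""
        self.current_turn += 1
        return self.current_turn

    def accept_scene(self, turn: int, image_ref: str) -> bool:
        """Called by the media pipeline when a scene finishes rendering."""
        if self.current_turn - turn > self.max_lag:
            return False  # too stale: the conversation has moved on
        self.pages.append((turn, image_ref))
        return True


state = StoryState()
first_turn = state.advance()  # turn 1
state.advance()               # turn 2
state.advance()               # turn 3
late = state.accept_scene(first_turn, "scene-1.png")  # two turns behind
fresh = state.accept_scene(3, "scene-3.png")
```

Whether a stale scene should be dropped entirely or only hidden from the live view and kept for the final movie is a design choice; the sketch only shows where that decision can be made.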
The Raspberry Pi Wi‑Fi Saga
- Needed an old Raspberry Pi for the Home Assistant integration.
- After upgrading the Pi, Wi‑Fi stopped working.
- Spent ~4 hours debugging.
- Discovered the issue: running a 32‑bit OS instead of the required 64‑bit version.
Fixing that Wi‑Fi problem became my favorite moment—proof that even a tiny, messy detail can feel like a triumph in a chaotic project.
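For anyone who wants to rule out the same mismatch quickly, a check like the one below can help. It is a minimal sketch using only the standard library, and it is not how the problem was originally diagnosed. It reports the pointer width of the running Python build, which usually (though not always) matches the OS userland, alongside the kernel architecture string.

```python
import platform
import struct


def pointer_width_bits() -> int:
    """Width of a native pointer for the running Python build (32 or 64)."""
    return struct.calcsize("P") * 8


def describe_platform() -> str:
    """Summarize kernel architecture and userland pointer width."""
    return (
        f"kernel arch: {platform.machine()}, "
        f"userland pointer width: {pointer_width_bits()}-bit"
    )


summary = describe_platform()
```

A 64-bit kernel can run a 32-bit userland, so the two values can disagree; seeing them side by side is the useful part.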
Takeaways
- Real‑time interactive storytelling is a blend of low‑latency voice pipelines and heavyweight media generation.
- AI coding assistants accelerate development, but solid documentation and logging are still essential.
- System synchronization is the linchpin; any drift quickly breaks the user experience.