Building OmniGuide AI — A Real-Time Visual Assistant with Gemini Live
Source: Dev.to
Introduction
What if AI could see what you see and guide you in real time?
That idea led to the creation of OmniGuide AI, a real‑time multimodal assistant powered by the Gemini Live API and deployed using Google Cloud Run.
Instead of typing questions into a chatbot, users simply:
- Point their phone camera at a problem
- Ask a question using voice
- Receive live spoken guidance and visual overlays
OmniGuide acts like an expert standing beside you, helping with tasks such as repairing devices, cooking, learning, or troubleshooting.
This article explains how we built OmniGuide AI using Google AI models and Google Cloud for the #GeminiLiveAgentChallenge.
The Idea
Most AI assistants today require typing prompts, but real‑world problems happen in physical environments:
- Fixing a leaking pipe
- Understanding a device error
- Cooking a recipe
- Solving homework
OmniGuide AI bridges the gap by combining:
- Live camera input
- Voice interaction
- AI reasoning
- Real‑time guidance
Tech Stack
AI Model
Gemini 1.5 Flash – used for vision understanding, voice conversation, context reasoning, and real‑time instruction generation.
Streaming AI Interface
Gemini Live API – allows the app to process video frames, audio input, and real‑time prompts.
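As a rough illustration of the audio side, the sketch below converts microphone samples into raw 16-bit little-endian PCM bytes before they are streamed. It assumes the capture layer delivers floating-point samples in the range -1.0 to 1.0 and that the streaming endpoint expects 16-bit PCM; the function name `floats_to_pcm16` is hypothetical and not part of any Gemini SDK.

```python
import struct
from typing import Sequence


def floats_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM.

    Hypothetical helper: the input format (floats from the capture layer)
    and the target format (16-bit PCM) are assumptions for this sketch.
    """
    ints = []
    for sample in samples:
        # Clip out-of-range values so they cannot overflow a signed 16-bit int.
        clipped = max(-1.0, min(1.0, sample))
        ints.append(int(round(clipped * 32767)))
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return struct.pack("<%dh" % len(ints), *ints)
```

Each sample becomes two bytes, so a chunk of 1,600 samples produces 3,200 bytes of payload.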
Backend Infrastructure
Google Cloud Run – hosts the containerized backend and provides autoscaling endpoints, fast container deployment, and low‑latency API routing.
Frontend
- WebRTC for camera streaming
- WebSockets for real‑time AI responses
- React for UI
- Canvas overlays for visual guidance
Architecture
High‑level system flow:
- The user opens OmniGuide.
- The camera stream starts.
- Voice input is captured.
- Frames and audio are sent to the Gemini Live API.
- Gemini analyzes the scene.
- The AI generates instructions.
- A voice response and an overlay are returned to the user.
Result: AI guidance in real time.
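One way to picture the "frames + audio sent" step is as a stream of small messages over a WebSocket. The sketch below shows a possible message envelope and its round trip between client and backend. The `MediaChunk` type, the field names, and the JSON plus base64 encoding are all assumptions made for illustration, not the actual OmniGuide wire format or a Gemini Live API type.

```python
import base64
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass
class MediaChunk:
    """One piece of captured media (hypothetical envelope)."""
    kind: str       # "video" or "audio"
    mime_type: str  # e.g. "image/jpeg" or "audio/pcm"
    data: bytes     # encoded bytes from the camera or microphone


def encode_chunk(chunk: MediaChunk, seq: int) -> str:
    """Serialize a chunk to a JSON text message for a WebSocket."""
    return json.dumps({
        "seq": seq,  # sequence number so the receiver can detect gaps
        "kind": chunk.kind,
        "mimeType": chunk.mime_type,
        # JSON cannot carry raw bytes, so the payload is base64-encoded.
        "data": base64.b64encode(chunk.data).decode("ascii"),
    })


def decode_chunk(message: str) -> Tuple[int, MediaChunk]:
    """Inverse of encode_chunk, as a backend might run it."""
    payload = json.loads(message)
    chunk = MediaChunk(
        kind=payload["kind"],
        mime_type=payload["mimeType"],
        data=base64.b64decode(payload["data"]),
    )
    return payload["seq"], chunk
```

Base64 adds roughly a third to the payload size, so a binary WebSocket frame would be a reasonable alternative when bandwidth matters.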
Key Features
Real‑Time Visual Understanding
Gemini analyzes live camera frames to understand objects and environments.
Voice Interaction
Users can simply ask, for example:
- “What is this error?”
- “How do I fix this?”
Step‑by‑Step Guidance
The AI provides instructions such as:
- Pointing to the correct component
- Highlighting objects
- Describing the next step
Visual Overlays
On‑screen guides help users follow instructions easily.
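To draw an overlay, the app has to map whatever region the model points at onto canvas pixels. The sketch below assumes the model returns a bounding box as `[ymin, xmin, ymax, xmax]` normalized to a 0 to 1000 range; that format, the `Rect` type, and the `box_to_canvas` name are assumptions for illustration rather than details taken from the OmniGuide code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rect:
    """Pixel rectangle in canvas coordinates (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int


def box_to_canvas(box: Sequence[int], canvas_width: int,
                  canvas_height: int, scale: int = 1000) -> Rect:
    """Map a normalized [ymin, xmin, ymax, xmax] box to canvas pixels.

    The box layout and the 0..scale normalization are assumptions.
    """
    ymin, xmin, ymax, xmax = box
    return Rect(
        x=round(xmin * canvas_width / scale),
        y=round(ymin * canvas_height / scale),
        width=round((xmax - xmin) * canvas_width / scale),
        height=round((ymax - ymin) * canvas_height / scale),
    )
```

For example, `box_to_canvas([250, 100, 750, 600], 1280, 720)` yields a rectangle at (128, 180) that is 640 pixels wide and 360 pixels tall, which the frontend could then stroke on the canvas.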
Example Use Cases
- Home Repair – Point the camera at a leaking pipe and ask, “How do I fix this?”
- Cooking – Show ingredients and ask, “What can I cook with these?”
- Education – Students can show math problems or experiments.
- Device Troubleshooting – Scan an error message and get suggested fixes in real time.
Challenges We Faced
Real‑Time Latency
Running AI inference on live video required careful optimization to keep response times low.
We solved this by:
- Compressing frames
- Streaming only key frames
- Using Gemini Flash for faster responses
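The "streaming only key frames" idea can be sketched as a simple change detector: a frame is sent only when it differs enough from the last frame that was sent. The version below is a minimal illustration that assumes each frame is available as equal-length grayscale bytes; the mean-absolute-difference metric, the threshold of 10, and the function names are assumptions, not the selection logic OmniGuide actually uses.

```python
from typing import Iterable, Iterator, Optional


def mean_abs_diff(a: bytes, b: bytes) -> float:
    """Average per-pixel absolute difference between two grayscale frames."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_key_frames(frames: Iterable[bytes],
                      threshold: float = 10.0) -> Iterator[int]:
    """Yield indices of frames that changed enough to be worth sending.

    The first frame is always sent; later frames are compared against
    the most recently sent frame, not merely the previous one, so slow
    drift still triggers an update eventually.
    """
    last_sent: Optional[bytes] = None
    for index, frame in enumerate(frames):
        if last_sent is None or mean_abs_diff(frame, last_sent) >= threshold:
            last_sent = frame
            yield index
```

A lower threshold sends more frames and tracks the scene more closely, while a higher one saves bandwidth at the cost of missing small changes.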
Multimodal Context
Ensuring Gemini correctly interprets visual context required structured prompts and scene summaries.
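A structured prompt of this kind might bundle a short scene summary with the user's question on every turn. The sketch below shows one possible layout; the field names, the instruction wording, and the `build_turn_prompt` helper are hypothetical and only illustrate the general approach.

```python
import json
from typing import List


def build_turn_prompt(question: str, scene_summary: str,
                      visible_objects: List[str],
                      previous_steps: List[str]) -> str:
    """Assemble a text prompt that pairs scene context with the question.

    Hypothetical layout: context goes in as JSON so it stays clearly
    separated from the user's own words.
    """
    context = {
        "scene_summary": scene_summary,
        "visible_objects": visible_objects,
        "steps_already_given": previous_steps,
    }
    return "\n".join([
        "You are guiding a user through a hands-on task using their live camera feed.",
        "CONTEXT (JSON):",
        json.dumps(context, indent=2),
        "USER QUESTION:",
        question.strip(),
        "Reply with one short next step that can be spoken aloud.",
    ])
```

Carrying `steps_already_given` forward is one simple way to keep the model from repeating instructions across turns.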
What Makes OmniGuide Unique
OmniGuide transforms AI from a chat interface into a real‑time expert assistant. Instead of searching online tutorials, users simply show the problem and ask for help.
What’s Next
Future improvements include:
- AR overlays
- Smart object detection
- Multi‑step task memory
- Collaborative remote assistance
Conclusion
OmniGuide AI demonstrates how Google AI models and Google Cloud can power the next generation of multimodal live agents. By combining vision, voice, and reasoning, we move beyond chatbots into AI that understands the physical world.
This article was created for the purposes of entering the #GeminiLiveAgentChallenge.