I Built an AI Tutor That Actually Sees Your Homework — Here's How
Source: Dev.to
A few weeks ago I was watching my younger cousin struggle through a physics worksheet. She kept typing questions into ChatGPT, getting a wall of text back, and still looking confused. It hit me: why can’t she just show the problem to an AI and have it talk her through it like a real tutor would? That question became VisionSolve (SolveTutor), which I built for the Gemini Live Agent Challenge.
The Idea: What if AI Could See and Speak?
Most AI tutoring tools work through text boxes—you type your question, you get a text response. But that’s not how tutoring works in real life. A real tutor looks at your paper, listens to your confusion, and talks you through it step by step, adjusting when you’re lost.
I wanted to build exactly that—an AI tutor that:
- Sees your homework through your camera
- Listens to your questions through your microphone
- Speaks explanations back to you, naturally
No typing required. Just point your phone at a math problem and start talking.
Why Gemini Live API Was Perfect for This
I’d been experimenting with different LLM APIs, and when I found the Gemini Live API, it clicked immediately. Most APIs are request‑response: send text, get text back. Gemini’s Live API instead opens a persistent bidirectional stream. You send audio and video frames continuously, and the model responds with audio in real time.
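To make "persistent bidirectional stream" concrete, here is a minimal sketch of the pattern using only asyncio. FakeLiveSession is a hypothetical stand-in I made up for illustration; it is not the Gemini SDK, and its method names are assumptions. The point is the shape: one open session, with sending and receiving running as two concurrent tasks instead of one request followed by one response.

```python
import asyncio


class FakeLiveSession:
    """Hypothetical stand-in for a live model session (NOT the Gemini SDK)."""

    def __init__(self) -> None:
        self._outbox = asyncio.Queue()

    async def send(self, chunk: bytes) -> None:
        # Pretend the model answers every input chunk with one audio chunk.
        await self._outbox.put(b"reply-to:" + chunk)

    async def close_input(self) -> None:
        # A None sentinel tells the receiver that no more replies will follow.
        await self._outbox.put(None)

    async def receive(self):
        while (chunk := await self._outbox.get()) is not None:
            yield chunk


async def run_stream(inputs: list) -> list:
    session = FakeLiveSession()
    replies = []

    async def uplink() -> None:
        # Keep pushing media chunks without waiting for a full response.
        for chunk in inputs:
            await session.send(chunk)
            await asyncio.sleep(0)  # yield so the downlink can interleave
        await session.close_input()

    async def downlink() -> None:
        # Collect replies as they arrive, independently of the uplink.
        async for chunk in session.receive():
            replies.append(chunk)

    # Both directions share one session and run concurrently.
    await asyncio.gather(uplink(), downlink())
    return replies


if __name__ == "__main__":
    print(asyncio.run(run_stream([b"mic-1", b"frame-1"])))
```

In a real app the uplink would carry microphone audio and camera frames, and the downlink would carry the model's spoken reply.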
Key features that made it ideal:
- Native audio output – the model produces audio directly, so explanations sound like a person talking, not a robot reading a script.
- Barge‑in support – students can interrupt (“wait, what?”) and the model stops, listens, and responds gracefully, without complex state management.
The Stack
Backend
- Google ADK (Agent Development Kit) – defined the agent with a system instruction, added tools, and let ADK handle session management and Google Search grounding.
- FastAPI + WebSockets – the frontend connects via WebSocket; the backend proxies audio/video to Gemini Live and streams audio responses back.
- Firebase Firestore – stores session transcripts for later review.
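For the WebSocket proxy, the browser and backend need to agree on how audio chunks and video frames are labeled on the wire. The sketch below shows one possible envelope: a JSON text message with base64 media. The field names ("type", "mime_type", "data") and the helpers encode_frame and decode_frame are hypothetical and chosen for illustration; the post does not describe the project's actual message format.

```python
import base64
import json

ALLOWED_KINDS = {"audio", "video"}


def encode_frame(kind: str, payload: bytes, mime_type: str) -> str:
    """Wrap raw media bytes in a JSON text message (hypothetical envelope)."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unsupported frame type: {kind!r}")
    return json.dumps({
        "type": kind,
        "mime_type": mime_type,
        "data": base64.b64encode(payload).decode("ascii"),
    })


def decode_frame(message: str) -> tuple:
    """Return (kind, payload_bytes, mime_type) from an encoded message."""
    frame = json.loads(message)
    if frame.get("type") not in ALLOWED_KINDS:
        raise ValueError(f"unsupported frame type: {frame.get('type')!r}")
    return frame["type"], base64.b64decode(frame["data"]), frame["mime_type"]


if __name__ == "__main__":
    wire = encode_frame("video", b"\xff\xd8 fake jpeg bytes", "image/jpeg")
    print(decode_frame(wire))
```

Base64 inflates the payload by roughly a third, so binary WebSocket frames are a reasonable alternative if bandwidth matters. A JSON envelope is simply easier to inspect while debugging.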
Frontend
- Next.js + TypeScript – a clean, mobile‑responsive interface with webcam feed, audio visualizer, and chat transcript.
- Firebase Auth – Google Sign‑In for authentication.
Model
gemini-2.5-flash-native-audio – the latest native‑audio model, fast enough for real‑time conversation and capable of understanding handwritten math from a shaky phone camera.
Things That Surprised Me
The vision capabilities are seriously good
I expected the model to struggle with messy handwritten math, but it didn’t. In my testing it correctly read scribbled algebra, printed calculus, and even chemistry diagrams, including when the camera was angled or the lighting was poor.
Natural interruptions just work
I was nervous about handling student interruptions. The Live API’s barge‑in support means the model automatically pauses, listens, and responds when the student talks—no complex state machine needed.
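One piece that usually still lives in the app is the playback side: audio that has already arrived but not yet played should be dropped when the student cuts in, or the old explanation keeps talking. Here is a minimal sketch of that idea. The event dictionaries with "audio" and "interrupted" keys are a hypothetical shape I chose for illustration, not the Live API's real message format, and PlaybackQueue is not taken from the project.

```python
from collections import deque
from typing import Optional


class PlaybackQueue:
    """Buffers model audio for playback and flushes it on interruption."""

    def __init__(self) -> None:
        self._chunks = deque()

    def handle_event(self, event: dict) -> None:
        # Hypothetical event shape: {"interrupted": True} or {"audio": bytes}.
        if event.get("interrupted"):
            # The student started talking: discard everything not yet played.
            self._chunks.clear()
        elif "audio" in event:
            self._chunks.append(event["audio"])

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None


if __name__ == "__main__":
    queue = PlaybackQueue()
    queue.handle_event({"audio": b"step one..."})
    queue.handle_event({"audio": b"step two..."})
    queue.handle_event({"interrupted": True})
    print(queue.next_chunk())
```

The model side handles the conversational part of the interruption; this only keeps stale audio from playing over the student.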
The hardest part wasn’t the AI
The AI side was smoother than expected thanks to ADK and the Live API. The real challenge was WebSocket audio streaming—browsers are finicky about microphone permissions, Safari behaves oddly, and overall it was classic web‑dev pain.
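A typical example of that pain is sample format. Browsers generally hand you microphone audio as floating-point samples between -1.0 and 1.0, while streaming backends often want raw 16-bit PCM. The post doesn't say which format this app sends, so treat the sketch below as an assumption-laden illustration of the conversion, with float_to_pcm16 as a hypothetical helper name. It is written in Python to match the backend, though the same arithmetic applies in the browser.

```python
import struct


def float_to_pcm16(samples: list) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    ints = []
    for sample in samples:
        # Clamp first so out-of-range input cannot overflow the 16-bit range.
        clamped = max(-1.0, min(1.0, sample))
        ints.append(int(round(clamped * 32767)))
    # "<" means little-endian, "h" means signed 16-bit integer.
    return struct.pack(f"<{len(ints)}h", *ints)


if __name__ == "__main__":
    pcm = float_to_pcm16([0.0, 0.5, -0.5, 1.0])
    print(len(pcm), "bytes")
```

Each sample becomes two bytes, so a chunk of N samples produces 2N bytes on the wire.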
Google Cloud Deployment
- Backend runs on Cloud Run with Vertex AI integration.
- Frontend is hosted on Firebase Hosting.
- CI/CD pipeline via GitHub Actions: pushing a tag builds a Docker image, pushes it to GCR, deploys to Cloud Run, builds the frontend with the new backend URL injected, and deploys to Firebase. The whole workflow takes about 4 minutes.
The full pipeline definition is in the project repository.
What I’d Do Differently
If I had more time, I’d add:
- Drawing/annotation support – let the tutor (“Sol”) highlight parts of the image while explaining.
- Progress tracking – monitor which topics the student struggles with over time.
Try It Out
The project is open source: github.com/dev-phantom/VisionSolve.
The README includes full setup instructions for running locally; you’ll need a Gemini API key (free from Google AI Studio) and a Firebase project.
If you’re thinking about building something with the Gemini Live API, go for it. Real‑time audio + vision opens up use cases that weren’t possible with traditional request‑response APIs.
Built for the #GeminiLiveAgentChallenge using Google Gemini, ADK, Firebase, and Cloud Run.