Build a Vision AI Agent with Gemini 3 in < 3 Minutes
Source: Dev.to
Stream released support for Google’s new Gemini 3 models inside Vision Agents — the open‑source Python framework for building real‑time voice and video AI applications.
In this 3‑minute video demo, you’ll see how to spin up a fully functional vision‑enabled voice agent that can see your screen (or webcam), reason with Gemini 3 Pro Preview, and talk back to you naturally, all in pure Python.
What You’ll Learn
- Install Vision Agents (GitHub repo) + the new Gemini plugin
- Use gemini-3-pro-preview as your LLM with a single line
- Build a live video‑call agent that can see and describe anything on your screen in real time
- Customize reasoning depth (low/high thinking level)
Get Started in 60 Seconds
- Create a fresh project (we recommend uv).

# Initialize a new Python project
uv init

# Activate your environment
uv venv && source .venv/bin/activate

- Install Vision Agents + required plugins.

# Install Vision Agents
uv add vision-agents

# Install required plugins
uv add "vision-agents[getstream, gemini, elevenlabs, deepgram, smart-turn]"
You’ll also need:
- A free Gemini API key →
- A free Stream account (for the video call UI) →
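The sample code below calls load_dotenv(), so the quickest setup is a .env file in your project root. The variable names here are assumptions based on each provider's usual conventions; confirm the exact names the Vision Agents plugins read in their docs:

# .env (illustrative variable names; verify against each plugin's docs)
GOOGLE_API_KEY=your-gemini-api-key
STREAM_API_KEY=your-stream-api-key
STREAM_API_SECRET=your-stream-api-secret
ELEVENLABS_API_KEY=your-elevenlabs-api-key
DEEPGRAM_API_KEY=your-deepgram-api-key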
Minimal Working Example
Rename your main.py to gemini_vision_demo.py and replace its content with the sample code below.
import asyncio
import logging

from dotenv import load_dotenv

from vision_agents.core import User, Agent, cli
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import elevenlabs, getstream, smart_turn, gemini, deepgram

logger = logging.getLogger(__name__)
load_dotenv()


async def create_agent(**kwargs) -> Agent:
    """Create the agent with Gemini 3 as the LLM and ElevenLabs TTS."""
    agent = Agent(
        edge=getstream.Edge(),  # Stream handles the real-time video transport
        agent_user=User(name="Friendly AI", id="agent"),
        instructions=(
            "You are a friendly AI assistant powered by Gemini 3. "
            "You are able to answer questions and help with tasks. "
            "You carefully observe the user's camera feed and respond to their questions and tasks."
        ),
        tts=elevenlabs.TTS(),  # text-to-speech
        stt=deepgram.STT(),  # speech-to-text
        # Gemini 3 model
        llm=gemini.LLM("gemini-3-pro-preview"),
        turn_detection=smart_turn.TurnDetection(),  # detects when the user stops speaking
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)
    logger.info("🤖 Starting Gemini Vision Agent...")
    with await agent.join(call):
        logger.info("Joining call")
        logger.info("LLM ready")
        await asyncio.sleep(5)
        await agent.llm.simple_response(text="Describe what you currently see")
        await agent.finish()  # Run till the call ends


if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
Run it:
uv run gemini_vision_demo.py
A browser tab opens with a Stream video call. Click “Join call”, grant camera/mic/screen permissions, and say something like:
“Okay, I’m going to share my screen — tell me what you see!”
Gemini 3 will instantly analyze your screen and respond with detailed descriptions in a natural spoken voice.
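To customize reasoning depth (the low/high thinking level mentioned earlier), Gemini 3 exposes a thinking_level setting. Whether the Vision Agents gemini plugin forwards this option isn't shown in this demo, so here's a minimal sketch that calls the google-genai SDK directly instead; it assumes types.ThinkingConfig(thinking_level=...) as documented for Gemini 3:

from google import genai
from google.genai import types

# Assumes GOOGLE_API_KEY (or GEMINI_API_KEY) is set in your environment.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="In one sentence, what can a vision agent do?",
    config=types.GenerateContentConfig(
        # "low" trades reasoning depth for speed; "high" thinks longer.
        thinking_config=types.ThinkingConfig(thinking_level="low")
    ),
)
print(response.text)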
Gemini 3 brings better reasoning and multimodal understanding, and Vision Agents makes it simple to turn that power into interactive voice/video experiences. No React, no WebRTC boilerplate—just Python.
Try it today! 🚀