Build a Vision AI Agent with Gemini 3 in < 3 Minutes

Published: December 3, 2025 at 11:49 AM EST
3 min read
Source: Dev.to

Stream released support for Google’s new Gemini 3 models inside Vision Agents — the open‑source Python framework for building real‑time voice and video AI applications.

In this 3‑minute video demo, you’ll see how to spin up a fully functional vision‑enabled voice agent that can see your screen (or webcam), reason with Gemini 3 Pro Preview, and talk back to you naturally, all in pure Python.

What You’ll Learn

  • Install Vision Agents (GitHub repo) + the new Gemini plugin
  • Use gemini-3-pro-preview as your LLM with a single line
  • Build a live video‑call agent that can see and describe anything on your screen in real time
  • Customize reasoning depth (low/high thinking level; see the sketch below)
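
For reference, here's how reasoning depth is set when you call Gemini 3 directly through the google-genai SDK's new thinking_level parameter. This is a minimal sketch against the raw Gemini API; how (or whether) the Vision Agents gemini plugin exposes the same knob is something to confirm in the plugin's docs.

# Sketch: controlling Gemini 3 reasoning depth ("low" or "high")
# via the google-genai SDK's thinking_level parameter.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",
    contents="Summarize what a turn-detection model does in one sentence.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_level="low")
    ),
)
print(response.text)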

Get Started in 60 Seconds

  1. Create a fresh project (we recommend uv).

    # Initialize a new Python project
    uv init
    
    # Activate your environment
    uv venv && source .venv/bin/activate
  2. Install Vision Agents + required plugins.

    # Install Vision Agents
    uv add vision-agents
    
    # Install required plugins
    uv add "vision-agents[getstream, gemini, elevenlabs, deepgram, smart-turn]"

You’ll also need:

  • A free Gemini API key
  • A free Stream account (for the video call UI)
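
The demo script loads credentials with load_dotenv(), so put your keys in a .env file next to it. The variable names below are the usual defaults for each vendor's SDK, not verified against every plugin, so double-check the plugin docs:

# .env (loaded by load_dotenv() in the demo script)
# Variable names are typical SDK defaults; confirm in each plugin's docs.
GEMINI_API_KEY=your-gemini-api-key
STREAM_API_KEY=your-stream-api-key
STREAM_API_SECRET=your-stream-api-secret
ELEVENLABS_API_KEY=your-elevenlabs-api-key
DEEPGRAM_API_KEY=your-deepgram-api-key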

Minimal Working Example

Rename your main.py to gemini_vision_demo.py and replace its content with the sample code below.

import asyncio
import logging

from dotenv import load_dotenv
from vision_agents.core import User, Agent, cli
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import elevenlabs, getstream, smart_turn, gemini, deepgram

logger = logging.getLogger(__name__)

load_dotenv()

async def create_agent(**kwargs) -> Agent:
    """Create the agent with Inworld AI TTS."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Friendly AI", id="agent"),
        instructions=(
            "You are a friendly AI assistant powered by Gemini 3. "
            "You are able to answer questions and help with tasks. "
            "You carefully observe a users' camera feed and respond to their questions and tasks."
        ),
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(),
        # Gemini 3 model
        llm=gemini.LLM("gemini-3-pro-preview"),
        turn_detection=smart_turn.TurnDetection(),
    )
    return agent

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)

    logger.info("🤖 Starting Inworld AI Agent...")

    with await agent.join(call):
        logger.info("Joining call")
        logger.info("LLM ready")

        await asyncio.sleep(5)
        await agent.llm.simple_response(text="Describe what you currently see")
        await agent.finish()  # Run till the call ends

if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))

Run it:

uv run gemini_vision_demo.py

A browser tab opens with a Stream video call. Click “Join call”, grant camera/mic/screen permissions, and say something like:

“Okay, I’m going to share my screen — tell me what you see!”

Gemini 3 will instantly analyze your screen and respond with detailed descriptions in a natural spoken voice.
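
The kick-off prompt is nothing more than the agent.llm.simple_response() call inside join_call, so you can script further prompts the same way. Here's a sketch of a variant that re-prompts the agent a few times during the call; it reuses only the APIs from the example above, and the timing loop is purely illustrative:

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Variant: periodically re-prompt the agent (illustrative timings)."""
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)

    with await agent.join(call):
        await asyncio.sleep(5)
        await agent.llm.simple_response(text="Describe what you currently see")

        # Illustrative: ask for an update a few times during the call
        for _ in range(3):
            await asyncio.sleep(30)
            await agent.llm.simple_response(text="Briefly describe anything that changed.")

        await agent.finish()  # Run till the call ends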

Gemini 3 brings better reasoning and multimodal understanding, and Vision Agents makes it simple to turn that power into interactive voice/video experiences. No React, no WebRTC boilerplate—just Python.

Try it today! 🚀
