Build a Vision AI Agent with Gemini 3 in Under 3 Minutes

Published: December 4, 2025, 00:49 GMT+8

4 min read

Original: Dev.to

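Stream has added support for Google's new Gemini 3 model to Vision Agents, an open-source Python framework for building real-time voice and video AI applications.

In this 3-minute video demo, you'll see how to spin up a fully functional, vision-enabled voice agent that can see your screen (or webcam), reason with Gemini 3 Pro Preview, and talk with you naturally, all in pure Python.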

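What you'll learn

Install Vision Agents (GitHub repo) + the new Gemini plugin
Use gemini-3-pro-preview as your LLM with a single line of code
Build a real-time video call agent that can see and describe whatever is on your screen as it happens
Customize reasoning depth (low/high thinking level)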

60-second quick start

Create a fresh project (we recommend uv).

# Initialize a new Python project
uv init

# Create and activate a virtual environment
uv venv && source .venv/bin/activate

Install Vision Agents + the required plugins.

# Install Vision Agents
uv add vision-agents

# Install the required plugins
uv add "vision-agents[getstream, gemini, elevenlabs, deepgram, smart-turn]"

You'll also need:

A free Gemini API key
A free Stream account (for the video call UI)
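
Because the demo below calls load_dotenv(), the keys can live in a .env file next to the script. A minimal sketch of a pre-flight check follows. The variable names in REQUIRED_VARS are assumptions made for illustration (including the guess that the ElevenLabs and Deepgram plugins each need their own key), and missing_env_vars and require_env are hypothetical helpers, not part of Vision Agents. Check each plugin's documentation for the names it actually reads.

```python
from typing import Mapping, Sequence

# Assumed variable names, for illustration only. Each plugin's documentation
# is the authority on which environment variables it really reads.
REQUIRED_VARS = (
    "GOOGLE_API_KEY",
    "STREAM_API_KEY",
    "STREAM_API_SECRET",
    "ELEVENLABS_API_KEY",
    "DEEPGRAM_API_KEY",
)


def missing_env_vars(required: Sequence[str], env: Mapping[str, str]) -> list[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


def require_env(required: Sequence[str], env: Mapping[str, str]) -> None:
    """Raise a RuntimeError that lists every missing variable at once."""
    missing = missing_env_vars(required, env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

One way to use it is to call require_env(REQUIRED_VARS, os.environ) right after load_dotenv(), so a missing key is reported before the agent tries to join a call.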

Minimal working example

Rename your main.py to gemini_vision_demo.py and replace its contents with the sample code below.

import asyncio
import logging

from dotenv import load_dotenv
from vision_agents.core import User, Agent, cli
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import elevenlabs, getstream, smart_turn, gemini, deepgram

logger = logging.getLogger(__name__)

load_dotenv()

async def create_agent(**kwargs) -> Agent:
    """Create a Gemini 3 agent with ElevenLabs TTS and Deepgram STT."""
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Friendly AI", id="agent"),
        instructions=(
            "You are a friendly AI assistant powered by Gemini 3. "
            "You are able to answer questions and help with tasks. "
            "You carefully observe a user's camera feed and respond to their questions and tasks."
        ),
        tts=elevenlabs.TTS(),
        stt=deepgram.STT(),
        # Gemini 3 model
        llm=gemini.LLM("gemini-3-pro-preview"),
        turn_detection=smart_turn.TurnDetection(),
    )
    return agent

async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    """Join the call and start the agent."""
    await agent.create_user()
    call = await agent.create_call(call_type, call_id)

    logger.info("🤖 Starting Gemini 3 Vision Agent...")

    with await agent.join(call):
        logger.info("Joining call")
        logger.info("LLM ready")

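        # Wait a few seconds before sending the first prompt
        await asyncio.sleep(5)
        # Ask the model to describe the current video frame
        await agent.llm.simple_response(text="Describe what you currently see")
        await agent.finish()  # Run till the call ends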

if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))

Run it:

uv run gemini_vision_demo.py

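Your browser opens a Stream video call page. Click "Join call", grant camera/microphone/screen permissions, then say something like:

"OK, I'm going to share my screen. Tell me what you see!"

Gemini 3 will analyze your screen right away and describe it in detail in a natural-sounding voice.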

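Links and resources

Gemini 3 brings stronger reasoning and multimodal understanding, and Vision Agents makes it easy to turn those capabilities into interactive voice/video experiences. No React, no WebRTC boilerplate, just Python.

Give it a try today! 🚀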
