Building Multi-Model AI Agents with OpenAI, Ollama, Groq and Gemini

Published: March 1, 2026 at 05:23 PM EST
7 min read
Source: Dev.to

Introduction

Most AI applications today rely on a single LLM provider. That works fine until the API goes down, rate limits are hit, or your costs spiral out of control. A better approach is to build agents that can orchestrate multiple models and switch between them based on the task at hand.

In this article I will walk through how I built an AI‑agent framework that supports:

  • OpenAI GPT‑4 – best reasoning and function‑calling
  • Ollama – runs locally with no network latency and no API costs
  • Groq – sub‑200 ms inference for real‑time applications
  • Google Gemini – excels at multimodal tasks (vision, audio, code)

By abstracting the provider layer, your agent can:

  • Pick the right model for each sub‑task
  • Fall back gracefully when one provider fails
  • Optimize cost by routing simple tasks to cheaper models

Architecture Overview

The framework has four main components:

Agent Core → Planning → Tool Execution → Memory
     |            |            |              |
  LLM Router   Task Graph   Registry     Redis / PostgreSQL
     |
  OpenAI | Ollama | Groq | Gemini

The LLM Router is the key piece. It decides which provider handles each request based on configurable rules.


1️⃣ Common Provider Interface

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class LLMResponse:
    content: str
    model: str
    provider: str
    tokens_used: int
    latency_ms: float
    tool_calls: list | None = None  # populated when the model requests tool use

    @property
    def has_tool_calls(self) -> bool:
        return bool(self.tool_calls)

class LLMProvider(ABC):
    @abstractmethod
    async def complete(self, messages: list, tools: list | None = None) -> LLMResponse:
        """Generate a completion for the given messages."""
        ...

    @abstractmethod
    async def embed(self, text: str) -> list[float]:
        """Return an embedding vector for the supplied text."""
        ...

All concrete providers implement this interface.
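
Because every provider shares this interface, a stub implementation is enough to exercise the rest of the framework in tests. Here is a hypothetical `EchoProvider` (the class name and canned values are mine, not part of the framework); the response type is repeated so the snippet runs standalone:

```python
import asyncio
from dataclasses import dataclass

# Minimal copy of the response type so this snippet is self-contained.
@dataclass
class LLMResponse:
    content: str
    model: str
    provider: str
    tokens_used: int
    latency_ms: float

class EchoProvider:
    """Stub provider: same shape as LLMProvider, but no network calls."""
    async def complete(self, messages, tools=None):
        return LLMResponse(
            content=messages[-1]["content"],  # echo the last message back
            model="echo",
            provider="echo",
            tokens_used=0,
            latency_ms=0.0,
        )

    async def embed(self, text: str):
        return [0.0]  # fixed dummy vector

resp = asyncio.run(EchoProvider().complete([{"role": "user", "content": "ping"}]))
print(resp.content)  # prints "ping"
```

Swapping a stub like this into the router makes the whole agent loop testable without API keys.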


2️⃣ Provider Implementations

OpenAI

import time
import openai
import logging

logger = logging.getLogger(__name__)

class OpenAIProvider(LLMProvider):
    def __init__(self, model: str = "gpt-4"):
        self.client = openai.AsyncOpenAI()
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        # Only forward `tools` when the caller supplied any.
        kwargs = {"tools": tools} if tools else {}
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            **kwargs,
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content or "",
            model=self.model,
            provider="openai",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency,
        )

    async def embed(self, text: str):
        # Uses OpenAI's embeddings endpoint; swap the model name as needed.
        response = await self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text,
        )
        return response.data[0].embedding

Ollama

import time
import ollama

class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3"):
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        response = await ollama.AsyncClient().chat(
            model=self.model,
            messages=messages,
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response["message"]["content"],
            model=self.model,
            provider="ollama",
            tokens_used=response.get("eval_count", 0),
            latency_ms=latency,
        )

    async def embed(self, text: str):
        # Ollama serves embeddings locally as well.
        response = await ollama.AsyncClient().embeddings(model=self.model, prompt=text)
        return response["embedding"]

Groq

import time
from groq import AsyncGroq

class GroqProvider(LLMProvider):
    def __init__(self, model: str = "llama3-70b-8192"):
        # Use the async client so `await` actually yields during the request.
        self.client = AsyncGroq()
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            model=self.model,
            provider="groq",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency,
        )

    async def embed(self, text: str):
        raise NotImplementedError

Gemini

import os, time
import google.generativeai as genai

class GeminiProvider(LLMProvider):
    def __init__(self, model: str = "gemini-pro"):
        genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
        self.model_name = model
        self.model = genai.GenerativeModel(model)

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        # Simplification: only the last message is sent. For true multi-turn
        # conversations, use the SDK's chat sessions instead.
        response = await self.model.generate_content_async(messages[-1]["content"])
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.text,
            model=self.model_name,
            provider="gemini",
            tokens_used=0,          # usage metadata not collected in this sketch
            latency_ms=latency,
        )

    async def embed(self, text: str):
        raise NotImplementedError

3️⃣ LLM Router

import logging

logger = logging.getLogger(__name__)

class LLMRouter:
    def __init__(self, providers: dict[str, LLMProvider]):
        """
        providers: mapping from provider name (e.g., "openai") to an LLMProvider instance.
        """
        self.providers = providers
        self.fallback_order = ["openai", "groq", "ollama", "gemini"]

    async def route(self, messages, task_type: str = "general", tools=None) -> LLMResponse:
        """Select a provider based on task_type and attempt completion with fallbacks."""
        primary = self._select_provider(task_type)

        for name in self._fallback_chain(primary):
            provider = self.providers.get(name)
            if provider is None:  # skip providers that aren't configured
                continue
            try:
                return await provider.complete(messages, tools)
            except Exception as e:
                logger.warning(f"{name} failed: {e}; trying next provider")
                continue

        raise RuntimeError("All providers failed")

    def _select_provider(self, task_type: str) -> str:
        routing_rules = {
            "reasoning": "openai",
            "realtime":  "groq",
            "local":     "ollama",
            "vision":    "gemini",
            "general":   "openai",
        }
        return routing_rules.get(task_type, "openai")

    def _fallback_chain(self, primary: str) -> list[str]:
        """Return a list starting with the primary provider followed by the rest of the fallback order."""
        chain = [primary]
        for name in self.fallback_order:
            if name != primary:
                chain.append(name)
        return chain

4️⃣ Agent Core

class Agent:
    def __init__(self, router: LLMRouter, tools: "ToolRegistry", memory: "Memory"):
        self.router = router
        self.tools = tools
        self.memory = memory

    async def execute(self, task: str) -> str:
        """
        High‑level entry point:
        1. Retrieve relevant context from memory.
        2. Build a planning graph.
        3. Use the router to get LLM responses.
        4. Invoke tools as needed.
        """
        # Placeholder – fill in with your planning / tool‑execution logic
        raise NotImplementedError

(The implementation of ToolRegistry and Memory is omitted for brevity.)


🎯 Takeaways

  • Abstraction – A thin, common interface lets you swap providers without touching the rest of the code.
  • Routing + Fallback – Choose the best model for the job and automatically recover from outages.
  • Cost & Latency Optimization – Route cheap, fast tasks to Ollama or Groq, and reserve GPT‑4 for heavy reasoning.

With this pattern you can build robust, multi‑model agents that stay responsive and cost‑effective even when a single provider experiences trouble. Happy building!

Agent Execution Flow

async def execute(self, task: str) -> str:
    context = await self.memory.get_relevant(task)

    messages = [
        {"role": "system", "content": self._build_system_prompt(context)},
        {"role": "user",   "content": task}
    ]

    while True:
        response = await self.router.route(
            messages,
            task_type=self._classify_task(task),
            tools=self.tools.get_schemas()
        )

        if not response.has_tool_calls:
            break

        tool_results = await self.tools.execute(response.tool_calls)
        messages.extend(tool_results)

    await self.memory.store(task, response.content)
    return response.content
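
The loop above calls `self._classify_task`, which isn't shown in the article. One plausible implementation is a simple keyword heuristic over the router's task types (the keyword lists here are my guesses, not the author's):

```python
# A possible _classify_task heuristic: map task text to a router task type.
# Keyword lists are illustrative; tune them for your workload.
def classify_task(task: str) -> str:
    text = task.lower()
    if any(k in text for k in ("image", "screenshot", "diagram", "photo")):
        return "vision"
    if any(k in text for k in ("analyze", "debug", "architect", "explain why")):
        return "reasoning"
    if any(k in text for k in ("chat", "quick answer", "summarize")):
        return "realtime"
    return "general"
```

Wired in as `Agent._classify_task` (with a `self` parameter), this keeps GPT-4 reserved for the tasks that actually need it.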

Tool Registry

Tools give the agent the ability to interact with external systems:

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name: str, func, schema: dict):
        self._tools[name] = {"func": func, "schema": schema}

    async def execute(self, tool_calls):
        results = []
        for call in tool_calls:
            tool = self._tools[call.name]
            result = await tool["func"](**call.arguments)
            results.append({
                "role": "tool",
                "content": str(result),
                "tool_call_id": call.id
            })
        return results

    @classmethod
    def default(cls):
        registry = cls()
        registry.register("web_search",   web_search,   web_search_schema)
        registry.register("code_execute", code_execute, code_execute_schema)
        registry.register("file_read",    file_read,    file_read_schema)
        return registry
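
A quick end-to-end check of the registry (the `add` tool and the `SimpleNamespace` stand-in for a provider's tool-call object are illustrative; the registry class is repeated so the snippet runs standalone):

```python
import asyncio
from types import SimpleNamespace

# Repeat of the registry above so this demo is self-contained.
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, func, schema):
        self._tools[name] = {"func": func, "schema": schema}

    async def execute(self, tool_calls):
        results = []
        for call in tool_calls:
            tool = self._tools[call.name]
            result = await tool["func"](**call.arguments)
            results.append({"role": "tool", "content": str(result),
                            "tool_call_id": call.id})
        return results

async def add(a: int, b: int) -> int:
    return a + b

registry = ToolRegistry()
registry.register("add", add, {"name": "add", "description": "Add two integers"})

# Simulate the tool-call object a provider would hand back.
call = SimpleNamespace(name="add", arguments={"a": 2, "b": 3}, id="call_1")
results = asyncio.run(registry.execute([call]))
print(results[0])  # {'role': 'tool', 'content': '5', 'tool_call_id': 'call_1'}
```

The `"role": "tool"` messages feed straight back into the conversation, which is what the `messages.extend(tool_results)` step in the agent loop relies on.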

Provider Configuration

providers = {
    "openai":  OpenAIProvider("gpt-4"),
    "ollama":  OllamaProvider("llama3"),
    "groq":    GroqProvider("llama3-70b-8192"),
    "gemini":  GeminiProvider("gemini-pro")
}

router = LLMRouter(providers)
tools   = ToolRegistry.default()
memory  = RedisMemory(url="redis://localhost:6379")

agent = Agent(router=router, tools=tools, memory=memory)

result = await agent.execute(
    "Analyze the performance bottlenecks in our API and suggest fixes"
)

Multi‑Model Routing for Cost Control

One of the biggest benefits of multi‑model routing is cost control. Below is a practical routing strategy:

Task Type           Provider          Cost per 1M tokens
Complex reasoning   OpenAI GPT-4      $30.00
Simple Q&A          Groq LLaMA 3      $0.59
Code generation     Ollama (local)    $0.00
Image analysis      Gemini Pro        $0.50

Result: By routing 70 % of requests to Groq/Ollama and reserving GPT‑4 for complex tasks, we reduced our monthly AI costs by 80 %.
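
A back-of-envelope check of that claim, assuming uniform token volume per request (the 20/40/40 split below is my illustration; actual savings depend on how tokens distribute across task types):

```python
# Blended cost per 1M tokens for a routing mix, using the table's prices.
PRICE_PER_1M = {"gpt-4": 30.00, "groq": 0.59, "ollama": 0.00, "gemini": 0.50}

def blended_cost(mix: dict[str, float]) -> float:
    """mix maps provider -> fraction of tokens routed there (fractions sum to 1)."""
    return sum(PRICE_PER_1M[p] * share for p, share in mix.items())

baseline = blended_cost({"gpt-4": 1.0})  # everything on GPT-4
routed = blended_cost({"gpt-4": 0.2, "groq": 0.4, "ollama": 0.4})
savings = 1 - routed / baseline
print(f"${routed:.2f} per 1M tokens, {savings:.0%} cheaper than GPT-4 only")
```

With this mix the blended cost lands around $6 per million tokens, roughly in line with the savings reported above.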

Lessons Learned

  • Provider abstraction pays off fast. When one API experiences an outage, the system keeps running.
  • Latency varies wildly. Groq averages ~200 ms vs. OpenAI’s 1–2 s, which makes a real difference for interactive applications.
  • Local models are underrated. Ollama with LLaMA 3 handles ~80 % of tasks without any external API calls.
  • Memory is the hard part. Deciding what to remember and what to forget matters more than which model you use.

Source Code & Community

The full source code is available on GitHub: ai-agent-framework

If you are building AI agents or working with multiple LLM providers, I’d love to hear about your approach. Drop a comment below or connect with me on GitHub.
