Building Multi-Model AI Agents with OpenAI, Ollama, Groq and Gemini
Source: Dev.to
Introduction
Most AI applications today rely on a single LLM provider. That works fine until the API goes down, rate limits are hit, or your costs spiral out of control. A better approach is to build agents that can orchestrate multiple models and switch between them based on the task at hand.
In this article I will walk through how I built an AI‑agent framework that supports:
- OpenAI GPT‑4 – best reasoning and function‑calling
- Ollama – runs locally, so there are no network round-trips and no API costs
- Groq – sub‑200 ms inference for real‑time applications
- Google Gemini – excels at multimodal tasks (vision, audio, code)
By abstracting the provider layer, your agent can:
- Pick the right model for each sub‑task
- Fall back gracefully when one provider fails
- Optimize cost by routing simple tasks to cheaper models
Architecture Overview
The framework has four main components:
```
Agent Core ──► Planning ──► Tool Execution ──► Memory
     │             │              │               │
 LLM Router    Task Graph      Registry    Redis / PostgreSQL
     │
 OpenAI │ Ollama │ Groq │ Gemini
```
The LLM Router is the key piece. It decides which provider handles each request based on configurable rules.
1️⃣ Common Provider Interface
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class LLMResponse:
    content: str
    model: str
    provider: str
    tokens_used: int
    latency_ms: float
    tool_calls: list = field(default_factory=list)  # populated when the model requests tool use

    @property
    def has_tool_calls(self) -> bool:
        return bool(self.tool_calls)


class LLMProvider(ABC):
    @abstractmethod
    async def complete(self, messages: list, tools: list | None = None) -> LLMResponse:
        """Generate a completion for the given messages."""
        ...

    @abstractmethod
    async def embed(self, text: str) -> list[float]:
        """Return an embedding vector for the supplied text."""
        ...
```
All concrete providers implement this interface.
2️⃣ Provider Implementations
OpenAI
```python
import time

import openai


class OpenAIProvider(LLMProvider):
    def __init__(self, model: str = "gpt-4"):
        self.client = openai.AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        kwargs = {"model": self.model, "messages": messages}
        if tools:  # only pass tools when present; the API rejects an explicit null
            kwargs["tools"] = tools
        response = await self.client.chat.completions.create(**kwargs)
        latency = (time.monotonic() - start) * 1000
        message = response.choices[0].message
        return LLMResponse(
            content=message.content or "",  # content is None when the model only calls tools
            model=self.model,
            provider="openai",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency,
        )

    async def embed(self, text: str):
        # Example placeholder – replace with an actual embedding call
        raise NotImplementedError
```
Ollama
```python
import time

import ollama


class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3"):
        self.model = model
        self.client = ollama.AsyncClient()  # talks to the local Ollama server

    async def complete(self, messages, tools=None):
        # tools are ignored here; not all local models support tool calling
        start = time.monotonic()
        response = await self.client.chat(
            model=self.model,
            messages=messages,
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response["message"]["content"],
            model=self.model,
            provider="ollama",
            tokens_used=response.get("eval_count", 0),
            latency_ms=latency,
        )

    async def embed(self, text: str):
        raise NotImplementedError
```
Groq
```python
import time

from groq import AsyncGroq


class GroqProvider(LLMProvider):
    def __init__(self, model: str = "llama3-70b-8192"):
        self.client = AsyncGroq()  # reads GROQ_API_KEY from the environment
        self.model = model

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
        )
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.choices[0].message.content,
            model=self.model,
            provider="groq",
            tokens_used=response.usage.total_tokens,
            latency_ms=latency,
        )

    async def embed(self, text: str):
        raise NotImplementedError
```

Note the async client: the original synchronous `Groq()` call would block the event loop inside an `async def`.
Gemini
```python
import os
import time

import google.generativeai as genai


class GeminiProvider(LLMProvider):
    def __init__(self, model: str = "gemini-pro"):
        genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
        self.model_name = model
        self.model = genai.GenerativeModel(model)

    async def complete(self, messages, tools=None):
        start = time.monotonic()
        # For simplicity this sends only the latest message; a production version
        # would map the full conversation history onto Gemini's chat format.
        response = await self.model.generate_content_async(messages[-1]["content"])
        latency = (time.monotonic() - start) * 1000
        return LLMResponse(
            content=response.text,
            model=self.model_name,
            provider="gemini",
            tokens_used=0,  # usage metadata not surfaced in this sketch
            latency_ms=latency,
        )

    async def embed(self, text: str):
        raise NotImplementedError
```
3️⃣ LLM Router
```python
import logging

logger = logging.getLogger(__name__)


class LLMRouter:
    def __init__(self, providers: dict[str, LLMProvider]):
        """providers: mapping from provider name (e.g., "openai") to an LLMProvider instance."""
        self.providers = providers
        self.fallback_order = ["openai", "groq", "ollama", "gemini"]

    async def route(self, messages, task_type: str = "general", tools=None) -> LLMResponse:
        """Select a provider based on task_type and attempt completion with fallbacks."""
        primary = self._select_provider(task_type)
        for name in self._fallback_chain(primary):
            try:
                provider = self.providers[name]
                return await provider.complete(messages, tools)
            except Exception as e:
                logger.warning("%s failed: %s; trying next provider", name, e)
                continue
        raise RuntimeError("All providers failed")

    def _select_provider(self, task_type: str) -> str:
        routing_rules = {
            "reasoning": "openai",
            "realtime": "groq",
            "local": "ollama",
            "vision": "gemini",
            "general": "openai",
        }
        return routing_rules.get(task_type, "openai")

    def _fallback_chain(self, primary: str) -> list[str]:
        """Primary provider first, then the rest of the fallback order."""
        chain = [primary]
        for name in self.fallback_order:
            if name != primary:
                chain.append(name)
        return chain
```
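The fallback behaviour is easy to verify with stubs. The self-contained sketch below uses a trimmed copy of the router logic above, with a provider that simulates an outage for the primary slot; the request transparently lands on the first healthy fallback.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class FailingProvider:
    """Simulates a provider outage."""
    async def complete(self, messages, tools=None):
        raise ConnectionError("simulated outage")


class StubProvider:
    """Always answers, tagged with its name."""
    def __init__(self, name):
        self.name = name

    async def complete(self, messages, tools=None):
        return f"answered by {self.name}"


class LLMRouter:  # trimmed copy of the router above, routing/fallback path only
    def __init__(self, providers):
        self.providers = providers
        self.fallback_order = ["openai", "groq", "ollama", "gemini"]

    async def route(self, messages, task_type="general", tools=None):
        for name in self._fallback_chain(self._select_provider(task_type)):
            try:
                return await self.providers[name].complete(messages, tools)
            except Exception as e:
                logger.warning("%s failed: %s; trying next provider", name, e)
        raise RuntimeError("All providers failed")

    def _select_provider(self, task_type):
        return {"reasoning": "openai", "realtime": "groq",
                "local": "ollama", "vision": "gemini"}.get(task_type, "openai")

    def _fallback_chain(self, primary):
        return [primary] + [n for n in self.fallback_order if n != primary]


router = LLMRouter({
    "openai": FailingProvider(),   # primary for "reasoning" – down
    "groq": StubProvider("groq"),  # first fallback – healthy
    "ollama": StubProvider("ollama"),
    "gemini": StubProvider("gemini"),
})
result = asyncio.run(router.route([{"role": "user", "content": "hi"}], task_type="reasoning"))
print(result)  # → answered by groq
```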
4️⃣ Agent Core
```python
class Agent:
    def __init__(self, router: LLMRouter, tools: "ToolRegistry", memory: "Memory"):
        self.router = router
        self.tools = tools
        self.memory = memory

    async def execute(self, task: str) -> str:
        """
        High-level entry point:
        1. Retrieve relevant context from memory.
        2. Build a planning graph.
        3. Use the router to get LLM responses.
        4. Invoke tools as needed.
        """
        # Placeholder – fill in with your planning / tool-execution logic
        raise NotImplementedError
```
(The implementation of ToolRegistry and Memory is omitted for brevity.)
🎯 Takeaways
- Abstraction – A thin, common interface lets you swap providers without touching the rest of the code.
- Routing + Fallback – Choose the best model for the job and automatically recover from outages.
- Cost & Latency Optimization – Route cheap, fast tasks to Ollama or Groq, and reserve GPT‑4 for heavy reasoning.
With this pattern you can build robust, multi-model agents that stay responsive and cost-effective even when a single provider experiences trouble. The sections below flesh out the execution flow, tool registry, and configuration.
Agent Execution Flow
Here is the full `Agent.execute` loop: it retrieves context, then alternates between LLM calls and tool execution until the model stops requesting tools.

```python
async def execute(self, task: str) -> str:
    context = await self.memory.get_relevant(task)
    messages = [
        {"role": "system", "content": self._build_system_prompt(context)},
        {"role": "user", "content": task},
    ]
    while True:
        response = await self.router.route(
            messages,
            task_type=self._classify_task(task),
            tools=self.tools.get_schemas(),
        )
        if not response.has_tool_calls:
            break
        tool_results = await self.tools.execute(response.tool_calls)
        messages.extend(tool_results)
    await self.memory.store(task, response.content)
    return response.content
```
Tool Registry
Tools give the agent the ability to interact with external systems:
```python
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name: str, func, schema: dict):
        self._tools[name] = {"func": func, "schema": schema}

    def get_schemas(self) -> list[dict]:
        """Schemas to advertise to the model (used by Agent.execute above)."""
        return [t["schema"] for t in self._tools.values()]

    async def execute(self, tool_calls):
        results = []
        for call in tool_calls:
            tool = self._tools[call.name]
            result = await tool["func"](**call.arguments)
            results.append({
                "role": "tool",
                "content": str(result),
                "tool_call_id": call.id,
            })
        return results

    @classmethod
    def default(cls):
        # web_search, code_execute, file_read and their schemas are assumed
        # to be defined elsewhere in the project
        registry = cls()
        registry.register("web_search", web_search, web_search_schema)
        registry.register("code_execute", code_execute, code_execute_schema)
        registry.register("file_read", file_read, file_read_schema)
        return registry
```
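To make the registration pattern concrete, here is a self-contained sketch with a hypothetical `get_time` tool: an async function plus the JSON-schema description the model sees. The registry is a trimmed copy of the one above so the snippet runs on its own.

```python
import asyncio
from types import SimpleNamespace


class ToolRegistry:  # trimmed copy of the registry above
    def __init__(self):
        self._tools = {}

    def register(self, name, func, schema):
        self._tools[name] = {"func": func, "schema": schema}

    async def execute(self, tool_calls):
        results = []
        for call in tool_calls:
            result = await self._tools[call.name]["func"](**call.arguments)
            results.append({"role": "tool", "content": str(result),
                            "tool_call_id": call.id})
        return results


# A hypothetical tool: async function + JSON-schema description for the model
async def get_time(timezone: str) -> str:
    return f"12:00 in {timezone}"  # stub; a real tool would look this up


get_time_schema = {
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Return the current time in a timezone",
        "parameters": {
            "type": "object",
            "properties": {"timezone": {"type": "string"}},
            "required": ["timezone"],
        },
    },
}

registry = ToolRegistry()
registry.register("get_time", get_time, get_time_schema)

# Simulate a tool call as the agent loop would receive it from the model
call = SimpleNamespace(name="get_time", arguments={"timezone": "UTC"}, id="call_1")
out = asyncio.run(registry.execute([call]))
print(out[0]["content"])  # → 12:00 in UTC
```

The `{"role": "tool", ...}` result dicts slot directly back into the message list, which is why `Agent.execute` can simply `messages.extend(tool_results)`.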
Provider Configuration
```python
import asyncio


async def main():
    providers = {
        "openai": OpenAIProvider("gpt-4"),
        "ollama": OllamaProvider("llama3"),
        "groq": GroqProvider("llama3-70b-8192"),
        "gemini": GeminiProvider("gemini-pro"),
    }
    router = LLMRouter(providers)
    tools = ToolRegistry.default()
    memory = RedisMemory(url="redis://localhost:6379")  # Memory implementation omitted above

    agent = Agent(router=router, tools=tools, memory=memory)
    result = await agent.execute(
        "Analyze the performance bottlenecks in our API and suggest fixes"
    )
    print(result)


asyncio.run(main())
```
Multi‑Model Routing for Cost Control
One of the biggest benefits of multi‑model routing is cost control. Below is a practical routing strategy:
| Task Type | Provider | Cost per 1 M tokens |
|---|---|---|
| Complex reasoning | OpenAI GPT‑4 | $30 |
| Simple Q&A | Groq LLaMA 3 | $0.59 |
| Code generation | Ollama (local) | $0 |
| Image analysis | Gemini Pro | $0.50 |
Result: By routing 70 % of requests to Groq/Ollama and reserving GPT‑4 for complex tasks, we reduced our monthly AI costs by 80 %.
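A quick back-of-envelope check shows how the blended cost falls, using the per-token prices from the table; the traffic mix below is an illustrative assumption, and the exact savings figure depends on your actual mix (heavier local routing pushes it toward the 80% reported above).

```python
# Cost per 1M tokens, from the table above
COST = {"gpt4": 30.00, "groq": 0.59, "ollama": 0.00}

# Illustrative traffic mix: 70% routed to Groq/Ollama, 30% kept on GPT-4
mix = {"gpt4": 0.30, "groq": 0.35, "ollama": 0.35}

blended = sum(COST[p] * share for p, share in mix.items())
baseline = COST["gpt4"]  # everything on GPT-4
savings = 1 - blended / baseline

print(f"blended: ${blended:.2f}/1M tokens, savings: {savings:.0%}")
# → blended: $9.21/1M tokens, savings: 69%
```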
Lessons Learned
- Provider abstraction pays off fast. When one API experiences an outage, the system keeps running.
- Latency varies wildly. Groq averages ~200 ms vs. OpenAI’s 1–2 s, which makes a real difference for interactive applications.
- Local models are underrated. Ollama with LLaMA 3 handles ~80 % of tasks without any external API calls.
- Memory is the hard part. Deciding what to remember and what to forget matters more than which model you use.
Source Code & Community
The full source code is available on GitHub: ai-agent-framework
If you are building AI agents or working with multiple LLM providers, I’d love to hear about your approach. Drop a comment below or connect with me on GitHub.