pytest-aitest: Unit Tests Can't Test Your MCP Server. AI Can.

Published: February 12, 2026, 10:58 PM EST
6 min read
Source: Dev.to

I Learned This the Hard Way

I built two MCP servers — Excel MCP Server and Windows MCP Server. Both had solid test suites, but both broke the moment a real LLM tried to use them.

I spent weeks doing manual testing with GitHub Copilot: open a chat, type a prompt, watch the LLM pick the wrong tool, tweak the description, try again.
Sometimes the design was fundamentally broken, and I'd spend weeks on a wild‑goose chase before realizing the whole approach needed re‑thinking.

The failure modes were always the same:

  1. The LLM picks the wrong tool out of 15 similar‑sounding options.
  2. It passes {"account_id": "checking"} when the parameter is account.
  3. It ignores the system prompt entirely.
  4. It asks the user “Would you like me to do that?” instead of just doing it.
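Failure mode 2 deserves a closer look: the call is perfectly valid JSON, it just uses a key the schema never declared. A minimal sketch of that mismatch (the schema and payloads here are hypothetical, not pytest‑aitest API):

```python
# Hypothetical tool schema: what the server declares vs. what the LLM sends.
TRANSFER_SCHEMA = {"required": ["account", "amount"]}

def missing_keys(schema: dict, payload: dict) -> list[str]:
    """Return the required keys the payload failed to supply."""
    return [k for k in schema["required"] if k not in payload]

# The LLM guessed "account_id" instead of "account" -- valid JSON, wrong key.
bad_call = {"account_id": "checking", "amount": 200}
good_call = {"account": "checking", "amount": 200}

print(missing_keys(TRANSFER_SCHEMA, bad_call))   # ['account']
print(missing_keys(TRANSFER_SCHEMA, good_call))  # []
```

No type checker catches this, because the mismatch only exists between the schema text and the model's guess at runtime.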

Why? Because I was testing the code, not the AI interface.

For LLMs, your API isn’t just functions and types — it’s tool descriptions, parameter schemas, and system prompts. That’s what the model actually reads.

  • No compiler catches a bad tool description.
  • No unit test validates that an LLM will pick the right tool.
  • And if you also inject Agent Skills — do they actually help, or make things worse? Do LLMs really behave the way you think they will?

(No. They don’t.)

Introducing pytest‑aitest

Heavily inspired by agent‑benchmark by Dmytro Mykhaliev, pytest‑aitest is a pytest plugin that lets you write AI‑centric tests with zero new CLI or syntax. It works with your existing fixtures, markers, and CI/CD pipelines.

How it works

Your test is a prompt. Write what a user would say, let the LLM figure out how to use your tools, then assert on what happened.

# test_balance_query.py
from pytest_aitest import Agent, Provider, MCPServer

async def test_balance_query(aitest_run):
    agent = Agent(
        provider=Provider(model="azure/gpt-5-mini"),
        mcp_servers=[MCPServer(command=["python", "-m", "my_banking_server"])],
    )

    result = await aitest_run(agent, "What's my checking balance?")

    assert result.success
    assert result.tool_was_called("get_balance")

If this fails, the problem isn’t your code — it’s your tool description. The LLM couldn’t figure out which tool to call or what parameters to pass. Fix the description, run again. This is TDD for AI interfaces.

async def test_transfer(aitest_run, agent):
    # `agent` comes in as a fixture, configured as in the example above
    result = await aitest_run(agent, "Move $200 from checking to savings")
    assert result.tool_was_called("transfer")

Before vs. after a good description

# Before — too vague
@mcp.tool()
def transfer(from_acct: str, to_acct: str, amount: float) -> str:
    """Transfer money."""

# After — the LLM knows exactly what to do
@mcp.tool()
def transfer(from_account: str, to_account: str, amount: float) -> str:
    """Transfer money between accounts (checking, savings).
    Amount must be positive. Returns new balances for both accounts."""

Run again – the test now passes.

Automated Failure Analysis

pytest‑aitest doesn’t just give you a pass/fail. It spins up a second LLM that analyses every failure and tells you why it happened and how to improve. Traditional testing forces a human to interpret failures; here the AI does it for you.

  • The report tells you which model to deploy, why it wins, and what to fix.
  • It analyses cost‑efficiency, tool‑usage patterns, and prompt effectiveness across all configurations.
  • Unused tools? Flagged.
  • Prompts that cause permission‑seeking behavior? Explained.
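The idea behind the analysis pass is straightforward: hand the failed transcript to a second model along with a rubric. A rough sketch of what such a judge prompt might look like (illustrative only; pytest‑aitest's actual internals may differ):

```python
# Illustrative judge-LLM prompt template -- not pytest-aitest's real internals.
ANALYSIS_PROMPT = """You are reviewing a failed agent test.

User prompt: {user_prompt}
Tools offered: {tool_names}
Tool actually called: {tool_called}
Assertion that failed: {assertion}

Explain why the model likely chose this tool, and suggest a concrete
rewrite of the tool description that would steer it correctly."""

# Fill it with the transcript of a hypothetical failure.
prompt = ANALYSIS_PROMPT.format(
    user_prompt="What's my checking balance?",
    tool_names="get_balance, get_statement, get_rates",
    tool_called="get_statement",
    assertion='tool_was_called("get_balance")',
)
print(prompt.splitlines()[0])
```

The point is that the failure context a human would eyeball (prompt, tool list, actual call, failed assertion) is exactly what the second model needs to diagnose the description.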

See a full sample report → (link placeholder)

Comparing Configurations

You can test multiple configurations against the same suite:

MODELS = ["gpt-5-mini", "gpt-4.1"]
PROMPTS = {"brief": "Be concise.", "detailed": "Explain your reasoning."}

AGENTS = [
    Agent(
        name=f"{model}-{prompt_name}",
        provider=Provider(model=f"azure/{model}"),
        mcp_servers=[banking_server],
        system_prompt=prompt,
    )
    for model in MODELS
    for prompt_name, prompt in PROMPTS.items()
]

@pytest.mark.parametrize("agent", AGENTS, ids=lambda a: a.name)
async def test_balance_query(aitest_run, agent):
    result = await aitest_run(agent, "What's my checking balance?")
    assert result.success

Leaderboard (pass‑rate → cost)

Agent                 Pass Rate   Tokens   Cost
gpt-5-mini-brief      100 %       747      $0.002
gpt-4.1-brief         100 %       560      $0.008
gpt-5-mini-detailed   100 %       1,203    $0.004

Deploy: gpt-5-mini with the brief prompt – 100 % pass rate at the lowest cost.
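The “pass‑rate → cost” ordering is easy to state precisely: sort by pass rate descending, then cost ascending. A plain‑Python sketch using the sample numbers from the leaderboard above (not plugin code):

```python
# (name, pass_rate, cost_usd) -- the sample numbers from the leaderboard.
results = [
    ("gpt-5-mini-brief",    1.00, 0.002),
    ("gpt-4.1-brief",       1.00, 0.008),
    ("gpt-5-mini-detailed", 1.00, 0.004),
]

# Pass rate descending, then cost ascending.
ranked = sorted(results, key=lambda r: (-r[1], r[2]))
best = ranked[0][0]
print(best)  # gpt-5-mini-brief
```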

The same pattern works for:

  • A/B testing server versions (did your refactor break tool discoverability?)
  • Comparing system prompts
  • Measuring the impact of Agent Skills

Session‑Based Tests (conversational flow)

Real users don’t ask a single question; they have a conversation.

@pytest.mark.session("banking-chat")
class TestBankingConversation:
    async def test_check_balance(self, aitest_run, agent):
        result = await aitest_run(agent, "What's my checking balance?")
        assert result.success

    async def test_transfer(self, aitest_run, agent):
        # Agent remembers we were talking about checking
        result = await aitest_run(agent, "Transfer $200 to savings")
        assert result.tool_was_called("transfer")

    async def test_verify(self, aitest_run, agent):
        # Agent remembers the transfer
        result = await aitest_run(agent, "What are my new balances?")
        assert result.success

Tests share conversation history, and the report shows the full session flow with sequence diagrams.
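The mechanics behind shared history are easy to picture: each test in a @pytest.mark.session class appends its turns to one running transcript instead of starting fresh. A toy model of that accumulation (plain Python, not the plugin's implementation):

```python
# Toy model of a shared session: each "test" appends to one transcript.
session: list[dict] = []

def run_turn(user_msg: str, assistant_msg: str) -> list[dict]:
    """Append a user/assistant exchange and return the full history."""
    session.append({"role": "user", "content": user_msg})
    session.append({"role": "assistant", "content": assistant_msg})
    return session

run_turn("What's my checking balance?", "Your checking balance is $1,000.")
run_turn("Transfer $200 to savings", "Done.")

# A third turn sees both earlier exchanges, so "my new balances" is resolvable.
print(len(session))  # 4
```

Because later turns see the whole transcript, a follow‑up like “Transfer $200 to savings” can omit the source account and still resolve it.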

Who Benefits?

Audience                     Benefit
MCP server authors           Validate that LLMs can actually use your tools, not just that the code works
Agent builders               Find the cheapest model + prompt combo that passes your test suite
Teams shipping AI products   Gate deployments on LLM‑facing regression tests in CI/CD
Everyone                     Works with 100+ LLM providers via LiteLLM – Azure, OpenAI, Anthropic, Google, local models, etc.

TL;DR

The test is a prompt. The LLM is the test harness. The report tells you what to fix.

Traditional testing validates that your code works. pytest‑aitest validates that an LLM can understand and use your code. These are different things, and both matter.

pytest‑aitest

Test your AI interfaces. AI analyzes your results.

A pytest plugin for test‑driven development of MCP servers, tools, prompts, and skills. Write tests first. Let the AI analysis drive your design.

Why?

Your MCP server passes all unit tests. Then an LLM tries to use it and:

  • picks the wrong tool,
  • passes garbage parameters, or
  • ignores your system prompt.

Because you tested the code, not the AI interface.

For LLMs, your API is:

  • tool descriptions,
  • schemas, and
  • prompts

—not functions and types. No compiler catches a bad tool description. No linter flags a confusing schema. Traditional tests can’t validate them.

How It Works

Write tests as natural‑language prompts. An Agent bundles an LLM with your tools — you assert on what happened:

from pytest_aitest import Agent, Provider, MCPServer

async def test_balance_query(aitest_run):
    agent = Agent(
        provider=Provider(model="azure/gpt-5-mini"),
        # … configure your MCP servers / tools here …
    )
    # … write your natural‑language prompt and assertions here …

  • Documentation – Full guides and API reference
  • PyPI – uv add pytest-aitest
  • Sample Report – See AI analysis in action
  • GitHub – Star pytest‑aitest on GitHub

Contributions welcome! Open source and community‑driven.
