Agentic CI: How I Test and Gate AI Agents Before They Touch Real Users
Source: Dev.to
The Problem
You wouldn’t merge a backend PR without unit tests. Yet, when it comes to AI agents, most teams still rely on “vibe checks.” They tweak a system prompt, run a few manual queries in a terminal, say “looks good to me,” and push to production.
When an agent only summarizes text, vibe checks may be acceptable. But once the agent has access to tools—executing database queries, issuing API refunds, or sending emails—a non‑deterministic vibe check becomes a disaster waiting to happen.
If you are building autonomous workflows, you must treat your agent like a microservice: it needs a contract, invariants, and a Continuous Integration (CI) pipeline that rigorously gates breaking changes. Below is the blueprint for Agentic CI.
The Scenario: The Automated Refund Agent
This agent processes incoming customer‑support tickets, extracts the user ID, calls a check_stripe_purchases tool, and evaluates the purchase against a policy injected into its system prompt (e.g., “Refunds only allowed within 14 days.”). It then outputs a strictly structured JSON response:
{
  "approved": true | false,
  "reason": "string"
}
Why This Matters (The Breaking Change)
Without CI, a change can be merged that violates policy. Imagine a user requests a refund on day 16. The agent, prioritizing “customer happiness” over the 14‑day rule, hallucinates an exception and returns {"approved": true}. You have just shipped a prompt change that directly bleeds revenue.
How It Works: Contracts and Invariants
For the Refund Agent, the invariants are:
- Schema Adherence – The output must be valid JSON matching the Pydantic schema.
- Tool Execution – If the user asks about a refund, the agent must invoke check_stripe_purchases exactly once.
- Logic Fences – A synthetic input representing a 15‑day‑old purchase must result in "approved": false.
The Code: Evaluation Harness and CI Pipeline
Pytest Harness (tests/test_refund_agent.py)
import pytest
import json

from src.agent import run_refund_agent  # Your agent execution function

# Define synthetic scenarios (example structure)
EVAL_SCENARIOS = [
    {
        "test_name": "refund_within_policy",
        "input": {"user_id": "123", "days_since_purchase": 10},
        "expected_tool_call": "check_stripe_purchases",
        "expected_approval": True,
    },
    {
        "test_name": "refund_outside_policy",
        "input": {"user_id": "456", "days_since_purchase": 15},
        "expected_tool_call": "check_stripe_purchases",
        "expected_approval": False,
    },
    # Add more scenarios as needed
]


@pytest.mark.parametrize("scenario", EVAL_SCENARIOS, ids=lambda x: x["test_name"])
def test_refund_agent_invariants(scenario):
    # Run the agent with the synthetic input
    result = run_refund_agent(scenario["input"])

    # Invariant 1: Valid JSON schema
    try:
        parsed_output = json.loads(result.final_text)
    except json.JSONDecodeError:
        pytest.fail("Agent failed to return valid JSON.")
    assert "approved" in parsed_output, "Missing 'approved' key in schema."
    assert "reason" in parsed_output, "Missing 'reason' key in schema."

    # Invariant 2: Correct tool usage
    executed_tools = [tool.name for tool in result.tool_history]
    assert scenario["expected_tool_call"] in executed_tools, (
        "Agent failed to verify purchase history."
    )

    # Invariant 3: Business logic fence
    assert parsed_output["approved"] == scenario["expected_approval"], (
        f"Agent bypassed policy fence. "
        f"Expected approval: {scenario['expected_approval']}"
    )
The harness runs the real model against hard‑coded synthetic data, checking schema validity, tool usage, and business‑logic constraints.
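In CI, "synthetic data" implies the Stripe tool itself is stubbed out rather than called for real. One way to sketch that, using the standard library's unittest.mock (the helper name make_mock_stripe_tool and the return shape are assumptions, not part of the original harness):

```python
from unittest.mock import MagicMock


def make_mock_stripe_tool(days_since_purchase: int) -> MagicMock:
    """Build a stand-in check_stripe_purchases that reports a fixed purchase age."""
    mock_tool = MagicMock(name="check_stripe_purchases")
    mock_tool.return_value = {
        "purchase_found": True,
        "days_since_purchase": days_since_purchase,
    }
    return mock_tool


# The harness would hand this to the agent instead of the real Stripe client,
# so CI exercises the decision logic without touching payments infrastructure.
tool = make_mock_stripe_tool(days_since_purchase=15)
```

How the mock is injected depends on your agent framework (a tool registry, dependency injection, or monkeypatching the import site), but the principle is the same: the model reasons over a fixed, known purchase age.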
Pitfalls and Gotchas
- The CI Token Bill – Running 50 complex agent‑evaluation tests on Claude 3.5 Sonnet or GPT‑4o for every commit can explode your API costs.
  Fix: Use smaller, cheaper models (e.g., Claude Haiku or Gemini Flash) for routine PR checks, and reserve the expensive models for the final merge to main.
- Flaky Tests – LLMs occasionally hallucinate a structural error, causing flaky CI pipelines.
  Fix: Implement a retry decorator in the Pytest harness. Retry up to three times before marking the test as failed; persistent failures indicate a non‑resilient prompt.
- Testing Live Tools – Never let CI invoke real external APIs (emails, payments, etc.).
  Fix: Inject mock outputs for all tools during CI runs. Test the agent’s decision‑making logic, not the external service’s availability.
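The retry decorator for flaky tests can be a few lines of plain Python. This is a minimal sketch (the name retry_flaky and the choice of which exceptions count as "transient" are assumptions you should tune to your harness):

```python
import functools
import time


def retry_flaky(attempts: int = 3, delay: float = 0.0):
    """Re-run a test body that can fail transiently before declaring failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except (AssertionError, ValueError) as exc:
                    last_error = exc
                    time.sleep(delay)
            # Persistent failure across all attempts: the prompt is not resilient.
            raise last_error
        return wrapper
    return decorator
```

Alternatively, the pytest-rerunfailures plugin provides the same behavior off the shelf via `@pytest.mark.flaky(reruns=3)`, without a hand-rolled decorator.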
What to Try Next
- LLM‑as‑a‑Judge – For qualitative checks (e.g., “Was the tone polite?”), add a step that prompts a cheap LLM to grade the agent’s output, asserting that politeness_score > 8/10.
- Regression Test Sets – Continuously collect the weirdest production edge cases into an eval_dataset.json file. Pipe this dataset into the Pytest harness so the agent is always tested against tickets that previously broke it.
- Prompt Sandboxing – Store system prompts in separate .md or .txt files instead of hard‑coding them in Python. This lets the CI pipeline track diffs on prompt phrasing, making debugging far easier when a test suddenly fails.
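Wiring a regression set into the harness can be as simple as a loader that enforces the same shape as EVAL_SCENARIOS. A sketch (the function name and the requirement that each entry carry a test_name are assumptions matching the harness above):

```python
import json
from pathlib import Path


def load_eval_dataset(path: str = "eval_dataset.json") -> list:
    """Load captured production edge cases in the same shape as EVAL_SCENARIOS."""
    scenarios = json.loads(Path(path).read_text())
    for scenario in scenarios:
        # Each entry needs a name so pytest can surface it as a readable test ID.
        if "test_name" not in scenario:
            raise ValueError(f"Scenario missing 'test_name': {scenario}")
    return scenarios
```

In the harness, you could then extend the hard-coded list with `EVAL_SCENARIOS + load_eval_dataset()` before parametrization, so every ticket that once broke the agent becomes a permanent gate.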