Agentic CI: How I Test and Gate AI Agents Before They Touch Real Users

Published: March 2, 2026 at 10:13 PM EST
4 min read
Source: Dev.to

The Problem

You wouldn’t merge a backend PR without unit tests. Yet, when it comes to AI agents, most teams still rely on “vibe checks.” They tweak a system prompt, run a few manual queries in a terminal, say “looks good to me,” and push to production.

When an agent only summarizes text, vibe checks may be acceptable. But once the agent has access to tools—executing database queries, issuing API refunds, or sending emails—a non‑deterministic vibe check becomes a disaster waiting to happen.

If you are building autonomous workflows, you must treat your agent like a microservice: it needs a contract, invariants, and a Continuous Integration (CI) pipeline that rigorously gates breaking changes. Below is the blueprint for Agentic CI.

The Scenario: The Automated Refund Agent

This agent processes incoming customer‑support tickets, extracts the user ID, calls a check_stripe_purchases tool, and evaluates the purchase against a policy injected into its system prompt (e.g., “Refunds only allowed within 14 days.”). It then outputs a strictly structured JSON response:

{
  "approved": true|false,
  "reason": "string"
}
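The invariants section below checks this output against a Pydantic schema. A minimal sketch of that model (the class name is my own; this assumes Pydantic v2) could look like:

```python
from pydantic import BaseModel


class RefundDecision(BaseModel):
    """Structured contract for the refund agent's output."""
    approved: bool
    reason: str


# Validate a raw agent response against the contract;
# a malformed response raises pydantic.ValidationError.
raw = '{"approved": false, "reason": "Purchase is 16 days old; policy allows 14."}'
decision = RefundDecision.model_validate_json(raw)
```

Parsing through the model gives you typed access (`decision.approved`) and a single, loud failure mode when the agent drifts off-schema.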

Why This Matters (The Breaking Change)

Without CI, a change can be merged that violates policy. Imagine a user requests a refund on day 16. The agent, prioritizing “customer happiness” over the 14‑day rule, hallucinates an exception and returns {"approved": true}. You have just shipped a prompt change that directly bleeds revenue.

How It Works: Contracts and Invariants

For the Refund Agent, the invariants are:

  • Schema Adherence – The output must be valid JSON matching the Pydantic schema.
  • Tool Execution – If the user asks about a refund, the agent must invoke check_stripe_purchases exactly once.
  • Logic Fences – A synthetic input representing a 15‑day‑old purchase must result in "approved": false.

The Code: Evaluation Harness and CI Pipeline

Pytest Harness (tests/test_refund_agent.py)

import pytest
import json
from src.agent import run_refund_agent  # Your agent execution function

# Define synthetic scenarios (example structure)
EVAL_SCENARIOS = [
    {
        "test_name": "refund_within_policy",
        "input": {"user_id": "123", "days_since_purchase": 10},
        "expected_tool_call": "check_stripe_purchases",
        "expected_approval": True,
    },
    {
        "test_name": "refund_outside_policy",
        "input": {"user_id": "456", "days_since_purchase": 15},
        "expected_tool_call": "check_stripe_purchases",
        "expected_approval": False,
    },
    # Add more scenarios as needed
]

@pytest.mark.parametrize("scenario", EVAL_SCENARIOS, ids=lambda x: x["test_name"])
def test_refund_agent_invariants(scenario):
    # Run the agent with the synthetic input
    result = run_refund_agent(scenario["input"])

    # Invariant 1: Valid JSON Schema
    try:
        parsed_output = json.loads(result.final_text)
    except json.JSONDecodeError:
        pytest.fail("Agent failed to return valid JSON.")

    assert "approved" in parsed_output, "Missing 'approved' key in schema."
    assert "reason" in parsed_output, "Missing 'reason' key in schema."

    # Invariant 2: Correct Tool Usage
    executed_tools = [tool.name for tool in result.tool_history]
    assert (
        scenario["expected_tool_call"] in executed_tools
    ), "Agent failed to verify purchase history."

    # Invariant 3: Business Logic Fence
    assert (
        parsed_output["approved"] == scenario["expected_approval"]
    ), (
        f"Agent bypassed policy fence. Expected approval: "
        f"{scenario['expected_approval']}"
    )

The harness runs the real model against hard‑coded synthetic data, checking schema validity, tool usage, and business‑logic constraints.
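To gate merges on this harness, a minimal GitHub Actions workflow might look like the following (workflow name, secret names, and paths are illustrative assumptions, not from the original post):

```yaml
name: agentic-ci
on:
  pull_request:
  push:
    branches: [main]

jobs:
  agent-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - name: Run agent invariant tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: pytest tests/test_refund_agent.py -v
```

A failing invariant fails the job, which blocks the merge the same way a failing unit test would.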

Pitfalls and Gotchas

  • The CI Token Bill – Running 50 complex agent‑evaluation tests on Claude 3.5 Sonnet or GPT‑4o for every commit can explode your API costs.
    Fix: Use smaller, cheaper models (e.g., Claude Haiku or Gemini Flash) for routine PR checks, and reserve the expensive models for the final merge to main.

  • Flaky Tests – LLMs occasionally hallucinate a structural error, causing flaky CI pipelines.
    Fix: Implement a retry decorator in the Pytest harness. Retry up to three times before marking the test as failed; persistent failures indicate a non‑resilient prompt.

  • Testing Live Tools – Never let CI invoke real external APIs (emails, payments, etc.).
    Fix: Inject mock outputs for all tools during CI runs. Test the agent’s decision‑making logic, not the external service’s availability.
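The retry fix above can be sketched as a small decorator (hand-rolled here for clarity; the pytest-rerunfailures plugin's `@pytest.mark.flaky(reruns=3)` is an off-the-shelf alternative):

```python
import functools


def retry_on_failure(max_attempts=3):
    """Re-run a flaky agent test a few times before declaring failure.

    Failures across ALL attempts indicate a non-resilient prompt
    rather than ordinary LLM jitter.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except AssertionError as exc:
                    last_error = exc
            raise last_error  # all attempts exhausted
        return wrapper
    return decorator


# Usage on the harness test:
# @retry_on_failure(max_attempts=3)
# def test_refund_agent_invariants(scenario): ...
```

Catching only `AssertionError` is deliberate: schema hallucinations surface as assertion failures, while real bugs (exceptions in your own code) still fail immediately.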

What to Try Next

  • LLM‑as‑a‑Judge – For qualitative checks (e.g., “Was the tone polite?”), add a step that prompts a cheap LLM to grade the agent’s output, asserting that politeness_score > 8/10.

  • Regression Test Sets – Continuously collect the weirdest production edge cases into an eval_dataset.json file. Pipe this dataset into the Pytest harness so the agent is always tested against tickets that previously broke it.

  • Prompt Sandboxing – Store system prompts in separate .md or .txt files instead of hard‑coding them in Python. This lets the CI pipeline track diffs on prompt phrasing, making debugging far easier when a test suddenly fails.
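The regression-set idea plugs into the existing harness with a few lines. A sketch, assuming `eval_dataset.json` holds records in the same shape as `EVAL_SCENARIOS`:

```python
import json
from pathlib import Path

# Production edge cases collected over time (path follows the text above)
DATASET = Path("tests/eval_dataset.json")


def load_regression_scenarios():
    """Load collected production edge cases; empty list if none exist yet."""
    if not DATASET.exists():
        return []
    return json.loads(DATASET.read_text())


# Merge hand-written scenarios with collected regressions so every ticket
# that once broke the agent stays in the gate forever:
# ALL_SCENARIOS = EVAL_SCENARIOS + load_regression_scenarios()
# @pytest.mark.parametrize("scenario", ALL_SCENARIOS, ids=lambda x: x["test_name"])
```

Returning an empty list when the file is missing keeps the suite green on fresh checkouts while still picking up regressions as soon as they are recorded.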
