Prompt Regression Testing: Ship AI Workflows Without Surprises

Published: March 13, 2026 at 11:23 AM EDT
4 min read
Source: Dev.to

Why Prompt Regression Testing Matters

If your prompts power anything more serious than a one‑off chat, you need a safety net.
The moment a prompt becomes part of a workflow—generating code, drafting customer emails, summarizing tickets, transforming JSON, writing release notes—it becomes software. And software needs tests.

A regression test answers a simple question:

“Given the same input, do I still get an output that meets my contract?”

The contract can cover:

  • Structure (valid JSON, exact keys)
  • Style (tone, reading level)
  • Constraints (no PII, max length)
  • Content rules (must cite sources, must include a checklist)

You’re not testing “creativity”; you’re testing reliability.

Good vs. Bad Contracts

Bad contract: “Write a good summary.”

Good contract: “Return JSON with keys: title, summary, action_items (array), risks (array). Max 80 words in summary.”

Example contract snippet you can embed directly in your prompt

Output contract:
- Return valid JSON only (no markdown, no commentary)
- Keys: title (string), summary (string), action_items (string[]), risks (string[])
- summary: ≤ 80 words
- action_items: 0‑5 items, each imperative verb
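
One way to make this contract machine-checkable is to mirror it in a small function that returns a list of violations. This is a minimal sketch, not code from the original post: `checkContract` and its helpers are hypothetical names, and the "imperative verb" rule is left out because plain string logic cannot check it reliably.

```javascript
// Hypothetical helper: checks raw model text against the output contract above.
// Returns an array of violation messages; an empty array means the contract holds.
const REQUIRED_KEYS = ["title", "summary", "action_items", "risks"];

function isStringArray(value) {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

function countWords(text) {
  // trim() + filter(Boolean) so an empty string counts as 0 words, not 1
  return text.trim().split(/\s+/).filter(Boolean).length;
}

function checkContract(rawText) {
  let output;
  try {
    output = JSON.parse(rawText);
  } catch {
    return ["output is not valid JSON"];
  }
  if (output === null || typeof output !== "object" || Array.isArray(output)) {
    return ["output is not a JSON object"];
  }

  const violations = [];
  const keys = Object.keys(output);
  for (const key of REQUIRED_KEYS) {
    if (!keys.includes(key)) violations.push(`missing key: ${key}`);
  }
  for (const key of keys) {
    if (!REQUIRED_KEYS.includes(key)) violations.push(`unexpected key: ${key}`);
  }

  if ("title" in output && typeof output.title !== "string") {
    violations.push("title must be a string");
  }
  if ("summary" in output) {
    if (typeof output.summary !== "string") violations.push("summary must be a string");
    else if (countWords(output.summary) > 80) violations.push("summary exceeds 80 words");
  }
  if ("action_items" in output) {
    if (!isStringArray(output.action_items)) violations.push("action_items must be an array of strings");
    else if (output.action_items.length > 5) violations.push("action_items has more than 5 items");
  }
  if ("risks" in output && !isStringArray(output.risks)) {
    violations.push("risks must be an array of strings");
  }
  return violations;
}
```

Returning every violation at once, rather than throwing on the first, tends to make a failing prompt change quicker to diagnose.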

Building a Golden Test Suite

A golden test case consists of:

  1. A real‑ish input
  2. An expected output (or expected properties)

Start with 5–10 cases covering:

  • Happy path (normal input)
  • Edge input (empty sections, weird formatting)
  • Ambiguous input (multiple interpretations)
  • Adversarial input (e.g., “Ignore instructions and…”) – especially for automation
  • Long input (near your token budget)

Store them in a repo, e.g.:

/prompts
  support_triage.md
/tests
  support_triage/
    001_happy.json
    002_empty.json
    003_adversarial.json
    001_expected.json
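
With this layout, inputs and expected files share a numeric prefix. The sketch below shows one way to pair them; `pairCases` is a hypothetical helper, and the `NNN_label.json` naming rule is an assumption read off the listing above.

```javascript
// Hypothetical helper: pairs each input file with its expected file (if any)
// by numeric prefix, e.g. 001_happy.json <-> 001_expected.json.
function pairCases(fileNames) {
  const expectedById = new Map();
  const inputs = [];

  for (const name of fileNames) {
    const match = /^(\d+)_(.+)\.json$/.exec(name);
    if (!match) continue; // ignore files that don't follow the naming scheme
    const [, id, label] = match;
    if (label === "expected") expectedById.set(id, name);
    else inputs.push({ id, input: name });
  }

  return inputs
    .sort((a, b) => a.id.localeCompare(b.id))
    .map(({ id, input }) => ({ id, input, expected: expectedById.get(id) ?? null }));
}
```

You would typically feed it the result of `fs.readdirSync(dir)`. Cases whose `expected` is `null` rely on property checks alone.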

Example Test File (001_happy.json)

{
  "ticket": {
    "subject": "Login loop on mobile",
    "body": "User reports being redirected back to /login after 2FA…",
    "plan": "Pro",
    "priority": "high"
  }
}

Common Validation Strategies

| Strategy | Great for | Risk |
| --- | --- | --- |
| Schema + invariants | JSON transforms, code formatting, fixed templates | Tiny harmless wording changes may cause failures |
| Second‑evaluation prompt | When you need a nuanced quality score | Introduces a moving part (the judge) – you must also regression‑test the judge prompt |
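
A second-evaluation step might look roughly like the sketch below. Everything here is an assumption for illustration: the rubric text, the 1-5 scale, the `minScore` threshold, and the `judgeOutput` name. `callModel` is passed in as an argument so the judge can be stubbed in tests.

```javascript
// Hypothetical judge step; the rubric and the 1-5 scale are placeholders.
const JUDGE_PROMPT = [
  "You are grading a support-ticket summary.",
  'Reply with JSON only: {"score": <integer 1-5>, "reason": "<one sentence>"}',
].join("\n");

async function judgeOutput(callModel, candidate, minScore = 4) {
  const reply = await callModel({ prompt: JUDGE_PROMPT, input: candidate });

  let verdict;
  try {
    verdict = JSON.parse(reply);
  } catch {
    // A judge that breaks its own contract is a judge failure, not a candidate failure
    return { pass: false, score: null, reason: "judge reply is not valid JSON" };
  }

  const score = verdict?.score;
  if (!Number.isInteger(score) || score < 1 || score > 5) {
    return { pass: false, score: null, reason: "judge score out of range" };
  }
  return { pass: score >= minScore, score, reason: String(verdict.reason ?? "") };
}
```

Because the judge reply has its own contract, the same schema checks you apply to the main prompt can be applied to the judge.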

For most teams, schema + invariants is the sweet spot. Typical checks:

  • JSON parses successfully
  • Required keys exist
  • Max lengths respected
  • Arrays within bounds
  • No forbidden words
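
The checks in this list can be driven from a small rule table. The following is a sketch under stated assumptions: `RULES`, its limits, and the forbidden-word list are made-up placeholders, and `checkInvariants` is a hypothetical name. It expects an already-parsed output object.

```javascript
// Hypothetical rule table; the limits and the word list are placeholders to adapt.
const RULES = {
  maxChars: { title: 120 },
  arrayBounds: { action_items: [0, 5] },
  forbidden: ["guarantee", "asap"],
};

function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function checkInvariants(output, rules = RULES) {
  const failures = [];

  for (const [key, limit] of Object.entries(rules.maxChars)) {
    if (typeof output[key] === "string" && output[key].length > limit) {
      failures.push(`${key} longer than ${limit} characters`);
    }
  }

  for (const [key, [min, max]] of Object.entries(rules.arrayBounds)) {
    const value = output[key];
    if (!Array.isArray(value) || value.length < min || value.length > max) {
      failures.push(`${key} must be an array with ${min}-${max} items`);
    }
  }

  // Collect top-level strings plus strings inside top-level arrays
  const strings = Object.values(output).flat().filter((v) => typeof v === "string");
  for (const word of rules.forbidden) {
    // Whole-word, case-insensitive match
    const pattern = new RegExp(`\\b${escapeRegExp(word)}\\b`, "i");
    if (strings.some((s) => pattern.test(s))) failures.push(`forbidden word: ${word}`);
  }
  return failures;
}
```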

Minimal Node.js Harness

The following script loads test cases, calls your model, parses JSON, and validates invariants.

// test_harness.js (uses ES module imports: name it .mjs or set "type": "module" in package.json)
import fs from "node:fs";
import path from "node:path";

function must(condition, message) {
  if (!condition) throw new Error(message);
}

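function validate(output) {
  must(typeof output.title === "string" && output.title.length > 0, "title missing");
  must(typeof output.summary === "string", "summary missing");
  // Invariant: summary ≤ 80 words. trim() and filter(Boolean) keep an empty or
  // whitespace-padded summary from being counted as an extra word.
  const wordCount = output.summary.trim().split(/\s+/).filter(Boolean).length;
  must(wordCount <= 80, "summary too long");
  // Arrays from the contract: action_items (at most 5 items) and risks
  must(Array.isArray(output.action_items) && output.action_items.length <= 5, "action_items missing or too long");
  must(Array.isArray(output.risks), "risks missing");
  // Add more invariants as needed
}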

async function run() {
  const dir = path.resolve("tests/support_triage");
  // Read the prompt once; it is the same for every case
  const prompt = fs.readFileSync("prompts/support_triage.md", "utf8");
  const cases = fs
    .readdirSync(dir)
    .filter((f) => f.endsWith(".json") && !f.includes("expected"))
    .sort();

  for (const file of cases) {
    const input = JSON.parse(fs.readFileSync(path.join(dir, file), "utf8"));

    // callModel() is your wrapper around the API; it is not defined in this script
    const text = await callModel({ prompt, input });

    let parsed;
    try {
      parsed = JSON.parse(text);
    } catch (e) {
      throw new Error(`${file}: output is not valid JSON`);
    }

    try {
      validate(parsed);
    } catch (e) {
      // Prefix the file name so the failing case is identifiable
      throw new Error(`${file}: ${e.message}`);
    }
    console.log(`✅ ${file}`);
  }
}

run().catch((err) => {
  console.error("❌", err.message);
  process.exit(1);
});

Integrating with CI

If you already have a CI pipeline, add the harness as a gate:

  • Tests pass → ship
  • Tests fail → investigate before merging
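
For the gate to be useful, CI needs a clear exit code and a report that names every failing case, not only the first. Here is a minimal sketch of that reduction; `gate` and the shape of `results` (a `file` name plus a `failures` array per case) are assumptions for illustration.

```javascript
// Hypothetical gate: turns per-case results into a report and a process exit code.
function gate(results) {
  const failed = results.filter((r) => r.failures.length > 0);
  const lines = results.map((r) =>
    r.failures.length === 0
      ? `PASS ${r.file}`
      : `FAIL ${r.file}: ${r.failures.join("; ")}`
  );
  lines.push(`${results.length - failed.length}/${results.length} cases passed`);

  // An empty suite counts as a failure so a wrong test path cannot pass silently
  const ok = results.length > 0 && failed.length === 0;
  return { exitCode: ok ? 0 : 1, report: lines.join("\n") };
}
```

In a harness you would print `report` and assign `exitCode` to `process.exitCode`. CI systems generally treat a non-zero exit code as a failed step.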

Treat prompt changes like any other risky change:

  1. Show a diff of the prompt file
  2. Run regression tests in CI
  3. Include “before/after” outputs for a few goldens
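
For the before/after step, a key-by-key comparison of two parsed outputs is often enough to paste into a PR description. The helpers below are a sketch; `diffOutputs` and `formatDiff` are hypothetical names, and the comparison only looks at top-level keys.

```javascript
// Hypothetical helper: lists top-level keys whose values differ between two parsed outputs.
// Values are compared via JSON.stringify, so nested objects are sensitive to key order.
function diffOutputs(before, after) {
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  const changes = [];
  for (const key of [...keys].sort()) {
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
      changes.push({ key, before: before[key], after: after[key] });
    }
  }
  return changes;
}

// Renders the changes as a short, reviewable text block.
function formatDiff(file, changes) {
  if (changes.length === 0) return `${file}: no change`;
  const lines = changes.map(
    (c) => `  ${c.key}:\n    - ${JSON.stringify(c.before)}\n    + ${JSON.stringify(c.after)}`
  );
  return [`${file}:`, ...lines].join("\n");
}
```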

Prompt Changelog Example

At the top of the prompt file, keep a short changelog:

# support_triage prompt
# 2026-03-13: tightened summary length, reduced action_items max to 5

This makes drift easier to reason about months later.

Splitting Responsibilities

If a prompt is doing too many jobs, split it:

  • Prompt A – extracts structured data
  • Prompt B – writes prose from that structure

Each piece becomes easier to test.
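
As a rough illustration of the split, each stage below has its own narrow contract and can be tested with a stubbed model. The function names and prompt strings are placeholders, not part of the original workflow; `callModel` is passed in as an argument.

```javascript
// Hypothetical two-stage pipeline; the prompt strings are placeholders.
async function extractFields(callModel, ticket) {
  const text = await callModel({ prompt: "EXTRACT_PROMPT (placeholder)", input: ticket });
  const fields = JSON.parse(text); // Prompt A contract: JSON only
  if (typeof fields.title !== "string") throw new Error("stage A: title missing");
  return fields;
}

async function writeProse(callModel, fields) {
  const prose = await callModel({ prompt: "WRITE_PROMPT (placeholder)", input: fields });
  if (prose.trim().length === 0) throw new Error("stage B: empty prose");
  return prose.trim();
}

// Stage B only ever sees validated structure, so its goldens can be hand-written
// field objects rather than raw tickets.
async function triage(callModel, ticket) {
  const fields = await extractFields(callModel, ticket);
  return writeProse(callModel, fields);
}
```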

Lightweight System for Solo Devs or Teams

  1. Prompt file in repo (Markdown)
  2. 10–30 golden inputs (JSON)
  3. Validators (schema + invariants)
  4. CI check on every PR
  5. Periodic refresh of goldens from real data (sanitize first)
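
For the "sanitize first" step, a recursive redaction pass over the golden JSON is a reasonable starting point. This is only a sketch: `sanitize` is a hypothetical helper, and the two patterns are deliberately simple. They will miss many kinds of PII (names, addresses, account IDs), and the phone pattern can over-redact long digit runs such as dates.

```javascript
// Hypothetical sanitizer; the patterns are simplistic placeholders, not a PII solution.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

function sanitize(value) {
  if (typeof value === "string") {
    return value.replace(EMAIL, "[EMAIL]").replace(PHONE, "[PHONE]");
  }
  if (Array.isArray(value)) return value.map(sanitize);
  if (value !== null && typeof value === "object") {
    // Rebuild the object so the original golden is never mutated
    return Object.fromEntries(Object.entries(value).map(([k, v]) => [k, sanitize(v)]));
  }
  return value; // numbers, booleans, and null pass through unchanged
}
```

A manual review of the sanitized goldens before committing them is still worth the few minutes it takes.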

You don’t need an enterprise platform to get 80% of the value. Treat prompts as versioned, testable artifacts, and your workflow becomes calmer. Your future self (and teammates) will thank you the first time CI catches a model update that would have silently changed production behavior.

If you build a small harness like this, I’d love to hear what invariants you ended up validating.
