Prompt Regression Testing: Ship AI Workflows Without Surprises
Source: Dev.to
Why Prompt Regression Testing Matters
If your prompts power anything more serious than a one‑off chat, you need a safety net.
The moment a prompt becomes part of a workflow—generating code, drafting customer emails, summarizing tickets, transforming JSON, writing release notes—it becomes software. And software needs tests.
A regression test answers a simple question:
“Given the same input, do I still get an output that meets my contract?”
The contract can cover:
- Structure (valid JSON, exact keys)
- Style (tone, reading level)
- Constraints (no PII, max length)
- Content rules (must cite sources, must include a checklist)
You’re not testing “creativity”; you’re testing reliability.
Good vs. Bad Contracts
Bad contract: “Write a good summary.”
Good contract: “Return JSON with keys: title, summary, action_items (array), risks (array). Max 80 words in summary.”
Example contract snippet you can embed directly in your prompt:
Output contract:
- Return valid JSON only (no markdown, no commentary)
- Keys: title (string), summary (string), action_items (string[]), risks (string[])
- summary: ≤ 80 words
- action_items: 0-5 items, each starting with an imperative verb
Building a Golden Test Suite
A golden test case consists of:
- A real‑ish input
- An expected output (or expected properties)
Start with 5–10 cases covering:
- Happy path (normal input)
- Edge input (empty sections, weird formatting)
- Ambiguous input (multiple interpretations)
- Adversarial input (e.g., “Ignore instructions and…”) – especially for automation
- Long input (near your token budget)
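Where an exact expected output is too brittle, the "expected properties" option can be written as a small declarative object stored next to the input. The sketch below shows one possible shape. The field names (`requiredKeys`, `mustMention`, `maxActionItems`) and the `checkExpected` helper are assumptions made for illustration, not a standard format.

```javascript
// Hypothetical "expected properties" for one golden case.
// These field names are illustrative assumptions, not a standard format.
const expected = {
  requiredKeys: ["title", "summary", "action_items", "risks"],
  mustMention: ["login"], // substrings the summary should contain (case-insensitive)
  maxActionItems: 5,
};

// Returns a list of human-readable failures; an empty list means the case passed.
function checkExpected(output, expected) {
  if (output === null || typeof output !== "object") {
    return ["output is not an object"];
  }
  const failures = [];
  for (const key of expected.requiredKeys ?? []) {
    if (!(key in output)) failures.push(`missing key: ${key}`);
  }
  const summary = String(output.summary ?? "").toLowerCase();
  for (const term of expected.mustMention ?? []) {
    if (!summary.includes(term.toLowerCase())) {
      failures.push(`summary does not mention "${term}"`);
    }
  }
  if (
    expected.maxActionItems !== undefined &&
    Array.isArray(output.action_items) &&
    output.action_items.length > expected.maxActionItems
  ) {
    failures.push(`too many action_items: ${output.action_items.length}`);
  }
  return failures;
}
```

Checking properties rather than exact text means harmless rewording does not fail the case, while a missing key or a dropped topic still does.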
Store them in a repo, e.g.:
/prompts
  support_triage.md
/tests
  support_triage/
    001_happy.json
    002_empty.json
    003_adversarial.json
    001_expected.json
Example Test File (001_happy.json)
{
  "ticket": {
    "subject": "Login loop on mobile",
    "body": "User reports being redirected back to /login after 2FA…",
    "plan": "Pro",
    "priority": "high"
  }
}
Common Validation Strategies
| Strategy | Great for | Risk |
|---|---|---|
| Schema + invariants | JSON transforms, code formatting, fixed templates | Tiny harmless wording changes may cause failures |
| Second‑evaluation prompt | When you need a nuanced quality score | Introduces a moving part (the judge) – you must also regression‑test the judge prompt |
For most teams, schema + invariants is the sweet spot. Typical checks:
- JSON parses successfully
- Required keys exist
- Max lengths respected
- Arrays within bounds
- No forbidden words
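The checks in this list can be collected into one function that reports every violation instead of stopping at the first. This is a minimal sketch: the limits mirror the example contract earlier in the article, the `FORBIDDEN` word list is a placeholder assumption, and `checkInvariants` is a hypothetical helper name.

```javascript
// Placeholder list; replace with whatever your contract actually forbids.
// Matching is a simple case-insensitive substring search.
const FORBIDDEN = ["guarantee", "password"];

// Count words without being fooled by leading, trailing, or repeated whitespace.
function wordCount(text) {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// Takes the raw model text and returns an array of violations (empty = pass).
function checkInvariants(rawText) {
  let out;
  try {
    out = JSON.parse(rawText);
  } catch {
    return ["output is not valid JSON"];
  }
  if (out === null || typeof out !== "object" || Array.isArray(out)) {
    return ["output is not a JSON object"];
  }

  const problems = [];
  // Required keys exist and have the right type.
  for (const key of ["title", "summary"]) {
    if (typeof out[key] !== "string") problems.push(`${key} must be a string`);
  }
  for (const key of ["action_items", "risks"]) {
    if (!Array.isArray(out[key])) problems.push(`${key} must be an array`);
  }
  // Max lengths respected.
  if (typeof out.summary === "string" && wordCount(out.summary) > 80) {
    problems.push("summary exceeds 80 words");
  }
  // Arrays within bounds.
  if (Array.isArray(out.action_items) && out.action_items.length > 5) {
    problems.push("action_items has more than 5 items");
  }
  // No forbidden words anywhere in the raw output.
  const haystack = rawText.toLowerCase();
  for (const word of FORBIDDEN) {
    if (haystack.includes(word)) problems.push(`forbidden word: ${word}`);
  }
  return problems;
}
```

Returning a list rather than throwing lets a test report show all broken invariants for a case in one run, which tends to shorten the fix-and-rerun loop.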
Minimal Node.js Harness
The following script loads test cases, calls your model, parses JSON, and validates invariants.
// test_harness.js
// Uses ES module syntax: run with "type": "module" in package.json or rename to .mjs.
import fs from "node:fs";
import path from "node:path";
// callModel() is your wrapper around the API; this import path is a placeholder.
import { callModel } from "./call_model.js";

function must(condition, message) {
  if (!condition) throw new Error(message);
}

function validate(output) {
  must(typeof output.title === "string" && output.title.length > 0, "title missing");
  must(typeof output.summary === "string", "summary missing");
  // Example invariant: summary should be ≤ 80 words.
  // trim() + filter(Boolean) avoids counting empty strings as words.
  const words = output.summary.trim().split(/\s+/).filter(Boolean);
  must(words.length <= 80, "summary too long");
  // Add more invariants as needed
}

async function run() {
  const dir = path.resolve("tests/support_triage");
  // Read the prompt once instead of once per test case.
  const prompt = fs.readFileSync("prompts/support_triage.md", "utf8");
  const cases = fs
    .readdirSync(dir)
    .filter((f) => f.endsWith(".json") && !f.includes("expected"));

  for (const file of cases) {
    const input = JSON.parse(fs.readFileSync(path.join(dir, file), "utf8"));
    const text = await callModel({ prompt, input });

    let parsed;
    try {
      parsed = JSON.parse(text);
    } catch (e) {
      throw new Error(`${file}: output is not valid JSON`);
    }
    try {
      validate(parsed);
    } catch (e) {
      // Prefix the file name so the failing case is obvious in CI logs.
      throw new Error(`${file}: ${e.message}`);
    }
    console.log(`✅ ${file}`);
  }
}

run().catch((err) => {
  console.error("❌", err.message);
  process.exit(1);
});
Integrating with CI
If you already have a CI pipeline, add the harness as a gate:
- Tests pass → ship
- Tests fail → investigate before merging
Treat prompt changes like any other risky change:
- Show a diff of the prompt file
- Run regression tests in CI
- Include “before/after” outputs for a few goldens
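For the before/after step, a small comparison helper can make the review easier to scan. The sketch below assumes you have already saved the parsed outputs produced by the old and the new prompt for the same golden input. `diffOutputs` is a hypothetical helper name, not part of any library.

```javascript
// Compare two parsed outputs for the same golden input and list the top-level
// keys whose values changed. Values are compared by JSON serialization, which
// works for plain data but treats nested objects with reordered keys as different.
function diffOutputs(before, after) {
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  const changes = [];
  for (const key of [...keys].sort()) {
    const a = JSON.stringify(before[key]);
    const b = JSON.stringify(after[key]);
    if (a !== b) {
      changes.push({ key, before: before[key], after: after[key] });
    }
  }
  return changes;
}
```

Pasting the resulting list into the pull request description gives reviewers a compact view of what the prompt change actually did to a few goldens.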
Prompt Changelog Example
At the top of the prompt file, keep a short changelog:
# support_triage prompt
# 2026-03-13: tightened summary length, reduced action_items max to 5
This makes drift easier to reason about months later.
Splitting Responsibilities
If a prompt is doing too many jobs, split it:
- Prompt A – extracts structured data
- Prompt B – writes prose from that structure
Each piece becomes easier to test.
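A minimal sketch of that split, with each stage validated on its own. The two model calls are passed in as functions (`extractFn`, `writeFn`) so the example does not depend on any particular API. Those names, and the `issue` and `action_items` fields, are assumptions for illustration.

```javascript
// Stage A: turn raw ticket text into structured data, then validate the structure.
async function extractStage(ticketText, extractFn) {
  const raw = await extractFn(ticketText);
  const data = JSON.parse(raw); // throws if the extractor did not return JSON
  if (typeof data?.issue !== "string" || !Array.isArray(data?.action_items)) {
    throw new Error("extract stage: expected { issue: string, action_items: string[] }");
  }
  return data;
}

// Stage B: write prose from the validated structure, then validate the prose.
async function writeStage(data, writeFn) {
  const prose = await writeFn(data);
  if (typeof prose !== "string" || prose.trim().length === 0) {
    throw new Error("write stage: empty prose");
  }
  return prose.trim();
}

// Run both stages; a failure message names the stage that broke.
async function runPipeline(ticketText, extractFn, writeFn) {
  const data = await extractStage(ticketText, extractFn);
  return writeStage(data, writeFn);
}
```

Because each stage takes its model call as a parameter, a test can swap in a stub for one stage and exercise the other against its own goldens.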
Lightweight System for Solo Devs or Teams
- Prompt file in repo (Markdown)
- 10–30 golden inputs (JSON)
- Validators (schema + invariants)
- CI check on every PR
- Periodic refresh of goldens from real data (sanitize first)
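The "sanitize first" step can start as a few regular-expression replacements applied before a real ticket is saved as a golden. This is a rough sketch: the two patterns are simple assumptions that will miss some personal data and will also catch some harmless digit runs such as dates, so treat them as a starting point and review the result by hand.

```javascript
// Simplistic patterns; real PII detection needs more than regexes.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// Matches long digit runs with common separators; may over-match dates or IDs.
const PHONE = /\+?\d[\d\s().-]{7,}\d/g;

function sanitizeText(text) {
  return text.replace(EMAIL, "[email]").replace(PHONE, "[phone]");
}

// Recursively sanitize every string inside a JSON-like value.
function sanitize(value) {
  if (typeof value === "string") return sanitizeText(value);
  if (Array.isArray(value)) return value.map(sanitize);
  if (value !== null && typeof value === "object") {
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, sanitize(v)])
    );
  }
  return value; // numbers, booleans, null
}
```

Replacing matches with fixed placeholders such as `[email]` keeps the shape of the input intact, so the golden still exercises the same parts of the prompt.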
You don’t need an enterprise platform to get 80% of the value. Treat prompts as versioned, testable artifacts, and your workflow becomes calmer. Your future self (and teammates) will thank you the first time a model update would have silently changed production behavior and your CI catches it instead.
If you build a small harness like this, I’d love to hear what invariants you ended up validating.