Stop 'Vibe Checking' Your AI. Use Snapshot Testing Instead.
Snapshot testing is a solved problem for UI components. Why aren’t we doing the same for AI?
Most of us are still “vibe checking”: manually running the prompt, reading the output, and saying, “Yeah, seems okay.”
I built a tool to fix this.
Introducing SafeStar
SafeStar is a zero‑dependency CLI tool that brings the “Snapshot & Diff” workflow to AI engineering. It works with Python, Node, curl, or anything else you can run from a shell, treating your AI as a black box and answering one question:
“Did the behavior change compared to last time?”
How it works
SafeStar follows a Git‑like workflow:
- Snapshot a baseline of “good” behavior.
- Run your current code.
- Diff the results to detect drift.
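Conceptually, the whole loop is small. Here is a rough Python sketch of the idea (my own illustration, not SafeStar's actual implementation): run the command several times, freeze the outputs as a baseline file, and flag large deviations on later runs.

import json
import pathlib
import statistics
import subprocess

def run(cmd, times=5):
    # One output string per run, to expose nondeterminism
    return [subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
            for _ in range(times)]

def baseline(cmd, path="baseline.json"):
    # "Freeze" the current behavior to disk
    pathlib.Path(path).write_text(json.dumps(run(cmd)))

def diff(cmd, path="baseline.json", tolerance=0.5):
    old = json.loads(pathlib.Path(path).read_text())
    new = run(cmd)
    old_avg = statistics.mean(len(o) for o in old)
    new_avg = statistics.mean(len(o) for o in new)
    drift = (new_avg - old_avg) / old_avg
    print(f"Avg length: {old_avg:.0f} -> {new_avg:.0f} ({drift:+.0%})")
    return abs(drift) <= tolerance  # False means the behavior drifted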
Quick Start
You can try SafeStar right now without changing your code.
1. Install
npm install --save-dev safestar
2. Define a Scenario
Create a file scenarios/refund.yaml. Tell SafeStar how to run your script using the exec key.
name: refund_bot
prompt: "I want a refund immediately."
# Your actual code command
exec: "python3 my_agent.py"
# Run it 5 times to catch randomness/instability
runs: 5
# Simple guardrails
checks:
  max_length: 200
  must_not_contain:
    - "I am just an AI"
3. Create a Baseline
Run it until you get an output you like, then “freeze” it:
npx safestar baseline refund_bot
4. Check for Drift in CI
Whenever you change your prompt or model, run:
npx safestar diff scenarios/refund.yaml
If your model drifts, SafeStar alerts you:
--- SAFESTAR REPORT ---
Status: FAIL
Metrics:
  Avg Length: 45 chars -> 120 chars
  Drift: +166% vs baseline (WARNING)
  Variance: 0.2 -> 9.8 (High instability)
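For context on those numbers: assuming the length stats are taken over the 5 runs of the scenario, they reduce to plain descriptive statistics, roughly:

import statistics

# Hypothetical per-run output lengths, chosen to mirror the report above
baseline_lengths = [44, 46, 45, 45, 45]
current_lengths = [118, 125, 116, 122, 119]

old_avg = statistics.mean(baseline_lengths)    # 45
new_avg = statistics.mean(current_lengths)     # 120
drift = (new_avg - old_avg) / old_avg          # reported as a percentage
spread = statistics.variance(current_lengths)  # high = unstable outputs
print(f"Drift: {drift:+.0%}, Variance: {spread:.1f}")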
Why I built this
I was tired of complex evaluation dashboards that give a “correctness score” of 87/100. I don’t care about the score; I care about regressions. If my bot was working yesterday, I just want to know if it is different today.
SafeStar is open source, local‑first, and fits right into GitHub Actions.
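For example, a minimal workflow could look like this (the file name, action versions, and trigger are my choices, not part of SafeStar):

# .github/workflows/safestar.yml (hypothetical setup)
name: safestar-drift-check
on: [pull_request]
jobs:
  drift:
    runs-on: ubuntu-latest   # ubuntu runners ship with python3 for the exec step
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install --save-dev safestar
      - run: npx safestar diff scenarios/refund.yaml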
Links
- NPM:
- GitHub:
- Full blog post:
Let me know if you find it useful!