Eval workflow for agentic builders: fork any prompt through baseline vs scaffolded agents, blind third-party judge.
Source: Dev.to
What it is
Built an n8n eval workflow that A/B tests any prompt through plain GPT‑4o versus GPT‑4o plus a reasoning scaffold, judged by a blind Gemini evaluator. The evaluator is allowed to return “tie” and regularly does. The point is that you test on your own tasks and decide for yourself.
What it’s actually testing
- Whether the scaffolded agent engages the specific claims in your prompt or stays generic.
- How the scaffold affects sycophancy, depth, and diagnostic procedure.
- Whether different harness modes (reasoning, anti‑deception, memory, code) stress different task types; the mode is editable in the HTTP tool’s JSON body.
The diff is often subtle on easy prompts and more pronounced on dual‑load prompts (emotional + cognitive claims mixed), advice prompts with a buried false premise, or multi‑variable causal reasoning. Low‑complexity single‑turn tasks often produce ties because GPT‑4o handles them well without a scaffold.
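Since the mode lives in the HTTP tool’s JSON body, switching it is just editing one field. A minimal sketch of what that body might look like — the field names (`mode`, `prompt`) are illustrative assumptions, not the actual Ejentum API contract:

```typescript
// Hypothetical shape of the HTTP tool's JSON body. Field names are
// assumptions for illustration, not the real Ejentum API schema.
type HarnessMode = "reasoning" | "anti-deception" | "memory" | "code";

interface ScaffoldRequestBody {
  mode: HarnessMode;
  prompt: string;
}

// Build the body for the scaffolded (B) branch of the fork; default to
// the reasoning mode unless another harness mode is requested.
function buildScaffoldBody(
  prompt: string,
  mode: HarnessMode = "reasoning"
): ScaffoldRequestBody {
  return { mode, prompt };
}

const body = buildScaffoldBody("Why does my cron job fire twice?", "anti-deception");
console.log(JSON.stringify(body));
```

Swapping `"anti-deception"` for `"memory"` or `"code"` is the whole change needed to stress a different task type.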
Where you might apply this pattern
- Code review or diagnostic agents – test whether they catch the failure modes you actually care about.
- Content or research workflows – test whether they reduce generic output on your topics.
- Multi‑agent systems – wrap any single‑agent call in the fork to see the effect before integrating permanently.
- Prompt engineering A/B tests – measure the effect of a cognitive layer against your own prompt iterations.
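The fork itself is model-agnostic: run one prompt through two agents, then hand both outputs to a judge that never learns which agent produced which. A minimal sketch, assuming your agents and judge are injectable async functions (names here are hypothetical, not part of the workflow):

```typescript
// Generic A/B fork: run one prompt through a baseline and a scaffolded
// agent, then hand both outputs to a blind judge. All three callables
// are injected, so any single-agent call can be wrapped without change.
type Agent = (prompt: string) => Promise<string>;
type Judge = (prompt: string, a: string, b: string) => Promise<"A" | "B" | "tie">;

async function forkAndJudge(
  prompt: string,
  baseline: Agent,
  scaffolded: Agent,
  judge: Judge
) {
  // Run both branches in parallel on the same prompt.
  const [a, b] = await Promise.all([baseline(prompt), scaffolded(prompt)]);
  // The judge sees only the prompt and the two outputs, never which
  // agent produced which — that is what keeps the evaluation blind.
  const verdict = await judge(prompt, a, b);
  return { a, b, verdict };
}
```

Because the judge signature allows `"tie"`, low-complexity prompts that both agents handle well come back as ties rather than forced winners.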
Setup
1. Set three credentials:
   - OpenAI (both producer agents)
   - Google Gemini (blind evaluator)
   - Header Auth for the Ejentum API (free key at ejentum.com, 100 calls)
2. Paste a prompt into the n8n chat trigger.
3. Configure the workflow (JSON body for mode selection, etc.).
4. Run the workflow to obtain:
   - the A vs B output from one run, and
   - the blind evaluator’s verdict JSON from the same run.
Resources
Workflow JSON, READMEs, and a TypeScript port for IDE setups (Antigravity, Claude Code, Cursor).
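If you consume the verdict JSON downstream (e.g. to aggregate A/B/tie counts across runs), it is worth validating it on the way in. A sketch under an assumed schema — the real evaluator’s fields may differ, and `winner`/`rationale` are hypothetical names:

```typescript
// Hypothetical verdict schema; the actual evaluator JSON may differ.
interface Verdict {
  winner: "A" | "B" | "tie";
  rationale: string;
}

// Parse and validate the evaluator's raw JSON output, keeping "tie"
// as a first-class outcome rather than forcing a winner.
function parseVerdict(raw: string): Verdict {
  const obj = JSON.parse(raw);
  if (!["A", "B", "tie"].includes(obj.winner)) {
    throw new Error(`unexpected winner: ${obj.winner}`);
  }
  return { winner: obj.winner, rationale: String(obj.rationale ?? "") };
}
```

Rejecting unknown `winner` values early keeps a malformed judge response from silently skewing your tallies.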