Eval workflow for agentic builders: fork any prompt through baseline vs scaffolded agents, blind third-party judge.
Source: Dev.to
What it is
Built an n8n eval workflow that A/B tests any prompt through plain GPT‑4o versus GPT‑4o plus a reasoning scaffold, judged by a blind Gemini evaluator. The evaluator is allowed to return “tie” and regularly does. The point is that you test on your own tasks and decide for yourself.
What it’s actually testing
- Whether the scaffolded agent engages the specific claims in your prompt or stays generic.
- How the scaffold affects sycophancy, depth, and diagnostic procedure.
- Whether different harness modes (reasoning, anti‑deception, memory, code) stress different task types; the mode is editable in the HTTP tool’s JSON body.
The diff is often subtle on easy prompts and more pronounced on dual‑load prompts (emotional + cognitive claims mixed), advice prompts with a buried false premise, or multi‑variable causal reasoning. Low‑complexity single‑turn tasks often produce ties because GPT‑4o handles them well without a scaffold.
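Since the mode lives in the HTTP tool’s JSON body, switching it is just editing one field. A minimal sketch of what that body might look like — the field names (`mode`, `prompt`) are illustrative assumptions, not the actual Ejentum API contract:

```typescript
// Hypothetical shape of the HTTP tool's JSON body. Field names are
// assumptions for illustration, not the real Ejentum API schema.
type HarnessMode = "reasoning" | "anti-deception" | "memory" | "code";

interface ScaffoldRequestBody {
  mode: HarnessMode;
  prompt: string;
}

// Build the body for the scaffolded (B) branch of the fork; default to
// the reasoning mode unless another harness mode is requested.
function buildScaffoldBody(
  prompt: string,
  mode: HarnessMode = "reasoning"
): ScaffoldRequestBody {
  return { mode, prompt };
}

const body = buildScaffoldBody("Why does my cron job fire twice?", "anti-deception");
console.log(JSON.stringify(body));
```

Swapping `"anti-deception"` for `"memory"` or `"code"` is the whole change needed to stress a different task type.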
Where you might apply this pattern
- Code review or diagnostic agents – test whether they catch the failure modes you actually care about.
- Content or research workflows – test whether they reduce generic output on your topics.
- Multi‑agent systems – wrap any single‑agent call in the fork to see the effect before integrating permanently.
- Prompt engineering A/B tests – measure the effect of a cognitive layer against your own prompt iterations.
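The fork itself is model-agnostic: run one prompt through two agents, then hand both outputs to a judge that never learns which agent produced which. A minimal sketch, assuming your agents and judge are injectable async functions (names here are hypothetical, not part of the workflow):

```typescript
// Generic A/B fork: run one prompt through a baseline and a scaffolded
// agent, then hand both outputs to a blind judge. All three callables
// are injected, so any single-agent call can be wrapped without change.
type Agent = (prompt: string) => Promise<string>;
type Judge = (prompt: string, a: string, b: string) => Promise<"A" | "B" | "tie">;

async function forkAndJudge(
  prompt: string,
  baseline: Agent,
  scaffolded: Agent,
  judge: Judge
) {
  // Run both branches in parallel on the same prompt.
  const [a, b] = await Promise.all([baseline(prompt), scaffolded(prompt)]);
  // The judge sees only the prompt and the two outputs, never which
  // agent produced which — that is what keeps the evaluation blind.
  const verdict = await judge(prompt, a, b);
  return { a, b, verdict };
}
```

Because the judge signature allows `"tie"`, low-complexity prompts that both agents handle well come back as ties rather than forced winners.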
Setup
1. Set three credentials:
   - OpenAI (both producer agents)
   - Google Gemini (blind evaluator)
   - Header Auth for the Ejentum API (free key at ejentum.com, 100 calls)
2. Paste a prompt into the n8n chat trigger.
3. Configure the workflow (JSON body for mode selection, etc.).
4. Run the workflow to obtain:
   - the A vs B output from one run, and
   - the blind evaluator’s verdict JSON from the same run.
Resources
Workflow JSON, READMEs, and a TypeScript port for IDE setups (Antigravity, Claude Code, Cursor).
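If you consume the verdict JSON downstream (e.g. to aggregate A/B/tie counts across runs), it is worth validating it on the way in. A sketch under an assumed schema — the real evaluator’s fields may differ, and `winner`/`rationale` are hypothetical names:

```typescript
// Hypothetical verdict schema; the actual evaluator JSON may differ.
interface Verdict {
  winner: "A" | "B" | "tie";
  rationale: string;
}

// Parse and validate the evaluator's raw JSON output, keeping "tie"
// as a first-class outcome rather than forcing a winner.
function parseVerdict(raw: string): Verdict {
  const obj = JSON.parse(raw);
  if (!["A", "B", "tie"].includes(obj.winner)) {
    throw new Error(`unexpected winner: ${obj.winner}`);
  }
  return { winner: obj.winner, rationale: String(obj.rationale ?? "") };
}
```

Rejecting unknown `winner` values early keeps a malformed judge response from silently skewing your tallies.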