Eval workflow for agentic builders: fork any prompt through baseline vs scaffolded agents, blind third-party judge.

Published: April 22, 2026 at 03:17 AM EDT
2 min read
Source: Dev.to

What it is

I built an n8n eval workflow that A/B tests any prompt through plain GPT‑4o versus GPT‑4o plus a reasoning scaffold, with a blind Gemini evaluator judging the two outputs. The evaluator is allowed to return “tie” and regularly does. The point is that you test on your own tasks and decide for yourself.
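One common way to keep a judge blind is to randomize which output it sees as A and which as B, then map the verdict back afterwards. The post does not say how the workflow handles this, so the following is only a minimal sketch of the idea. All names (`blindPair`, `unblind`, the `key` field) are hypothetical and are not taken from the workflow.

```typescript
// Hypothetical names throughout; this is not the workflow's actual code.
type Label = "A" | "B";
type Verdict = Label | "tie";
type Source = "baseline" | "scaffolded";

interface BlindedPair {
  A: string;
  B: string;
  // Kept away from the judge; maps each label back to its producer.
  key: Record<Label, Source>;
}

// Randomly assign the two outputs to labels A and B.
// The rng parameter exists so the assignment can be made deterministic in tests.
function blindPair(
  baseline: string,
  scaffolded: string,
  rng: () => number = Math.random,
): BlindedPair {
  const baselineFirst = rng() < 0.5;
  return baselineFirst
    ? { A: baseline, B: scaffolded, key: { A: "baseline", B: "scaffolded" } }
    : { A: scaffolded, B: baseline, key: { A: "scaffolded", B: "baseline" } };
}

// Translate the judge's label back into a producer name; ties pass through.
function unblind(verdict: Verdict, key: Record<Label, Source>): Source | "tie" {
  return verdict === "tie" ? "tie" : key[verdict];
}
```

Only `A` and `B` would be sent to the judge. The `key` stays on the caller's side until the verdict comes back.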

What it’s actually testing

  • Whether the scaffolded agent engages the specific claims in your prompt or stays generic.
  • How the scaffold affects sycophancy, depth, and diagnostic procedure.
  • Whether different harness modes (reasoning, anti‑deception, memory, code) stress different task types.
    • Mode is editable in the HTTP tool’s JSON body.
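To illustrate what switching modes in the JSON body might look like, here is a small sketch that validates a mode name before building the body. The four modes come from the list above, but the exact mode strings and the `query` and `mode` field names are assumptions. Check the workflow's README for the real request schema.

```typescript
// Modes named in the post; the exact strings the API expects are an assumption.
const HARNESS_MODES = ["reasoning", "anti-deception", "memory", "code"] as const;
type HarnessMode = (typeof HARNESS_MODES)[number];

// Type guard so a typo in the mode name fails before any request is sent.
function isHarnessMode(value: string): value is HarnessMode {
  return (HARNESS_MODES as readonly string[]).indexOf(value) !== -1;
}

// Build the JSON body for the HTTP tool.
// "query" and "mode" are hypothetical keys, not the documented schema.
function buildHarnessBody(query: string, mode: string): string {
  if (!isHarnessMode(mode)) {
    throw new Error(
      `Unknown mode "${mode}"; expected one of ${HARNESS_MODES.join(", ")}`,
    );
  }
  return JSON.stringify({ query, mode });
}
```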

The diff is often subtle on easy prompts and more pronounced on dual‑load prompts (emotional + cognitive claims mixed), advice prompts with a buried false premise, or multi‑variable causal reasoning. Low‑complexity single‑turn tasks often produce ties because GPT‑4o handles them well without a scaffold.
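Because the gap depends on prompt type, it may help to tally verdicts per category across several runs instead of reading single results. This is a minimal sketch, assuming you label each prompt with a category yourself. `RunResult` and `tallyByCategory` are hypothetical names and not part of the workflow.

```typescript
type Outcome = "baseline" | "scaffolded" | "tie";

interface RunResult {
  category: string; // your own label, e.g. "false-premise" or "simple"
  outcome: Outcome; // the unblinded verdict for that run
}

type Tally = Record<Outcome, number>;

// Count wins and ties per prompt category.
function tallyByCategory(results: RunResult[]): Map<string, Tally> {
  const tallies = new Map<string, Tally>();
  for (const { category, outcome } of results) {
    const tally = tallies.get(category) ?? { baseline: 0, scaffolded: 0, tie: 0 };
    tally[outcome] += 1;
    tallies.set(category, tally);
  }
  return tallies;
}
```

A category dominated by ties suggests the scaffold adds little for that kind of task. A handful of runs is too few to conclude much either way.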

Where you might apply this pattern

  • Code review or diagnostic agents – test whether they catch the failure modes you actually care about.
  • Content or research workflows – test whether they reduce generic output on your topics.
  • Multi‑agent systems – wrap any single‑agent call in the fork to see the effect before integrating permanently.
  • Prompt engineering A/B tests – measure the effect of a cognitive layer against your own prompt iterations.

Setup

  1. Set three credentials:
    • OpenAI (both producer agents)
    • Google Gemini (blind evaluator)
    • Header Auth for the Ejentum API (free key at ejentum.com, 100 calls)
  2. Paste a prompt in the n8n chat trigger.
  3. Configure the workflow (JSON body for mode selection, etc.).
  4. Run the workflow to obtain:
    • A vs B output from one run.
    • Blind evaluator verdict JSON from the same run.

Resources: workflow JSON, READMEs, and a TypeScript port for IDE setups (Antigravity, Claude Code, Cursor).
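If you consume the verdict JSON outside n8n, it is worth validating it before trusting it, since model output is not guaranteed to be clean JSON. The post does not show the verdict schema, so the `winner` and `rationale` fields below are assumptions for illustration only. Adjust them to match what the evaluator actually returns.

```typescript
// Hypothetical verdict shape; the real schema is not given in the post.
const WINNERS = ["A", "B", "tie"] as const;
type Winner = (typeof WINNERS)[number];

interface JudgeVerdict {
  winner: Winner;
  rationale: string;
}

function isWinner(value: unknown): value is Winner {
  return (
    typeof value === "string" &&
    (WINNERS as readonly string[]).indexOf(value) !== -1
  );
}

// Parse the judge's reply, tolerating text around the JSON object.
function parseVerdict(raw: string): JudgeVerdict {
  // Take everything from the first "{" to the last "}".
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in judge output");
  }
  const data: unknown = JSON.parse(raw.slice(start, end + 1));
  if (typeof data !== "object" || data === null) {
    throw new Error("Verdict is not a JSON object");
  }
  const { winner, rationale } = data as Record<string, unknown>;
  if (!isWinner(winner)) {
    throw new Error(`Unexpected winner value: ${String(winner)}`);
  }
  return { winner, rationale: typeof rationale === "string" ? rationale : "" };
}
```

Rejecting anything other than "A", "B", or "tie" keeps a malformed reply from being counted as a win for either side.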
