My First Foray into Harness Engineering
Source: Dev.to
Introduction
When our team set out to build BypassHire — an AI tool that cuts job‑application time from 45 minutes to under 5 — we quickly realized that the interesting engineering challenge wasn’t the model. The model was the easy part. The hard part was everything around it.
Martin Fowler recently named that surrounding layer harness engineering. A harness is everything in a coding‑agent setup except the model itself — the guides, sensors, rules, and scaffolding that determine whether the agent produces something you can trust. We spent a significant portion of our sprint designing exactly that, and this post describes what we built and why.
Harness Mechanisms
Fowler divides harness mechanisms into two categories.
Guides (feed‑forward controls)
Guides anticipate problems and shape behavior upfront.
- Project‑level instruction file: CLAUDE.md encodes our conventions, enforced workflow order (/plan → /tdd → /build‑fix → /code‑review → /verify), security rules, and coverage thresholds. Every session starts with Claude reading this file, establishing the rules of the game before any code is generated.
- Specialized sub‑agents: architect, planner, tdd‑guide, code‑reviewer, security‑reviewer (OWASP‑scoped, fires on any PR touching auth routes, DB queries, or Claude API prompts), and build‑error‑resolver. Rather than one general‑purpose agent deciding everything, we built a council of narrow experts. Specialization reduces the surface area each agent must reason over, which increases reliability.
- MCP server integrations: GitHub, Playwright, and Vercel extend the agent’s reach into external systems without loosening its guardrails. The agent can check deployment status or run a browser test; it cannot make unreviewed changes to production.
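To make the security‑reviewer trigger concrete, here is a minimal sketch of how a PR's changed files might be routed to reviewers. The path patterns, the function name reviewersFor, and the choice to always run code‑reviewer are assumptions for illustration; the post names the three trigger areas but not where they live in the repo.

```typescript
// Hypothetical path patterns. The three trigger areas come from the post;
// the concrete directories below are assumed for illustration.
const SECURITY_SENSITIVE_PATTERNS: RegExp[] = [
  /^src\/app\/api\/auth\//, // auth routes (assumed location)
  /^src\/lib\/auth\//, // auth wrapper (assumed to be a directory)
  /^prisma\//, // schema and DB queries (assumed location)
  /^src\/lib\/prompts\//, // Claude API prompts (assumed location)
];

type ReviewAgent = "code-reviewer" | "security-reviewer";

// Returns the review sub-agents that should run for a PR's changed files.
// Assumption: code-reviewer runs on every PR, security-reviewer only when
// at least one changed file matches a sensitive pattern.
function reviewersFor(changedFiles: string[]): ReviewAgent[] {
  const reviewers: ReviewAgent[] = ["code-reviewer"];
  const touchesSensitiveCode = changedFiles.some((file) =>
    SECURITY_SENSITIVE_PATTERNS.some((pattern) => pattern.test(file))
  );
  if (touchesSensitiveCode) {
    reviewers.push("security-reviewer");
  }
  return reviewers;
}
```

Keeping the trigger as a plain list of patterns means the scope of the security review is itself reviewable in a diff.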
Sensors (feedback controls)
Sensors observe what was generated and signal corrections. This is where Claude Code’s hook system becomes infrastructure.
We run four hooks wired into the session lifecycle:
- Lint on edit – ESLint and Prettier fire immediately after any .ts or .tsx file is touched. The feedback loop collapses to milliseconds.
- Test‑runner guard on exit – When Claude tries to end the session, npm test runs automatically. If tests fail, the session is blocked from closing. “I’ll fix it later” becomes structurally impossible.
- Green‑test commit signal – After any test‑runner Bash call exits 0, a hook injects a nudge asking whether to checkpoint. The signal is only a suggestion; Claude decides, and the hook never runs git commit itself.
- Task‑completion commit signal – When a todo transitions to completed, a snapshot diff fires the same nudge for that work slice.
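The decision logic behind two of these hooks can be sketched as pure functions, separate from the script that actually spawns npm test. This is an illustration, not our hook code: the decision/reason output shape reflects one reading of the Claude Code hooks documentation and should be checked against the current docs, and the command matcher and message wording are hypothetical.

```typescript
// Assumed shape of the JSON a Stop hook prints to stdout. Verify the field
// names against the current Claude Code hooks documentation.
interface StopHookOutput {
  decision?: "block";
  reason?: string;
}

// Test-runner guard: keep the session open while the test suite fails.
function stopGuard(testExitCode: number): StopHookOutput {
  if (testExitCode === 0) {
    return {}; // tests pass, so the session may end
  }
  return {
    decision: "block",
    reason: `npm test exited with code ${testExitCode}; fix the failing tests before ending the session.`,
  };
}

// Hypothetical matcher for "this Bash call was a test run".
const TEST_COMMAND = /^(npm|pnpm|yarn) (run )?test\b/;

// Green-test commit signal: returns a suggestion only, never a git command.
function commitNudge(command: string, exitCode: number): string | null {
  if (exitCode !== 0 || !TEST_COMMAND.test(command.trim())) {
    return null;
  }
  return "Tests are green. Consider whether this is a good point to checkpoint with a commit.";
}
```

The nudge is returned as text rather than executed, which mirrors the rule that the hook suggests and Claude decides.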
Fowler distinguishes computational sensors (fast, deterministic: linters, type checkers, tests) from inferential sensors (AI‑powered, slower, but semantically richer). Our hooks are computational. Our code‑reviewer and security‑reviewer sub‑agents are inferential — they catch architectural and behavioral issues that linters cannot see.
Harnessability
Not every codebase is equally amenable to harnesses. Fowler calls this harnessability — the structural properties that make a system legible to agents.
We chose TypeScript, Prisma, Zod, and Next.js App Router deliberately. Strong types, a schema‑validated ORM, and runtime input validation at every API boundary give both human reviewers and agent sensors precise, machine‑readable contracts to work against. The harness is only as good as the surface it grips.
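As an illustration of what a machine‑readable contract at an API boundary looks like, here is a dependency‑free sketch. It is a hand‑rolled stand‑in for the role a Zod schema plays in our stack, not Zod itself, and the payload fields (jobUrl, resumeId) are hypothetical.

```typescript
// Hypothetical request payload; the field names are illustrative only.
interface ApplicationRequest {
  jobUrl: string;
  resumeId: string;
}

// Result shape loosely modeled on a schema library's "safe parse" result.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; issues: string[] };

// Narrows unknown input to ApplicationRequest, or reports every failed field.
function parseApplicationRequest(input: unknown): ParseResult<ApplicationRequest> {
  if (typeof input !== "object" || input === null) {
    return { success: false, issues: ["body must be a JSON object"] };
  }
  const { jobUrl, resumeId } = input as Record<string, unknown>;
  const issues: string[] = [];
  if (typeof jobUrl !== "string" || !/^https?:\/\//.test(jobUrl)) {
    issues.push("jobUrl must be an http(s) URL");
  }
  if (typeof resumeId !== "string" || resumeId.length === 0) {
    issues.push("resumeId must be a non-empty string");
  }
  if (issues.length > 0) {
    return { success: false, issues };
  }
  return {
    success: true,
    data: { jobUrl: jobUrl as string, resumeId: resumeId as string },
  };
}
```

A route handler can return the issues list verbatim on failure, which gives a human reviewer and an agent sensor the same precise signal about what the boundary expects.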
Steering Loop
Fowler’s most important insight is that harnesses require a steering loop — humans monitoring repeated failures and tightening controls in response.
After our first sprint we noticed Claude occasionally drifted on auth patterns; we added an ESLint rule that bans direct imports from @clerk/nextjs/server, forcing all auth through our src/lib/auth wrapper. The harness got smarter because we observed where it was failing.
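One way to express a ban like this is ESLint's built‑in no‑restricted‑imports rule in a flat config. The sketch below is an assumption about the shape of such a rule, not a copy of our config: the file globs, the error message, and whether src/lib/auth is a single file or a directory are all illustrative.

```typescript
// Flat-config sketch. Globs and message text are assumptions.
const restrictedClerkImport = {
  name: "@clerk/nextjs/server",
  message: "Import auth helpers from src/lib/auth instead of Clerk directly.",
};

const config = [
  {
    // Ban the direct import everywhere in TypeScript sources.
    files: ["**/*.ts", "**/*.tsx"],
    rules: {
      "no-restricted-imports": ["error", { paths: [restrictedClerkImport] }],
    },
  },
  {
    // The wrapper is the one place allowed to import Clerk directly.
    // Both a single-file and a directory layout are covered here.
    files: ["src/lib/auth.ts", "src/lib/auth/**/*.ts"],
    rules: {
      "no-restricted-imports": "off",
    },
  },
];

export default config;
```

Because the later config object overrides the earlier one for matching files, the exemption stays scoped to the wrapper and every other import of @clerk/nextjs/server fails lint.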
That loop — observe → encode → enforce — is the actual job of a harness engineer. The model handles generation. We handle the system that keeps generation honest.