Goodhart의 법칙이 에이전트 평가에까지 미쳤다… 그린 대시보드가 의미 상실

발행: 12시간 전 (2026년 6월 21일 AM 10:02 GMT+9)

9 분 소요

출처: Dev.to

에 모든 에이전트 팀의 삶에서 아무도 로드맵에 넣지 않는 특정한 순간이 있다. 당신은 평가 세트를 구축한다. 실제 버그를 포착한다. CI에 연결해 릴리즈 게이트로 만든다. 대시보드가 초록색으로 변한다. 그리고 다음 3개월 동안, 초록색이 의미가 사라진다 — 모두가 여전히 그것이 의미한다고 생각한다.

이것은 굿하트의 법칙이며, 계획하든 안 하든 여러분의 에이전트 평가에까지 올 것이라는 법이다.
""When a measure becomes a target, it ceases to be a good measure.""

평가 세트가 배포할 것을 결정하는 날이 오면,它不再 중립적인 품질 측정치가 되고, 팀이 최적화하려는 대상이 된다. 이는 가설적 위험이 아니라 기본적인 흐름이다. 대부분의 팀은 생산 환경에 “완전히 통과”한 릴리즈가 조용히 상황을 악화시킨 후에야 이를 깨닫는다.

이衰退은 지루하기 때문에 더 위험하다. 다음은 일반적인 시퀀스이다:

You write evals against the bugs you already found. Reasonable.
당신은 이미 발견한 버그에 대한 평가를 작성한다. 타당하다.
But now your suite measures yesterday’ s failure modes, not tomorrow’ s.
하지만 이제 당신의 세트는昨日的 실패 모드를 측정하고, 내일의 실패 모드는 아니다.
A change fails one case. Instead of asking “did we regress?”, someone asks “is the eval too strict?” and tweaks the assertion until it’ s green.
변경 사항이 한 건 실패한다. “다시 후퇴했는가?” 대신 누군가는 “평가가 너무 엄격한가?” 라고 물으며 어설션을 수정해 초록색으로 만든다.
Prompts get tuned to the eval set. Few-shot examples drift toward the exact phrasings your judge rewards.
프롬프트는 평가 세트에 맞게 튜닝된다. 퓨쇼트 예시는 심판이 보상하는 정확한 표현에 기울어 드리프트한다.
The agent gets better at your test cases and no better at the actual job.
에이전트는 테스트 케이스에서는 더 잘하지만 실제 업무에서는 그렇지 않다.
The held-out set quietly becomes the training set. Every case you debug against is a case you’ ve now overfit to.
held‑out 집합이 조용히 훈련 세트로 변한다. 디버깅한 모든 사례는 이미 과적합된 사례가 된다.
The endpoint is an agent with a 98% pass rate that is measurably worse for users — because the score is now measuring how well the agent satisfies the test, not how well it does the work. The map replaced the territory.
엔드포인트는 사용자에게 측정 가능한 더 나빠지는 98% 통과율을 가진 에이전트가 된다 — 점수는 이제 테스트를 만족하는 방법에 초점을 맞추고, 실제 작업을 잘 수행하는지에 대해서는 그렇지 않다.
The cleanest signal that Goodhart has arrived is this — a release passes the gate, and nobody on the team can explain why a specific borderline case passed. It just did. The score is a number with no narrative behind it.
굿하트가 왔다는 가장 깔끔한 신호는 이것이다 — 릴리즈가 게이트를 통과하고 팀 내 누구도 특정 경계 사례가 왜 통과했는지 설명할 수 없다. 그냥 통과했을 뿐이다. 점수는 설명 없이 단순히 숫자일 뿐이다.
That’ s the real problem. A pass/ fail bit is not a measurement you can reason about. It’ s a measurement you can only trust or distrust. And trust, unaudited, always decays toward green.
이것이 진짜 문제다. 패스/패일 비트는 추론할 수 있는 측정이 아니라, 신뢰하거나 불신할 수만 있는 측정이다. 감사받지 않은 신뢰는 언제나 초록색으로 기울어져衰退한다.
This is exactly the seam where the two tools I lean on have to work as one unit, not as separate dashboards.
이것은 내가 의존하는 두 도구가 별도 대시보드 대신 하나로 동작해야 하는 정확히 그 지점이다.

agent-eval scores and gates the output. It runs the deterministic checks, the model- as-judge rubrics, the drift and hallucination signals — and it returns a verdict on what the agent produced.
AgentLens captures the trace of how the agent got there. Every model call and tool step, the resolved inputs (after templating, not the raw template), and the raw outputs before any post-processing.

Neither half is sufficient alone, and that’ s the entire point. A bare eval score is a target waiting to be gamed. A bare trace is forensic data with no verdict attached. You need agent- eval’ s score anchored to AgentLens’ s trace so that every gate decision carries a “show me why” attached to it.

When a borderline case flips, you don’ t argue about whether the eval is too strict — you open the trace, see the resolved prompt and the exact tool output, and find out whether the agent actually reasoned correctly or got lucky on a phrasing.
그 연결은 측정을 honesty하게 유지하는 것이다. 평가는 게이트가 뒤집혔다는 것을 알려주고, 트레이스는 그 뒤집이가 진정한 것인지를 말해준다.

The anti-pattern is a gate that returns a boolean and nothing else:

// Goodhart bait: a verdict with no evidence behind it.
async function gate( testCase: TestCase): Promise {
  const output = await runAgent(testCase.input);
  return judge(output, testCase.expected) >= 0.8; // green or red, no "why"
}

The fix is to make the score and the trace travel together, so a passing case is auditable, not just countable:

import { evaluate } from "agent-eval";
import { trace } from "agentlens";

interface GatedResult {
  passed: boolean;
  score: number;
  traceId: string;      // the receipt
  heldOut: boolean;     // was this case ever debugged against?
}

async function gatedRun(testCase: TestCase): Promise {
   // AgentLens records every model + tool step, resolved inputs, raw outputs.
  const session = trace.start({ caseId: testCase.id });

  const output = await runAgent(testCase.input, { trace: session });

   // agent-eval scores the OUTPUT: deterministic checks + judge rubric + drift.
  const verdict = await evaluate(output, {
    expected: testCase.expected,
    checks: ["schema", "grounding", "drift"],
    judge: "rubric-v3",
  });

  await session.attach({ verdict }); // bind score & trace

  return {
    passed: verdict.score >= 0.8,
    score: verdict.score,
    traceId: session.id,          // open this to see WHY it passed
    heldOut: testCase.heldOut,    // overfit guard, see below
  };
}

Two things in that snippet are doing the anti-Goodhart work. The traceId means no pass is unexplainable — every green is one click from its own evidence. And heldOut is the discipline that keeps the suite from collapsing into a training set.

Tooling won’ t save you from Goodhart on its own. The process around it has to hold the line:

Quarantine a held- out set you never debug against. If you’ ve ever opened the trace for a case to fix a failure, that case is burned for measurement — it’ s now a regression test, not an evaluation. Keep a rotating set you only ever score, never tune toward. When held-out and debugged scores diverge, that gap is your overfit, measured directly.

The uncomfortable conclusion
A green eval dashboard is not evidence that your agent is good. It is evidence that your agent satisfies your evals — and those are only the same thing while you’ re actively defending the gap between them.

The teams that ship reliable agents aren’ t the ones with the highest pass rates. They’ re the ones who can pull up any green checkmark and explain, from the trace, exactly why it earned the pass. agent-eval gives you the verdict; AgentLens gives you the receipt. Keep them bound together, keep a real held-out set, and your dashboard might actually keep meaning something six months from now.

Most won’ t. Now you know why.

Goodhart의 법칙이 에이전트 평가에까지 미쳤다… 그린 대시보드가 의미 상실

관련 글

Build Rails, Not Trains: A Framework for AI Infrastructure in the Global South

AI PR의 CI 검증을 위한 재현 가능한 증거 필요

출력은 저렴하다. 판단은 일이다.

NFT의 진화