We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of 'trend'

Published: (June 11, 2026 at 03:16 PM EDT)
2 min read
Source: Dev.to


We track weekly agreement between an LLM judge and human labels (Cohen’s kappa) on a sample of production traces. For three weeks the point estimates told a story: 0.55, then 0.49, then 0.44. The team started hunting for what “broke” the judge.

Then we bootstrapped confidence intervals on each weekly number. At our sample size (50 traces a week), the 95% intervals were roughly plus or minus 0.15. All three weekly estimates sat inside one another’s intervals. The decline we had spent two days investigating was indistinguishable from noise.

We changed three things.

Stratify the weekly sample by score band and intent instead of sampling uniformly. Rare-but-important slices stopped vanishing from some weeks, which had been a major source of week-to-week wobble.

Report the interval, not the point. The dashboard shows the band. Nobody reacts to a movement smaller than the band. This alone has prevented at least two more pointless investigations.

Escalate on sustained shifts only: consecutive weeks outside the prior band, not a single bad reading.
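As a rough sketch of how the stratified sampling and the escalation rule could look in code: the function names, the `score_band` and `intent` trace fields, and the reading of "prior band" as the interval from the last week before the run are all assumptions made for illustration, not the actual pipeline.

```python
import random
from collections import defaultdict


def stratified_sample(traces, n, key, seed=0):
    """Draw roughly n traces, proportionally per stratum, but never
    fewer than one per stratum so rare slices appear every week.
    The result can be slightly larger or smaller than n due to rounding."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in traces:
        strata[key(t)].append(t)
    total = len(traces)
    sample = []
    for members in strata.values():
        k = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


def should_escalate(estimates, bands, run=2):
    """Return True only if the last `run` weekly estimates all fall
    outside the band of the week just before that run (assumed meaning
    of "prior band"). bands[i] is the (lo, hi) interval for week i."""
    if len(estimates) < run + 1:
        return False
    lo, hi = bands[-(run + 1)]
    return all(not (lo <= e <= hi) for e in estimates[-run:])


# The three weeks from the post, with bands of roughly +/- 0.15.
weekly = [0.55, 0.49, 0.44]
bands = [(0.40, 0.70), (0.34, 0.64), (0.29, 0.59)]
print(should_escalate(weekly, bands))  # False
```

Under these assumptions the three-week "decline" never triggers an escalation, because 0.49 and 0.44 both sit inside the first week's band.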

The part that surprised me

How rare this practice is. Most eval dashboards I have seen show single kappa or accuracy numbers with no uncertainty at all, and teams retune judges off moves of 0.05. We would never accept that for an A/B test; somehow it became normal for eval metrics.

The bootstrap itself is short:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def kappa_ci(judge, human, n_boot=2000, alpha=0.05):
    # Percentile bootstrap interval for Cohen's kappa.
    judge, human = np.asarray(judge), np.asarray(human)  # accept plain lists
    idx = np.arange(len(judge))
    stats = []
    for _ in range(n_boot):
        # Resample trace indices with replacement so label pairs stay aligned.
        s = np.random.choice(idx, size=len(idx), replace=True)
        stats.append(cohen_kappa_score(judge[s], human[s]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```
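To get a feel for how the band width depends on the weekly sample size, here is a dependency-free simulation sketch. It runs on synthetic labels, not on any real traces: `synthetic_pairs`, `bootstrap_width`, the binary label set, and the 75% agreement rate are all made up for illustration, and the percentile lookup is approximate.

```python
import math
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_e == 1.0:
        return float("nan")  # undefined when only one label occurs
    return (p_o - p_e) / (1 - p_e)


def synthetic_pairs(n, agree=0.75, rng=None):
    """Synthetic binary human labels plus a judge that agrees with
    probability `agree` (an arbitrary illustrative value)."""
    rng = rng or random.Random(0)
    human = [rng.randint(0, 1) for _ in range(n)]
    judge = [h if rng.random() < agree else 1 - h for h in human]
    return judge, human


def bootstrap_width(judge, human, n_boot=500, alpha=0.05, rng=None):
    """Width of an approximate percentile bootstrap interval for kappa."""
    rng = rng or random.Random(0)
    n = len(judge)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        k = cohen_kappa([judge[i] for i in idx], [human[i] for i in idx])
        if not math.isnan(k):
            stats.append(k)
    stats.sort()
    lo = stats[int(len(stats) * alpha / 2)]
    hi = stats[int(len(stats) * (1 - alpha / 2)) - 1]
    return hi - lo


for n in (50, 500):
    judge, human = synthetic_pairs(n, rng=random.Random(1))
    print(n, round(bootstrap_width(judge, human), 3))
```

The exact widths depend on the label distribution and agreement rate, so treat the printed numbers as a shape, not a calibration: the interval at 50 samples should come out several times wider than at 500.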

Open question I am still chewing on: consecutive-weeks-outside-band is a crude escalation rule. If you use something sharper for eval metrics (CUSUM, control charts), I would like to hear how it behaves in practice on noisy judge data.
