We put confidence intervals on our LLM-judge scores. The error bars ate three weeks of 'trend'

Published: June 11, 2026 at 03:16 PM EDT

Source: Dev.to

We track weekly agreement between an LLM judge and human labels (Cohen’s kappa) on a sample of production traces. For three weeks the point estimates told a story: 0.55, then 0.49, then 0.44. The team started hunting for what “broke” the judge.

Then we bootstrapped confidence intervals on each weekly number. At our sample size (50 traces a week), the 95% intervals were roughly ±0.15. All three weekly estimates sat inside one another’s intervals. The decline we had spent two days investigating was indistinguishable from noise.

We changed three things.

Stratify the weekly sample by score band and intent instead of sampling uniformly. Rare-but-important slices stopped vanishing from some weeks, which had been a major source of week-to-week wobble.
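A stratified draw of this kind could be sketched as below. This is an illustration under stated assumptions, not our production sampler: each trace is assumed to be a dict with hypothetical `score_band` and `intent` keys, the budget is split roughly in proportion to stratum size, and `min_per_stratum` is an invented floor that keeps rare slices in every week.

```python
import random
from collections import defaultdict

def stratified_sample(traces, n_total=50, min_per_stratum=1, seed=0):
    """Draw about n_total traces, stratified by (score_band, intent).

    The field names and the proportional-with-floor allocation are
    assumptions made for this sketch.
    """
    rng = random.Random(seed)

    # Group traces by stratum key.
    strata = defaultdict(list)
    for t in traces:
        strata[(t["score_band"], t["intent"])].append(t)

    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional share of the budget, floored so rare slices always appear.
        share = round(n_total * len(members) / len(traces))
        k = min(len(members), max(min_per_stratum, share))
        sample.extend(rng.sample(members, k))
    return sample
```

Because of rounding and the floor, the result can be a trace or two off `n_total`. Note also that the floor over-represents rare strata relative to production traffic, so a pooled kappa computed on this sample is not quite the same quantity as one from a uniform sample.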

Report the interval, not the point. The dashboard shows the band. Nobody reacts to a movement smaller than the band. This alone has prevented at least two more pointless investigations.
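As a rough illustration of the "smaller than the band" rule, the check below uses the three weekly estimates from above and treats the interval as symmetric with a half-width of 0.15. That symmetry is a simplification (bootstrap intervals generally are not symmetric), and `within_band` is a hypothetical helper name.

```python
def within_band(current, previous, half_width):
    """True if the move between two readings is smaller than the interval half-width."""
    return abs(current - previous) < half_width

weeks = [0.55, 0.49, 0.44]   # weekly kappa point estimates from the post
half_width = 0.15            # approximate 95% half-width at 50 traces per week

for prev, cur in zip(weeks, weeks[1:]):
    verdict = "ignore" if within_band(cur, prev, half_width) else "look closer"
    print(f"{prev:.2f} -> {cur:.2f}: {verdict}")
```

Both week-over-week moves (0.06 and 0.05) come out as "ignore", and so does the full three-week drop of 0.11.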

Escalate on sustained shifts only: consecutive weeks outside the prior band, not a single bad reading.
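One possible reading of that escalation rule, as a sketch: take the interval from a reference week as the "prior band", and escalate only when a run of consecutive weekly point estimates falls outside it. The function name, the choice of a fixed reference band, and the default of two consecutive weeks are assumptions for illustration.

```python
def should_escalate(weekly_kappas, prior_band, needed=2):
    """Return True once `needed` consecutive weekly estimates fall outside prior_band.

    prior_band is a (lo, hi) interval from a reference week; `needed` is a
    hypothetical default, not a recommendation.
    """
    lo, hi = prior_band
    streak = 0
    for k in weekly_kappas:
        if k < lo or k > hi:
            streak += 1
            if streak >= needed:
                return True
        else:
            streak = 0  # a reading back inside the band resets the count
    return False

# The post's numbers: first-week band of roughly 0.55 +/- 0.15, then 0.49 and 0.44.
print(should_escalate([0.49, 0.44], prior_band=(0.40, 0.70)))  # False
```

A single bad week followed by a recovery never triggers, which is the point of the rule.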

The part that surprised me

How rare this practice is. Most eval dashboards I have seen show single kappa or accuracy numbers with no uncertainty at all, and teams retune judges off moves of 0.05. We would never accept that for an A/B test; somehow it became normal for eval metrics.

The bootstrap itself is short:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_ci(judge, human, n_boot=2000, alpha=0.05):
    # Percentile bootstrap interval for Cohen's kappa.
    # judge and human are aligned label sequences, one entry per trace.
    judge, human = np.asarray(judge), np.asarray(human)
    idx = np.arange(len(judge))
    stats = []
    for _ in range(n_boot):
        # Resample traces with replacement, keeping judge/human pairs together.
        s = np.random.choice(idx, size=len(idx), replace=True)
        stats.append(cohen_kappa_score(judge[s], human[s]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```

Open question I am still chewing on: consecutive-weeks-outside-band is a crude escalation rule. If you use something sharper for eval metrics (CUSUM, control charts), I would like to hear how it behaves in practice on noisy judge data.
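For concreteness, this is roughly what a one-sided (downward) tabular CUSUM would look like on a weekly kappa series. It is a textbook-style sketch, not something we have run on real judge data; `target`, `slack`, and `threshold` are parameters you would have to choose, and the values in the usage lines are purely illustrative.

```python
def cusum_lower(values, target, slack, threshold):
    """One-sided (downward) tabular CUSUM.

    Returns the index of the first reading at which the cumulative shortfall
    below `target` (beyond `slack` per reading) exceeds `threshold`, or None.
    """
    s = 0.0
    for i, x in enumerate(values):
        # Accumulate shortfall below target beyond the allowed slack; never go negative.
        s = max(0.0, s + (target - x) - slack)
        if s > threshold:
            return i
    return None

# Illustrative parameters only: target = first-week kappa, slack = half of a 0.15 shift.
print(cusum_lower([0.55, 0.49, 0.44], target=0.55, slack=0.075, threshold=0.3))  # None
```

A common convention is to set the slack to about half the shift you care about detecting and the threshold to a few standard deviations of the metric, but whether that holds up with intervals as wide as ours is exactly the open question.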
