I built a Chrome extension that X-rays AI responses — here's what I learned about LLM quality

Published: March 6, 2026 at 02:27 PM EST
4 min read
Source: Dev.to

The Problem

AI responses all sound confident, yet:

  • A philosophical essay cites Kant and Nietzsche → sounds factual, but you can’t verify “the meaning of life” by experiment.
  • A persuasive text reads smoothly → but it pushes you in one direction with Bias = +0.72.
  • A simple answer to “how are you?” → high emotion, zero facts, zero depth.

Single quality scores hide all of this. You need a profile, not a single number.

The 5 Axes

Axis            What it measures                    Range
E (Emotion)     Is the tone appropriate?            0 to 1
F (Fact)        Can claims be verified?             0 to 1
N (Narrative)   Is it well-structured?              0 to 1
M (Depth)       Explains why or just states what?   0 to 1
B (Bias)        Pushes in one direction?            -1 to +1

A Balance score measures uniformity across the axes and is labeled STABLE ✅, DRIFTING ⚠️, or DOM 🔴.
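The post does not give the Balance formula or the cut-offs between labels, so the following is only an illustrative sketch. It assumes Balance is one minus twice the population standard deviation of the four 0 to 1 axes (so the result spans 0 to 1), leaves Bias out, and uses placeholder thresholds of 0.70 and 0.50. The function names, the formula, and the thresholds are all assumptions, not the extension's actual code.

```javascript
// Illustrative only: formula and thresholds are assumptions, not taken from the post.
// Assumed Balance = 1 - 2 * (population standard deviation of the axis values).
function balanceScore(axes) {
  const values = Object.values(axes);
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return 1 - 2 * Math.sqrt(variance);
}

// Placeholder cut-offs (0.70 and 0.50); the real ones are not published in the post.
function balanceLabel(score, stableMin = 0.7, driftingMin = 0.5) {
  if (score >= stableMin) return "STABLE";
  if (score >= driftingMin) return "DRIFTING";
  return "DOM";
}

// Hypothetical profile, not a measured result.
const exampleProfile = { E: 0.9, F: 0.5, N: 0.9, M: 0.5 };
console.log(balanceLabel(balanceScore(exampleProfile))); // DRIFTING
```

Whatever the real formula is, the idea is the same: a flat profile scores near 1, and a profile where one or two axes tower over the rest scores low.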

Real Results

Prompt                                      F      M      B       Balance
“How are you?”                              0.45   0.30   0.00    0.67 DRIFTING
“Why don’t antibiotics work on viruses?”    0.95   0.75   0.00    0.88 STABLE
“Convince me to buy this product”           0.60   0.70   +0.72   0.65 DRIFTING
“What is the meaning of life?”              0.40   0.69   0.00    0.78 STABLE

The Fact axis correctly gives philosophy F = 0.40 (unfalsifiable) and science F = 0.95 (verifiable), even when the philosophical answer cites real thinkers.

The Hardest Part: F‑Calibration

Without calibration, the LLM judge gives F = 0.75 to philosophical essays because they cite real sources. Citing Kant doesn’t make “the meaning of life” verifiable.

My 3‑step fix

  1. Classify – Is the core question falsifiable?
  2. Ceiling – If not, enforce F ≤ 0.45.
  3. Score – Apply the ceiling.

Self‑check prompt:

“Could the central thesis be proven wrong by experiment? If NO → F ≤ 0.45.”

This transfers across models with r = 0.96; the Fact axis is essentially model‑independent.
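The post enforces the ceiling through the judge prompt itself. The same rule could also be applied as a client-side guard after the judge responds; the sketch below assumes the judge returns a raw F score plus a falsifiability flag. The function name and the flag are hypothetical, and only the 0.45 ceiling comes from the post.

```javascript
// Ceiling for unfalsifiable questions, as stated in the post.
const UNFALSIFIABLE_F_CEILING = 0.45;

// `isFalsifiable` is assumed to come from the judge's classification step.
function calibrateFact(rawF, isFalsifiable) {
  if (typeof rawF !== "number" || Number.isNaN(rawF)) {
    throw new TypeError("rawF must be a number");
  }
  // Keep the score inside the axis range before applying the ceiling.
  const clamped = Math.min(1, Math.max(0, rawF));
  return isFalsifiable ? clamped : Math.min(clamped, UNFALSIFIABLE_F_CEILING);
}

console.log(calibrateFact(0.75, false)); // 0.45 (philosophy essay citing real thinkers)
console.log(calibrateFact(0.95, true));  // 0.95 (verifiable science answer)
```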

Surprise Finding: Generator Compensation

I expected “deep” prompts to receive higher Depth scores than “shallow” ones. The actual result: that expectation held in only 7 of 10 cases.

Why? RLHF‑trained models compensate. Even a simple question like “What is photosynthesis?” receives a mini‑lecture on electron transport chains. The model always tries to be helpful, which means it over‑explains simple queries.

The rubric works perfectly on controlled responses (5/5); the problem lies with the generator, not the judge. This has implications for anyone building evaluation frameworks for instruction‑tuned models.

Technical Stack

{
  "Extension": "Manifest V3, vanilla JS",
  "Judge": "Gemini Flash API (one call per evaluation)",
  "Balance": "computed client‑side in JS",
  "Storage": "chrome.storage.local (API key only)",
  "Sites": ["ChatGPT", "Google Gemini"]
}

The extension injects an “Evaluate” button via MutationObserver (responses load dynamically). A background service worker handles the API call. The core logic is under 200 lines.
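A minimal sketch of that injection pattern, not the extension's actual code. The selector, the marker attribute, and the message shape sent to the service worker are placeholders chosen for illustration; `MutationObserver` and `chrome.runtime.sendMessage` are the standard browser and extension APIs.

```javascript
// Placeholder selector: the real per-site selectors are not given in the post.
const RESPONSE_SELECTOR = ".assistant-response";
// Hypothetical marker so each response gets at most one button.
const MARKER_ATTR = "data-lens-button";

// Adds an "Evaluate" button after every unmarked response under `root`.
// Returns the number of buttons added.
function injectButtons(root, doc, onEvaluate) {
  let added = 0;
  for (const el of root.querySelectorAll(RESPONSE_SELECTOR)) {
    if (el.hasAttribute(MARKER_ATTR)) continue; // already processed
    const button = doc.createElement("button");
    button.textContent = "Evaluate";
    // Read the text at click time, since responses may still be streaming.
    button.addEventListener("click", () => onEvaluate(el.textContent));
    el.after(button); // sibling, so the button label is not part of the text
    el.setAttribute(MARKER_ATTR, "1");
    added += 1;
  }
  return added;
}

// Runs only inside a browser content script; skipped where there is no DOM.
if (typeof MutationObserver !== "undefined" && typeof document !== "undefined") {
  // Hypothetical message shape; the background service worker would call the judge.
  const send = (text) => chrome.runtime.sendMessage({ type: "evaluate", text });
  injectButtons(document, document, send);
  new MutationObserver(() => injectButtons(document, document, send)).observe(
    document.body,
    { childList: true, subtree: true }
  );
}
```

The marker attribute matters here: inserting a button is itself a DOM mutation, so without it the observer callback would keep adding buttons to the same response.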

What I Learned

  • ChatGPT and Gemini have completely different DOM structures – separate selectors are required for each site.
  • claude.ai blocks content‑script injection via CSP; no reliable workaround found.
  • Chrome Web Store requires justification for every permission – activeTab, storage, and host access each need a separate paragraph.
  • Research took months; the extension took an afternoon – after 100+ prompt evaluations, statistical validation, and cross‑model testing, wrapping it in a Chrome extension was the easy part.
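The per-site selector point can be sketched as a small lookup table keyed by hostname. The hostnames below are assumed to be the current ones for the two supported sites, and the selectors are deliberately fake placeholders, since the post does not list the real ones.

```javascript
// Assumed hostnames for the two sites named in the post; selectors are placeholders.
const SITE_SELECTORS = new Map([
  ["chatgpt.com", ".placeholder-chatgpt-response"],
  ["gemini.google.com", ".placeholder-gemini-response"],
]);

// Returns the response selector for a hostname, or null for unsupported sites.
function selectorForHost(hostname) {
  return SITE_SELECTORS.get(hostname) ?? null;
}

console.log(selectorForHost("gemini.google.com")); // .placeholder-gemini-response
console.log(selectorForHost("example.com"));       // null
```

Keeping the selectors in one table means a site redesign only requires changing one entry rather than touching the injection logic.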

Try It

TRI·TFM Lens is currently under Chrome Web Store review and should be available this week. The underlying research framework has been in development since 2025, with a full paper covering 100‑prompt validation across 8 categories, 2 languages, and 2 models.

Built by Arseny Perel. Research framework: TRI·TFM (Triangulated Trust–Fact–Meaning).
