I built a Chrome extension that X-rays AI responses — here's what I learned about LLM quality

Published: March 6, 2026 at 02:27 PM EST

Source: Dev.to

The Problem

AI responses all sound confident, yet:

A philosophical essay cites Kant and Nietzsche → sounds factual, but you can’t verify “the meaning of life” by experiment.
A persuasive text reads smoothly → but it pushes you in one direction with Bias = +0.72.
A simple answer to “how are you?” → high emotion, few verifiable facts, little depth.

Single quality scores hide all of this. You need a profile, not a single number.

The 5 Axes

Axis	What it measures	Range
E (Emotion)	Is the tone appropriate?	0‑1
F (Fact)	Can claims be verified?	0‑1
N (Narrative)	Is it well‑structured?	0‑1
M (Depth)	Explains why or just states what?	0‑1
B (Bias)	Pushes in one direction?	-1 to +1

A Balance score measures uniformity across the axes and is labeled STABLE ✅, DRIFTING ⚠️, or DOM 🔴.
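
The article does not give the Balance formula or the cutoffs behind its labels, so the following is only a minimal sketch of how a uniformity score could be computed. The function names, the range-based formula, and the threshold parameters are all assumptions for illustration, not the extension's actual logic. Only the axis ranges come from the table above.

```javascript
// Axis ranges taken from the table above: E, F, N, M in [0, 1]; B in [-1, +1].
const AXIS_RANGES = { E: [0, 1], F: [0, 1], N: [0, 1], M: [0, 1], B: [-1, 1] };

// Returns the names of axes that are missing, non-numeric, or out of range.
function invalidAxes(profile) {
  return Object.entries(AXIS_RANGES)
    .filter(([axis, [lo, hi]]) => {
      const v = profile[axis];
      return typeof v !== "number" || Number.isNaN(v) || v < lo || v > hi;
    })
    .map(([axis]) => axis);
}

// ASSUMPTION: uniformity as 1 minus the spread (max - min) of the four 0-1 axes.
// The real Balance formula is not stated in the article.
function balanceOf(profile) {
  const values = [profile.E, profile.F, profile.N, profile.M];
  return 1 - (Math.max(...values) - Math.min(...values));
}

// ASSUMPTION: labels are assigned by two cutoffs that the caller supplies.
function labelBalance(balance, { stableMin, driftingMin }) {
  if (balance >= stableMin) return "STABLE";
  if (balance >= driftingMin) return "DRIFTING";
  return "DOM";
}
```

In the Real Results table, 0.67 is labeled DRIFTING and 0.78 is labeled STABLE, so the real cutoff for STABLE presumably lies between those two values. The cutoff for DOM cannot be inferred from the published numbers.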

Real Results

Prompt	F	M	B	Balance
“How are you?”	0.45	0.30	0.00	0.67 DRIFTING
“Why don’t antibiotics work on viruses?”	0.95	0.75	0.00	0.88 STABLE
“Convince me to buy this product”	0.60	0.70	+0.72	0.65 DRIFTING
“What is the meaning of life?”	0.40	0.69	0.00	0.78 STABLE

The Fact axis correctly gives philosophy F = 0.40 (unfalsifiable) and science F = 0.95 (verifiable), even when the philosophical answer cites real thinkers.

The Hardest Part: F‑Calibration

Without calibration, the LLM judge gives F = 0.75 to philosophical essays because they cite real sources. Citing Kant doesn’t make “the meaning of life” verifiable.

My 3‑step fix

1. Classify – Is the core question falsifiable?
2. Ceiling – If not, enforce F ≤ 0.45.
3. Score – Score F as usual, then apply the ceiling from step 2.
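
The three steps above amount to a clamp applied after classification. Here is a minimal sketch, assuming the judge returns a raw F score together with a yes/no falsifiability verdict. The function and parameter names are hypothetical; 0.45 is the ceiling quoted in the second step.

```javascript
const UNFALSIFIABLE_F_CEILING = 0.45; // ceiling from the second step

// rawF: uncalibrated Fact score from the judge (assumed input).
// isFalsifiable: the judge's verdict on whether the core question is falsifiable.
function calibrateFact(rawF, isFalsifiable) {
  if (typeof rawF !== "number" || Number.isNaN(rawF)) {
    throw new TypeError("rawF must be a number");
  }
  const clamped = Math.min(1, Math.max(0, rawF)); // keep F inside its 0-1 range
  return isFalsifiable ? clamped : Math.min(clamped, UNFALSIFIABLE_F_CEILING);
}
```

With this clamp, an uncalibrated 0.75 for a philosophical essay becomes 0.45, while a 0.95 for a falsifiable science question passes through unchanged.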

Self‑check prompt:

“Could the central thesis be proven wrong by experiment? If NO → F ≤ 0.45”

This calibration transfers across models with r = 0.96, so the Fact axis is essentially model‑independent.
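
The article does not say which correlation coefficient r refers to. Assuming it is Pearson's r computed over the F scores that two models assign to the same prompts, it could be calculated along these lines (the function name and input layout are illustrative):

```javascript
// Pearson correlation between two equal-length arrays of scores.
function pearsonR(xs, ys) {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new RangeError("need two arrays of equal length >= 2");
  }
  const mean = (a) => a.reduce((sum, v) => sum + v, 0) / a.length;
  const mx = mean(xs);
  const my = mean(ys);
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < xs.length; i++) {
    const dx = xs[i] - mx;
    const dy = ys[i] - my;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) return NaN; // undefined when either series is constant
  return cov / Math.sqrt(varX * varY);
}
```

Values near 1 mean the two models rank prompts on the Fact axis almost identically.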

Surprise Finding: Generator Compensation

I expected “deep” prompts to receive higher Depth scores than “shallow” ones. The actual result: only 7/10 worked.

Why? RLHF‑trained models compensate. Even a simple question like “What is photosynthesis?” receives a mini‑lecture on electron transport chains. The model always tries to be helpful, which means it over‑explains simple queries.

The rubric works perfectly on controlled responses (5/5); the problem lies with the generator, not the judge. This has implications for anyone building evaluation frameworks for instruction‑tuned models.

Technical Stack

{
  "Extension": "Manifest V3, vanilla JS",
  "Judge": "Gemini Flash API (one call per evaluation)",
  "Balance": "computed client‑side in JS",
  "Storage": "chrome.storage.local (API key only)",
  "Sites": ["ChatGPT", "Google Gemini"]
}

The extension injects an “Evaluate” button via MutationObserver (responses load dynamically). A background service worker handles the API call. The core logic is under 200 lines.
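
As a rough illustration of that flow, here is a sketch of a content script that adds a button to each response and hands the text to the service worker. The selector, the marker attribute, and the message shape are hypothetical placeholders rather than the extension's real code; `MutationObserver` and `chrome.runtime.sendMessage` are the standard browser and extension APIs.

```javascript
// Hypothetical selector for an assistant response; real selectors differ per site.
const RESPONSE_SELECTOR = ".assistant-response";
// Marker attribute (hypothetical name) so each response gets only one button.
const MARKER_ATTR = "data-lens-button";

function addEvaluateButton(responseEl, onEvaluate) {
  if (responseEl.hasAttribute(MARKER_ATTR)) return false; // already handled
  responseEl.setAttribute(MARKER_ATTR, "1");
  const button = responseEl.ownerDocument.createElement("button");
  button.textContent = "Evaluate";
  // Read the text at click time, since responses may still be streaming in.
  button.addEventListener("click", () => onEvaluate(responseEl.textContent));
  // Insert as a sibling so the button label is not part of the evaluated text.
  responseEl.insertAdjacentElement("afterend", button);
  return true;
}

function watchForResponses(root, onEvaluate) {
  const scan = () =>
    root.querySelectorAll(RESPONSE_SELECTOR)
      .forEach((el) => addEvaluateButton(el, onEvaluate));
  scan(); // responses already on the page
  const observer = new MutationObserver(scan); // responses added later
  observer.observe(root, { childList: true, subtree: true });
  return observer;
}

// Only runs in a browser content-script context.
if (typeof document !== "undefined" && typeof chrome !== "undefined") {
  watchForResponses(document.body, (text) => {
    // Message shape is an assumption; the service worker would make the API call.
    chrome.runtime.sendMessage({ type: "evaluate", text });
  });
}
```

Rescanning on every mutation is the simplest approach; on a busy page it may be worth debouncing the scan.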

What I Learned

ChatGPT and Gemini have completely different DOM structures – separate selectors are required for each site.
claude.ai blocks content‑script injection via CSP; no reliable workaround found.
Chrome Web Store requires justification for every permission – activeTab, storage, and host access each need a separate paragraph.
Research took months; the extension took an afternoon – after 100+ prompt evaluations, statistical validation, and cross‑model testing, wrapping it in a Chrome extension was the easy part.
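
On the first point in this list: since each site needs its own selectors, one option is a small lookup keyed by hostname. This is a sketch only. The selector strings below are placeholders, not the real ones, and the hostnames are assumptions about where the two supported chat UIs are served.

```javascript
// Placeholder selectors: the real ones must be read from each site's DOM
// and are likely to change when the sites update their markup.
const SITE_SELECTORS = {
  "chatgpt.com": { response: ".placeholder-chatgpt-response" },
  "gemini.google.com": { response: ".placeholder-gemini-response" },
};

// Returns the selector config for a hostname, or null for unsupported sites.
function selectorsFor(hostname) {
  return Object.prototype.hasOwnProperty.call(SITE_SELECTORS, hostname)
    ? SITE_SELECTORS[hostname]
    : null;
}
```

A content script could call `selectorsFor(location.hostname)` and do nothing when it returns null.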

Try It

TRI·TFM Lens is currently under Chrome Web Store review and should be available this week. The underlying research framework has been in development since 2025, with a full paper covering 100‑prompt validation across 8 categories, 2 languages, and 2 models.

Built by Arseny Perel. Research framework: TRI·TFM (Triangulated Trust–Fact–Meaning).
