The Story of Making AI Indistinguishable from Humans: Implementing a Turing Test with LLM Judges

Published: March 28, 2026 at 01:33 AM EDT

Source: Dev.to

Starting from HL 4.1

The first prototype of human‑persona scored 4.1 / 10 on “human‑likeness.”
That is well below the line between “AI‑like” and “human‑like,” and this was for a system I had both built and scored myself.

It took five versions to raise this to HL 7.7. This article tells the story of that journey—what I tried, what didn’t work, and what worked dramatically.

Evaluation Method: LLM Judge

I had Claude Sonnet act as an “expert in distinguishing humans from AI” to score the outputs.

JUDGE_PROMPT = """
You are an expert in distinguishing humans from AI.
Evaluate the following message and respond with JSON only:

{
  "human_likeness_score": 1-10,
  "style_variation_rate": 0.0-1.0,
  "timing_naturalness": 1-10,
  "reason_human_likeness": "Reason in one sentence",
  "improvement_suggestion": "Improvement suggestion in one sentence"
}
"""
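
The prompt asks for JSON only, but a judge model can still wrap its answer in extra text or return a value outside the stated range. A minimal sketch of a defensive parser, assuming the reply arrives as a plain string; `parse_judge_reply` and `SCORE_RANGES` are hypothetical names for illustration, not part of human‑persona:

```python
import json

# Allowed range for each numeric field requested in JUDGE_PROMPT.
SCORE_RANGES = {
    "human_likeness_score": (1, 10),
    "style_variation_rate": (0.0, 1.0),
    "timing_naturalness": (1, 10),
}


def parse_judge_reply(raw: str) -> dict:
    """Parse the outermost {...} span of the reply and validate score ranges."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    data = json.loads(raw[start : end + 1])
    for key, (low, high) in SCORE_RANGES.items():
        value = data.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} is missing or not a number")
        if not low <= value <= high:
            raise ValueError(f"{key}={value} is outside [{low}, {high}]")
    return data
```

Rejecting malformed replies instead of silently coercing them keeps a bad judge response from skewing an averaged score.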

Three metrics

Metric	Meaning	Target Value
HL (human_likeness_score)	How non‑AI‑like it is	≥ 7.5
SV (style_variation_rate)	How uniform the style is; despite the field name, lower is better	≤ 0.35
TN (timing_naturalness)	Is the timing natural?	≥ 6.0
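
Because HL and TN are floors while SV is a ceiling, it is easy to compare a score in the wrong direction. A small sketch of a target check using the thresholds from the table; the `TARGETS` layout and the `check_targets` name are assumptions for illustration:

```python
# Targets from the metrics table: HL and TN are minimums, SV is a maximum.
TARGETS = {
    "HL": (">=", 7.5),
    "SV": ("<=", 0.35),
    "TN": (">=", 6.0),
}


def check_targets(scores: dict) -> dict:
    """Return, per metric, whether the score meets its target."""
    results = {}
    for name, (direction, threshold) in TARGETS.items():
        value = scores[name]
        if direction == ">=":
            results[name] = value >= threshold
        else:
            results[name] = value <= threshold
    return results
```

Calling `check_targets({"HL": 7.7, "SV": 0.36, "TN": 5.5})` with the final scores reported later in this article marks HL as met and SV and TN as missed.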

v1 – Returning Only Parameters (HL 4.1)

base_persona.py returned only emotional state, recommended style, and response delay. Text generation was manual.

HL: 4.1 / SV: 0.64 / TN: 4.1

Judge’s diagnosis: “The writing style is too uniform. The sentences have the same structure every time.”

The parameters weren’t reflected in the text – it was like having blueprints but not building the house.

v2 – Text Generation with Anthropic API (HL 6.1, but…)

I integrated the Claude API and passed the emotional state into the system prompt for text generation.

HL: 6.1 / SV: 0.56 / TN: 3.5

HL jumped from 4.1 → 6.1, but TN dropped from 4.1 → 3.5.

Why? The API response was too fast. Although a “2‑minute delay” was set, messages returned in 0.3 s, and that delay information wasn’t reflected when passed to the judge. The design only calculated and returned a delay; it didn’t actually wait.

Lesson: A TimingController that only returns a value is meaningless. It must either wait for the specified seconds or embed metadata such as “this reply was sent N minutes later.”

For v3 I adopted the latter approach: adding context like “This message was sent N minutes later” to the system prompt.
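
The metadata approach can be sketched as a helper that turns the computed delay into a sentence and appends it to the prompt. This is an illustration under assumptions: the function names, the rounding to whole minutes, and the sub‑minute wording are mine, not taken from the TimingController itself.

```python
def describe_delay(delay_seconds: float) -> str:
    """Turn a computed delay into a sentence that can be added to a prompt."""
    if delay_seconds < 60:
        return "This message was sent less than a minute later."
    minutes = round(delay_seconds / 60)
    unit = "minute" if minutes == 1 else "minutes"
    return f"This message was sent {minutes} {unit} later."


def with_delay_context(system_prompt: str, delay_seconds: float) -> str:
    """Append the delay sentence to an existing system prompt."""
    return f"{system_prompt.rstrip()}\n\n{describe_delay(delay_seconds)}"
```

The same sentence can be attached to the text handed to the judge, so the delay is visible at scoring time even though no real waiting happened.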

v3 – Reflecting Cultural Context (HL 6.8)

I fed the context_level: 0.85 setting (high‑context culture) from config/ja.json into the system prompt.

HL: 6.8 / SV: 0.50 / TN: 4.5

What changed? I added a rule:

“In Japanese business communication, there is a tendency to avoid direct negation and let the meaning be inferred from context.”

Result: “I’m sorry, but that’s difficult” became “Let me think about it a bit.”
HL + 0.7, TN + 1.0 (thanks to the delay metadata).

SV improved only slightly (0.56 → 0.50); stylistic uniformity remained.
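
One way to wire a numeric context_level into a prompt is a simple threshold that switches the rule on. A sketch under stated assumptions: only the context_level field and the rule text come from this article; the 0.7 cutoff, the helper name, and the base prompt line are hypothetical.

```python
import json

# Rule text quoted from the article; the 0.7 cutoff is an assumed threshold.
HIGH_CONTEXT_RULE = (
    "In Japanese business communication, there is a tendency to avoid "
    "direct negation and let the meaning be inferred from context."
)
HIGH_CONTEXT_THRESHOLD = 0.7


def cultural_rules(config: dict) -> list[str]:
    """Return the prompt rules implied by the config's context_level."""
    level = float(config.get("context_level", 0.0))
    return [HIGH_CONTEXT_RULE] if level >= HIGH_CONTEXT_THRESHOLD else []


# Stand-in for config/ja.json; only this one field is known from the article.
config = json.loads('{"context_level": 0.85}')
# The first line is a placeholder base prompt, not the project's real one.
system_prompt = "\n".join(["You are replying as the persona.", *cultural_rules(config)])
```

Keeping the rule behind a config value means a low‑context locale can reuse the same code path without inheriting the indirectness rule.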

v4 – Filler Insertion & Structural Variation (HL 7.2)

Based on the Ablation Study, I added fillers and structural variation.

HL: 7.2 / SV: 0.50 / TN: 4.5

HL + 0.4. The change worked, but SV stayed stuck at 0.50.
The fillers appeared in the same position each time, and the structure around them did not change. The issue wasn’t a lack of surface randomness but the unchanged overall skeleton.
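
To make the “unchanged skeleton” point concrete, here is a toy illustration. It is not the metric the judge used: `skeleton` is a hypothetical fingerprint (words per sentence), and both sample replies are invented.

```python
import re


def skeleton(text: str) -> tuple:
    """Crude structural fingerprint: the word count of each sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return tuple(len(s.split()) for s in sentences)


def add_leading_filler(text: str, filler: str = "Well,") -> str:
    """Insert a filler at a fixed position: the very start of the reply."""
    return f"{filler} {text}"


reply_a = "I checked the schedule today. Friday works for me."
reply_b = "We reviewed the draft yesterday. Monday suits us fine."

# Both replies share one skeleton before the filler and still share one after it.
same_before = skeleton(reply_a) == skeleton(reply_b)
same_after = skeleton(add_leading_filler(reply_a)) == skeleton(add_leading_filler(reply_b))
```

Both flags come out true: a filler at a fixed position shifts every reply in the same way, so replies that looked alike before still look alike afterwards.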

This hinted that “superficial transformations have limits,” foreshadowing the later decision to freeze the pipeline.

v5 – Banned Phrases + Tone Mirroring (HL 7.7)

The final 0.5‑point gain came from two simple ideas.

Discovering Banned Phrases

Many replies started with “Thank you for your message.” Humans rarely say that after the first exchange, but LLMs do.

"banned_phrases": [
  "Thank you for your message",
  "Please feel free to reach out",
  "Feel free to contact me anytime"
]

I made this list configurable and added a system‑prompt instruction: “Absolutely do not use the following phrases.”
Result: HL + 0.5.
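
A sketch of how the configurable list might be rendered into that instruction. The post‑generation check is an addition of mine, on the assumption that a prompt instruction alone is not always obeyed; both helper names are hypothetical.

```python
# Phrase list copied from the config shown above.
BANNED_PHRASES = [
    "Thank you for your message",
    "Please feel free to reach out",
    "Feel free to contact me anytime",
]


def ban_instruction(phrases: list[str]) -> str:
    """Render the banned-phrase list as a system-prompt instruction."""
    bullets = "\n".join(f"- {phrase}" for phrase in phrases)
    return f"Absolutely do not use the following phrases:\n{bullets}"


def find_banned(text: str, phrases: list[str]) -> list[str]:
    """Return the banned phrases that still appear in a reply (case-insensitive)."""
    lowered = text.lower()
    return [phrase for phrase in phrases if phrase.lower() in lowered]
```

If `find_banned` returns a non‑empty list, the reply can be regenerated or flagged before it reaches the judge.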

In retrospect, the most important discovery of this project was that improving human‑likeness can be more about what to stop doing than what to add.

Tone Mirroring (for EN)

For English evaluations, I instructed the model to “match the user’s tone”:

Match the formality level of the user's message.
If they use casual language, respond casually.
Never open with 'Thanks for reaching out' unless it's the very first message.

This raised English HL from 7 → 8. “Thanks for reaching out” is the English equivalent of “Thank you for your message.”

Final Results

Version	Changes	HL	SV	TN
v1	Returns only parameters	4.1	0.64	4.1
v2	Text generation with Anthropic API	6.1	0.56	3.5
v3	Reflecting cultural context	6.8	0.50	4.5
v4	Filler insertion & structural variation	7.2	0.50	4.5
v5	Banned phrases + tone mirroring	7.7	0.36	5.5
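
The per‑version changes quoted throughout the article can be recomputed from this table. A short sketch; the rows are copied from the table and the `deltas` helper is illustrative:

```python
# (version, HL, SV, TN) rows copied from the results table.
RESULTS = [
    ("v1", 4.1, 0.64, 4.1),
    ("v2", 6.1, 0.56, 3.5),
    ("v3", 6.8, 0.50, 4.5),
    ("v4", 7.2, 0.50, 4.5),
    ("v5", 7.7, 0.36, 5.5),
]


def deltas(rows):
    """Yield (version, dHL, dSV, dTN) for each version relative to the previous one."""
    for prev, cur in zip(rows, rows[1:]):
        # Round to two decimals to hide floating-point noise such as 0.7000000000000002.
        yield (cur[0],) + tuple(round(c - p, 2) for p, c in zip(prev[1:], cur[1:]))


for version, d_hl, d_sv, d_tn in deltas(RESULTS):
    print(f"{version}: HL {d_hl:+.1f}, SV {d_sv:+.2f}, TN {d_tn:+.1f}")
```

Read this way, the v2 step is the largest HL gain (+2.0), while v5 accounts for the largest SV drop (-0.14).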

Honest Retrospective

What Worked Well

Configuring banned phrases – stopping the model from using overly polite, robotic openings gave HL + 0.5 from a simple config change.
Culture‑aware prompting – tailoring the system prompt to high‑context Japanese communication added realism.
Timing metadata – explicitly stating “sent N minutes later” raised TN from 3.5 to 4.5.
Tone mirroring – matching the user’s formality level prevented the model from sounding generic.

What Didn’t Work

Simple filler insertion without changing the surrounding structure left SV stuck at 0.50.
Relying on a “delay calculator” without actually waiting or annotating the output added no value.

Take‑away

Improving human‑likeness is often about removing AI‑specific habits (e.g., over‑politeness, uniform phrasing) rather than just adding more variety. A well‑crafted system prompt that tells the model what not to do can be more powerful than one that tells it what to do.

What Worked Well

“Make it stop” approach – turned out to be more powerful than expected.
Tone mirroring – a simple instruction had a large effect.
Injecting cultural context – the context_level field in ja.json actually worked.

What Didn’t Work Well

SV (style_variation_rate, which in practice tracks stylistic uniformity) improved from 0.64 → 0.36 but narrowly misses the ≤ 0.35 target. Further improvement requires a more fundamental approach than pipeline‑based post‑processing.
TN (Timing Naturalness) sits at 5.5, falling short of the 6.0 target. There’s still room for improvement in how the TimingController’s value is communicated to the LLM.

Reliability of the LLM judge itself – Even if an LLM judges something as “human‑like,” whether actual humans feel the same is a different matter. I regret chasing numbers without conducting a Human Eval.

This reflection led to the later decision to freeze the pipeline.

Summary

HL rose from 4.1 → 7.7 over 5 versions.
The most cost‑effective change was “banned phrases”: simply removing AI‑like stock phrases gave HL + 0.5. Human‑likeness can sometimes be improved by subtraction, not addition.
The biggest lesson: chasing numbers alone isn’t enough. Even with an LLM‑judge score of 7.7, we still need separate verification to confirm that a human would think “a human wrote this.”

Repository:

📄 The research in this article is formally published as a preprint

HumanPersonaBase: A Language‑Agnostic Framework for Human‑Like AI Communication

DOI:
