Dissecting the Humanization Pipeline for AI Text: A 6-Step Ablation Study

Published: 1 month ago (March 28, 2026 at 01:33 AM EDT)

Source: Dev.to

Ablation Study: Which Transformation Steps Really Matter?

“The scores are good. But what’s actually working?”

In the previous article I built a pipeline to make AI‑generated text feel more human‑like and reported benchmark results of Mean Alignment = 0.945 and Distribution Alignment = 0.864. Those numbers look solid, but they don’t tell us which of the six transformation steps are actually contributing and which are just noise.

What I Did

I performed an ablation (removal) study: each step was disabled one‑by‑one and the pipeline was re‑evaluated on a held‑out test set of 500 samples (80 % / 20 % split).
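The ablation procedure can be sketched roughly as follows. This is a minimal illustration, not the article's actual code: `run_pipeline`, `ablate`, the toy steps, and the toy scoring function are all hypothetical stand-ins, since the real transformation steps and alignment metric are not shown here.

```python
from typing import Callable, Dict, List, Optional

Step = Callable[[str], str]

def run_pipeline(text: str, steps: Dict[str, Step], disabled: Optional[str] = None) -> str:
    """Apply every step in insertion order, skipping the one named in `disabled`."""
    for name, step in steps.items():
        if name != disabled:
            text = step(text)
    return text

def ablate(texts: List[str], steps: Dict[str, Step], score: Callable[[str], float]) -> Dict[str, float]:
    """Return the mean score for the full pipeline ("none") and for each single-step removal."""
    results = {}
    for disabled in [None, *steps]:
        outputs = [run_pipeline(t, steps, disabled) for t in texts]
        results[disabled or "none"] = sum(score(o) for o in outputs) / len(outputs)
    return results

# Toy stand-ins so the sketch runs (hypothetical, not the real steps or metric).
toy_steps = {
    "filler": lambda t: "Well, " + t,
    "hedge": lambda t: t + " I think.",
}
toy_score = lambda t: float(t.startswith("Well,"))

scores = ablate(["It works."], toy_steps, toy_score)
# Drop relative to the full pipeline; negative means removing the step hurt.
drops = {name: s - scores["none"] for name, s in scores.items() if name != "none"}
```

Each step is removed in isolation, so the drops describe single-step removals only and say nothing about interactions between steps.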

Below are the results.

Results Table

Disabled Step	Mean Alignment	Distribution Alignment	Mean Drop	Dist. Drop
None (Full Pipeline)	0.945	0.864	—	—
Filler Insertion	0.622	0.569	‑0.323	‑0.296
Long Sentence Splitting	0.751	0.720	‑0.194	‑0.144
Short Sentence Insertion (interjection)	0.763	0.742	‑0.182	‑0.122
Hedge Injection	0.808	0.740	‑0.137	‑0.125
Cushion Injection	0.851	0.779	‑0.094	‑0.085
Self‑Correction Injection	0.944	0.866	‑0.001	+0.001
No Pipeline	0.003	0.000	‑0.942	‑0.864

Key take‑aways:
Disabling filler insertion caused the biggest single collapse (-0.323).
Disabling long-sentence splitting and short-sentence insertion cost more in total than fillers (-0.194 + -0.182 = -0.376), assuming the two drops simply add.
Self‑correction injection had virtually no impact.
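As a sanity check, the drop columns can be recomputed from the rounded scores in the table. The snippet below is only arithmetic on the published numbers; a recomputed value can differ from the table by 0.001 in the last digit (for example the distribution drop for filler insertion), presumably because the published drops were calculated before rounding.

```python
# Scores copied from the results table: (mean alignment, distribution alignment).
full = (0.945, 0.864)
ablated = {
    "Filler Insertion": (0.622, 0.569),
    "Long Sentence Splitting": (0.751, 0.720),
    "Short Sentence Insertion": (0.763, 0.742),
    "Hedge Injection": (0.808, 0.740),
    "Cushion Injection": (0.851, 0.779),
    "Self-Correction Injection": (0.944, 0.866),
}

# Drop = ablated score minus full-pipeline score (negative means worse).
drops = {
    step: (round(mean - full[0], 3), round(dist - full[1], 3))
    for step, (mean, dist) in ablated.items()
}

# Summed mean drop of the two sentence-length steps.
combined = round(
    drops["Long Sentence Splitting"][0] + drops["Short Sentence Insertion"][0], 3
)

for step, (mean_drop, dist_drop) in drops.items():
    print(f"{step:28s} {mean_drop:+.3f} {dist_drop:+.3f}")
```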

Deep Dive into the Surprises

1️⃣ Filler Insertion – The Biggest Contributor (‑0.323)

Metric	Human Text	AI Text	Cohen’s d
Filler Rate (per sentence)	0.165	0.001	1.755 (very large)

Humans regularly use fillers such as “Well,” “You know,” or “Basically,” (a rate of 0.165, or roughly 1 per 6 sentences). AI text almost never does, which makes filler rate the strongest single discriminator.
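For readers unfamiliar with the effect-size column, here is a minimal sketch of Cohen's d using the pooled sample standard deviation. The article does not say which variant of d was used, so the pooled form is an assumption, and the per-document filler rates below are made up purely for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d: difference of means divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Made-up filler rates per document (illustrative only, not the study's data).
human_rates = [0.2, 0.1, 0.3, 0.0, 0.2]
ai_rates = [0.0, 0.0, 0.1, 0.0, 0.0]

d = cohens_d(human_rates, ai_rates)
print(f"Cohen's d = {d:.3f}")
```

A positive d here means the first group (human) has the higher mean; swapping the arguments flips the sign.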

The False‑Positive Pitfall

My initial implementation flagged any occurrence of the word “like” (\blike\b) as a filler. This mistakenly counted “I like pizza” and inflated the filler rate to > 0.3, leading me to (incorrectly) conclude that humans over‑use fillers.

Fix: Switch to position‑dependent detection:

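# Bad: matches the word anywhere, so "I like pizza" counts as a filler
FILLER_PATTERNS = [r"\blike\b", r"\bso\b", r"\bwell\b"]

# Better: these words count only at sentence start, followed by a comma.
# The patterns are lowercase, so compile them with re.IGNORECASE to match "Well,".
FILLER_START_PATTERNS = [r"^(?:well|so|like)\s*,"]
# Phrases counted as fillers wherever they appear
FILLER_ALWAYS = [r"\byou know\b", r"\bi mean\b", r"\bbasically\b"]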

Lesson: In quantitative NLP work, eliminate regex false positives before drawing conclusions.
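To make the position-dependent idea concrete, here is a small runnable sketch built on the same patterns. The function names `has_filler` and `filler_rate` are hypothetical, and the rate is assumed to be the fraction of sentences containing at least one filler; the article may count fillers differently.

```python
import re

# Words that count as fillers only at sentence start, followed by a comma.
FILLER_START = re.compile(r"^(?:well|so|like)\s*,", re.IGNORECASE)
# Phrases counted as fillers wherever they appear in the sentence.
FILLER_ALWAYS = re.compile(r"\b(?:you know|i mean|basically)\b", re.IGNORECASE)

def has_filler(sentence: str) -> bool:
    """True if the sentence opens with a filler word or contains a filler phrase."""
    s = sentence.strip()
    return bool(FILLER_START.search(s) or FILLER_ALWAYS.search(s))

def filler_rate(sentences) -> float:
    """Fraction of sentences with at least one filler (0.0 for an empty list)."""
    if not sentences:
        return 0.0
    return sum(has_filler(s) for s in sentences) / len(sentences)

sample = [
    "I like pizza.",            # verb "like": not a filler
    "Well, that depends.",      # sentence-initial filler
    "So we shipped it.",        # "so" without a comma: not counted
    "It was, you know, fine.",  # filler phrase mid-sentence
]
rate = filler_rate(sample)
print(rate)
```

The comma requirement trades recall for precision: a comma-less "So we shipped it" is skipped even if a human reader would hear it as a filler.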

2️⃣ Self‑Correction Injection – A Failure (‑0.001)

Self-correction markers (“wait, I mean…”, “sorry, what I meant was…”) barely appear in human business communication (0.19 % of sentences; metric weight = 0.097). With only 500 test samples, the confidence interval on that rate ([0.001, 0.004]) is wide relative to the estimate itself, so any effect of the step is buried in noise.

Result: The step was removed from the final pipeline.
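To see why a rate this small is hard to pin down, here is a sketch of a Wilson score interval for a binomial proportion. The article does not state how its interval was computed or how many sentences the test set contains, so both the method and the counts below (5 self-corrections in 2,500 sentences) are assumptions for illustration.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical counts: 5 self-corrections in 2,500 sentences (a rate of 0.2%).
low, high = wilson_interval(5, 2500)
print(f"rate = {5 / 2500:.4f}, interval = [{low:.4f}, {high:.4f}]")
```

With so few positive events, the upper bound ends up several times the lower bound, which is the kind of spread that makes a small ablation drop indistinguishable from noise.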

3️⃣ Long‑Sentence Splitting & Short‑Sentence Insertion

Step	Primary Effect	Mechanism
Long Sentence Splitting	Reduces Words / Sentence	Cuts average length from 18 → 13 words
Short Sentence Insertion	Increases Sentence‑Length CV	Adds brief interjections (“Hmm.”, “Got it.”)

AI tends to produce uniformly long sentences; humans mix short acknowledgments with longer explanations. Together these steps contribute ‑0.376, surpassing the filler contribution.
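The two sentence-length metrics can be sketched as below. This assumes a naive sentence split on terminal punctuation and the population standard deviation for the coefficient of variation (CV); the article does not specify either choice, and the example texts are invented.

```python
import re
from statistics import mean, pstdev

def sentence_lengths(text: str):
    """Word counts per sentence, using a naive split on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def length_stats(text: str):
    """Return (mean words per sentence, coefficient of variation of sentence length)."""
    lengths = sentence_lengths(text)
    avg = mean(lengths)
    cv = pstdev(lengths) / avg
    return avg, cv

uniform = "The system processes each request in order. The cache stores every result for reuse."
mixed = "Got it. " + uniform  # same text with one short interjection in front

uniform_avg, uniform_cv = length_stats(uniform)
mixed_avg, mixed_cv = length_stats(mixed)
print(uniform_avg, uniform_cv)
print(round(mixed_avg, 2), round(mixed_cv, 2))
```

Adding one short interjection lowers the mean length and raises the CV at the same time, which is why the two steps move related but distinct metrics.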

Metric Weights (Derived from Discriminative Power)

Metric	Cohen’s d	Weight
Filler Rate	1.755	1.88
Words / Sentence	1.356	1.45
Sentence‑Length CV	1.086	1.16
Hedge Rate	0.818	0.87
Cushion Rate	0.506	0.54
Self‑Correction Rate	0.091	0.10

By the usual convention, d > 0.8 denotes a large effect. Four metrics clear that bar (hedge rate only just, at 0.818), and the three with d > 1 (filler rate, words/sentence, and sentence-length CV) dominate human/AI discrimination.
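The article does not give the formula behind the Weight column. One reading that reproduces every listed value is dividing each Cohen's d by the mean d across the six metrics, so that the weights average to 1. The sketch below checks that reading; treat it as an inference from the table, not the confirmed method.

```python
from statistics import mean

# Cohen's d values copied from the table above.
cohens_d = {
    "Filler Rate": 1.755,
    "Words / Sentence": 1.356,
    "Sentence-Length CV": 1.086,
    "Hedge Rate": 0.818,
    "Cushion Rate": 0.506,
    "Self-Correction Rate": 0.091,
}

# Assumed normalization: weight = d / mean(d).
scale = mean(list(cohens_d.values()))
weights = {name: round(d / scale, 2) for name, d in cohens_d.items()}

for name, w in weights.items():
    print(f"{name:22s} {w:.2f}")
```

Under this reading the unrounded self-correction weight is about 0.097, which matches the figure quoted in the self-correction section.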

Limitations & Future Work

Context‑dependence: The current pipeline injects fillers and hedges at a fixed probability. In reality, filler usage varies by topic (more in casual conversation, less in technical writing). This mismatch caused two metrics to fail the KS test.
Automated vs. Human Evaluation: The DPO benchmark rewards superficial feature matching (e.g., presence of fillers or typos) but does not guarantee that a human reader feels the text is human‑written. Human evaluation remains essential.
Sample Size: With only 500 test samples, rare phenomena (e.g., self‑corrections) are hard to assess reliably.
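For context on the KS test mentioned in the first limitation, here is a dependency-free sketch of the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical distribution functions. It omits the p-value, and the article does not say which metrics failed or what threshold was used; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead. The sample values are invented.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: max absolute difference between the two ECDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data point.
    for x in a_sorted + b_sorted:
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d

# Invented per-document filler rates: one varied, one clustered at a fixed value.
human_like = [0.00, 0.05, 0.10, 0.20, 0.35]
fixed_prob = [0.15, 0.16, 0.17, 0.17, 0.18]
print(ks_statistic(human_like, fixed_prob))
```

A pipeline that injects fillers at a fixed probability tends to produce a narrow distribution like `fixed_prob`, which can match the human mean while still differing in shape.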

Bottom‑Line Ranking of Steps

Rank	Step	Contribution (Mean Drop)	Take‑away
1	Filler Insertion	‑0.323	Most critical – watch for false positives
2	Long Sentence Splitting	‑0.194	Aligns words‑per‑sentence to human level
3	Short Sentence Insertion	‑0.182	Introduces natural sentence‑length variation
4	Hedge Injection	‑0.137	Adds ambiguity, modest impact
5	Cushion Injection	‑0.094	Inserts polite prefacing (“Sure,”, “Of course,”)
6	Self‑Correction Injection	‑0.001	Effectively zero – removed from final design

Resources

Code & Data:
Full research article: (link to next article)

Status: Formally published as a preprint

Title: HumanPersonaBase: A Language-Agnostic Framework for Human-Like AI Communication

DOI: 10.5281/zenodo.19273577