Dissecting the Humanization Pipeline for AI Text: A 6-Step Ablation Study

Published: (March 28, 2026 at 01:33 AM EDT)
Source: Dev.to

Ablation Study: Which Transformation Steps Really Matter?

“The scores are good. But what’s actually working?”

In the previous article I built a pipeline to make AI‑generated text feel more human‑like and reported benchmark results of Mean Alignment = 0.945 and Distribution Alignment = 0.864. Those numbers look solid, but they don’t tell us which of the six transformation steps are actually contributing and which are just noise.

What I Did

I performed an ablation (removal) study: each step was disabled one‑by‑one and the pipeline was re‑evaluated on a held‑out test set of 500 samples (80 % / 20 % split).
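The leave-one-out procedure described above can be sketched roughly as follows. The step names mirror the article, but `run_ablation`, `toy_evaluate`, and the gain values are hypothetical stand-ins chosen for illustration; they are not the author's code or results.

```python
from typing import Callable, Dict, List

# Step names as listed in the article.
STEPS: List[str] = [
    "filler_insertion",
    "long_sentence_splitting",
    "short_sentence_insertion",
    "hedge_injection",
    "cushion_injection",
    "self_correction_injection",
]


def run_ablation(
    steps: List[str],
    evaluate: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Return the score change caused by disabling each step in turn."""
    baseline = evaluate(steps)  # score of the full pipeline
    drops = {}
    for disabled in steps:
        enabled = [s for s in steps if s != disabled]
        drops[disabled] = round(evaluate(enabled) - baseline, 3)
    return drops


# Hypothetical evaluator: pretends every enabled step adds a fixed, made-up
# amount. A real evaluator would run the pipeline on the test set and score it.
TOY_GAIN = {name: 0.1 * (len(STEPS) - i) for i, name in enumerate(STEPS)}


def toy_evaluate(enabled: List[str]) -> float:
    return sum(TOY_GAIN[s] for s in enabled)


drops = run_ablation(STEPS, toy_evaluate)
# Most negative drop first, i.e. the step whose removal hurts the most.
ranking = sorted(drops, key=drops.get)
```

Passing the evaluator in as a function keeps the loop independent of how alignment is actually scored, so the same harness could report either metric.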

Below are the results.

Results Table

| Disabled Step | Mean Alignment | Distribution Alignment | Mean Drop | Dist. Drop |
|---|---|---|---|---|
| None (Full Pipeline) | 0.945 | 0.864 | | |
| Filler Insertion | 0.622 | 0.569 | -0.323 | -0.296 |
| Long Sentence Splitting | 0.751 | 0.720 | -0.194 | -0.144 |
| Short Sentence Insertion (interjection) | 0.763 | 0.742 | -0.182 | -0.122 |
| Hedge Injection | 0.808 | 0.740 | -0.137 | -0.125 |
| Cushion Injection | 0.851 | 0.779 | -0.094 | -0.085 |
| Self-Correction Injection | 0.944 | 0.866 | -0.001 | +0.001 |
| No Pipeline | 0.003 | 0.000 | -0.942 | -0.864 |

Key take‑aways:

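  1. Disabling filler insertion caused the biggest single collapse (-0.323).
  2. Long-sentence splitting and short-sentence insertion both reshape sentence length; their individual drops sum to -0.376, more than filler insertion alone. Each drop comes from a separate ablation run, so the sum is only a rough indication.
  3. Disabling self-correction injection had virtually no impact (-0.001).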

Deep Dive into the Surprises

1️⃣ Filler Insertion – The Biggest Contributor (‑0.323)

| Metric | Human Text | AI Text | Cohen’s d |
|---|---|---|---|
| Filler Rate (per sentence) | 0.165 | 0.001 | 1.755 (very large) |

Humans regularly use fillers such as “Well,”, “You know,” or “Basically,” (≈ 1 per 6 sentences). AI almost never does, making filler rate the strongest single discriminator.
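For readers unfamiliar with the effect size used here, Cohen's d is the difference between two group means divided by a standard deviation. The sketch below uses the pooled sample standard deviation, which is an assumption (the source does not say which variant was used), and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up per-document filler rates, for illustration only.
human_rates = [0.10, 0.20, 0.15, 0.25, 0.05]
ai_rates = [0.00, 0.01, 0.00, 0.02, 0.00]
d = cohens_d(human_rates, ai_rates)
```

Because d is expressed in units of standard deviation, it allows metrics on very different scales (a rate versus a word count) to be compared directly.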

The False‑Positive Pitfall

My initial implementation flagged any occurrence of the word “like” (\blike\b) as a filler. This mistakenly counted “I like pizza” and inflated the filler rate to > 0.3, leading me to (incorrectly) conclude that humans over‑use fillers.

Fix: Switch to position‑dependent detection:

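```python
import re

# NG: matches every occurrence, so "I like pizza" is counted as a filler
FILLER_PATTERNS = [r"\blike\b", r"\bso\b", r"\bwell\b"]

# OK: these words count only at sentence start, followed by a comma
FILLER_START_PATTERNS = [r"^(?:well|so|like)\s*,"]
# Phrases treated as fillers wherever they appear in the sentence
FILLER_ALWAYS = [r"\byou know\b", r"\bi mean\b", r"\bbasically\b"]
# All patterns are lowercase: match with re.IGNORECASE (or lowercase the
# text first) so that a sentence starting with "Well," is detected.
```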

Lesson: In quantitative NLP work, eliminate regex false positives before drawing conclusions.
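A possible way to turn the position-dependent patterns into a per-sentence filler rate is sketched below. It assumes the rate is the fraction of sentences containing at least one filler, which the source does not define exactly; `is_filler_sentence` and `filler_rate` are hypothetical helper names.

```python
import re
from typing import List

# Same patterns as above, compiled case-insensitively so "Well," matches.
FILLER_START = re.compile(r"^(?:well|so|like)\s*,", re.IGNORECASE)
FILLER_ANYWHERE = re.compile(r"\b(?:you know|i mean|basically)\b", re.IGNORECASE)


def is_filler_sentence(sentence: str) -> bool:
    """True if the sentence opens with a filler or contains a filler phrase."""
    s = sentence.strip()
    return bool(FILLER_START.search(s) or FILLER_ANYWHERE.search(s))


def filler_rate(sentences: List[str]) -> float:
    """Fraction of sentences that contain at least one filler (assumed definition)."""
    if not sentences:
        return 0.0
    return sum(is_filler_sentence(s) for s in sentences) / len(sentences)


sentences = [
    "I like pizza.",                       # verb "like": not a filler
    "Well, that depends on the budget.",   # sentence-initial filler
    "So we shipped it on Friday.",         # "So" without a comma: not counted
    "It was, you know, a long week.",      # filler phrase mid-sentence
]
rate = filler_rate(sentences)
```

Keeping a few known negatives such as "I like pizza." in a small test list is a cheap way to catch the false positives described above before they reach the statistics.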

2️⃣ Self‑Correction Injection – A Failure (‑0.001)

Self‑correction markers (“wait, I mean…”, “sorry, what I meant was…”) barely appear in human business communication (0.19 % / sentence, weight = 0.097). With only 500 samples, the confidence interval is wide ([0.001, 0.004]), burying any effect in noise.
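To get a feel for why such a rare event is hard to measure, a confidence interval for a proportion can be computed directly. The sketch below uses the Wilson score interval, which is an assumption (the source does not name its method), and the counts are hypothetical; it is not meant to reproduce the interval quoted above.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95 %)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 5 self-corrections observed in 2,500 sentences (0.2 %).
hits, total = 5, 2500
low, high = wilson_interval(hits, total)
# Interval width relative to the point estimate itself.
relative_width = (high - low) / (hits / total)
```

With these made-up counts the interval is wider than the point estimate, which is the situation in which a small ablation effect cannot be separated from noise.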

Result: The step was removed from the final pipeline.

3️⃣ Long‑Sentence Splitting & Short‑Sentence Insertion

| Step | Primary Effect | Mechanism |
|---|---|---|
| Long Sentence Splitting | Reduces Words / Sentence | Cuts average length from 18 → 13 words |
| Short Sentence Insertion | Increases Sentence-Length CV | Adds brief interjections (“Hmm.”, “Got it.”) |

AI tends to produce uniformly long sentences; humans mix short acknowledgments with longer explanations. Together these steps contribute ‑0.376, surpassing the filler contribution.
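The two metrics these steps target can be computed with a few lines of code. This is a minimal sketch that assumes a naive punctuation-based sentence splitter and a population standard deviation for the coefficient of variation (CV); the source specifies neither, and the sample texts are invented.

```python
import re
from statistics import mean, pstdev
from typing import Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_length_stats(text: str) -> Dict[str, float]:
    """Mean words per sentence and the CV of sentence length."""
    lengths = [len(s.split()) for s in split_sentences(text)]
    if not lengths:
        return {"words_per_sentence": 0.0, "length_cv": 0.0}
    avg = mean(lengths)
    # Coefficient of variation = standard deviation / mean.
    cv = pstdev(lengths) / avg if avg else 0.0
    return {"words_per_sentence": avg, "length_cv": cv}


uniform = "The report covers all of the data. The team reviewed all of the figures."
mixed = "Got it. The report covers every figure the team reviewed last week. Hmm."

uniform_stats = sentence_length_stats(uniform)
mixed_stats = sentence_length_stats(mixed)
```

The `uniform` text has sentences of identical length, so its CV is zero, while the `mixed` text combines short acknowledgments with one longer sentence and therefore has a higher CV.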

Metric Weights (Derived from Discriminative Power)

| Metric | Cohen’s d | Weight |
|---|---|---|
| Filler Rate | 1.755 | 1.88 |
| Words / Sentence | 1.356 | 1.45 |
| Sentence-Length CV | 1.086 | 1.16 |
| Hedge Rate | 0.818 | 0.87 |
| Cushion Rate | 0.506 | 0.54 |
| Self-Correction Rate | 0.091 | 0.10 |

By Cohen's convention, d > 0.8 denotes a large effect. Four metrics clear that bar: filler rate, words/sentence, and sentence-length CV dominate human/AI discrimination, while hedge rate (0.818) only just qualifies.
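The source does not spell out how the weights are derived from the effect sizes. The published values are, however, consistent with scaling each d so that the weights sum to the number of metrics (an average weight of 1.0). The sketch below shows that inferred rule; `weights_from_effect_sizes` is a hypothetical name, and the rule itself is an assumption rather than the author's documented method.

```python
from typing import Dict

# Cohen's d values taken from the table above.
COHENS_D: Dict[str, float] = {
    "filler_rate": 1.755,
    "words_per_sentence": 1.356,
    "sentence_length_cv": 1.086,
    "hedge_rate": 0.818,
    "cushion_rate": 0.506,
    "self_correction_rate": 0.091,
}


def weights_from_effect_sizes(effect_sizes: Dict[str, float]) -> Dict[str, float]:
    """Scale effect sizes so the weights sum to the number of metrics."""
    scale = len(effect_sizes) / sum(effect_sizes.values())
    return {name: round(d * scale, 2) for name, d in effect_sizes.items()}


weights = weights_from_effect_sizes(COHENS_D)
```

Under this rule a metric with an average effect size gets weight 1.0, so the weights read as "how much more or less than average" each metric counts in the alignment score.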

Limitations & Future Work

  • Context‑dependence: The current pipeline injects fillers and hedges at a fixed probability. In reality, filler usage varies by topic (more in casual conversation, less in technical writing). This mismatch caused two metrics to fail the KS test.
  • Automated vs. Human Evaluation: The DPO benchmark rewards superficial feature matching (e.g., presence of fillers or typos) but does not guarantee that a human reader feels the text is human‑written. Human evaluation remains essential.
  • Sample Size: With only 500 test samples, rare phenomena (e.g., self‑corrections) are hard to assess reliably.
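The KS test mentioned in the first limitation compares whole distributions, not just means, which is why a fixed injection probability can match the average filler rate and still fail. Below is a minimal pure-Python sketch of the two-sample KS statistic with the standard large-sample critical value for a significance level of about 0.05. The sample rates are made up, and the source does not say which KS variant or threshold it used.

```python
from bisect import bisect_right
from math import sqrt
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_rejects(a: Sequence[float], b: Sequence[float], c_alpha: float = 1.358) -> bool:
    """Large-sample decision rule; c_alpha = 1.358 corresponds to alpha of about 0.05."""
    n, m = len(a), len(b)
    return ks_statistic(a, b) > c_alpha * sqrt((n + m) / (n * m))


# Made-up per-document filler rates: context-dependent vs. fixed probability.
human_like = [0.00, 0.02, 0.05, 0.10, 0.20, 0.30, 0.35, 0.40]
fixed_prob = [0.15, 0.16, 0.17, 0.16, 0.15, 0.17, 0.16, 0.16]
gap = ks_statistic(human_like, fixed_prob)
```

The two toy samples have similar means but very different spreads, which is the kind of mismatch a mean-based score hides and a distribution-level test exposes. The decision rule is an asymptotic approximation, so it should only be trusted with sample sizes far larger than this toy example.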

Bottom‑Line Ranking of Steps

| Rank | Step | Contribution (Mean Drop) | Take-away |
|---|---|---|---|
| 1 | Filler Insertion | -0.323 | Most critical – watch for false positives |
| 2 | Long Sentence Splitting | -0.194 | Aligns words-per-sentence to human level |
| 3 | Short Sentence Insertion | -0.182 | Introduces natural sentence-length variation |
| 4 | Hedge Injection | -0.137 | Adds ambiguity, modest impact |
| 5 | Cushion Injection | -0.094 | Inserts polite prefacing (“Sure,”, “Of course,”) |
| 6 | Self-Correction Injection | -0.001 | Effectively zero – removed from final design |

Resources

  • Code & Data:
  • Full research article: (link to next article)

Status: Formally published as a preprint

Title: HumanPersonaBase: A Language-Agnostic Framework for Human-Like AI Communication

DOI: 10.5281/zenodo.19273577
