Dissecting the Humanization Pipeline for AI Text: A 6-Step Ablation Study
Source: Dev.to
Ablation Study: Which Transformation Steps Really Matter?
“The scores are good. But what’s actually working?”
In the previous article I built a pipeline to make AI‑generated text feel more human‑like and reported benchmark results of Mean Alignment = 0.945 and Distribution Alignment = 0.864. Those numbers look solid, but they don’t tell us which of the six transformation steps are actually contributing and which are just noise.
What I Did
I performed an ablation (removal) study: each step was disabled one at a time, and the pipeline was re-evaluated on a held-out test set of 500 samples (80/20 train/test split).
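A leave-one-out ablation loop can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: `run_pipeline`, `ablate`, the toy steps, and `toy_score` are all hypothetical names, and the toy scorer stands in for the real alignment metrics.

```python
from typing import Callable, Dict, List, Optional

Step = Callable[[str], str]
Scorer = Callable[[List[str]], float]


def run_pipeline(text: str, steps: Dict[str, Step], disabled: Optional[str] = None) -> str:
    """Apply every step in insertion order, skipping the one named `disabled`."""
    for name, step in steps.items():
        if name != disabled:
            text = step(text)
    return text


def ablate(texts: List[str], steps: Dict[str, Step], score: Scorer) -> Dict[str, float]:
    """Score the full pipeline, then each leave-one-out variant."""
    results = {"none": score([run_pipeline(t, steps) for t in texts])}
    for name in steps:
        outputs = [run_pipeline(t, steps, disabled=name) for t in texts]
        results[name] = score(outputs)
    return results


# Toy stand-ins, purely for illustration (not the real transformations or metric).
toy_steps: Dict[str, Step] = {
    "filler_insertion": lambda t: "Well, " + t,
    "short_sentence_insertion": lambda t: t + " Got it.",
}


def toy_score(outputs: List[str]) -> float:
    """Hypothetical scorer: mean output length in characters."""
    return sum(len(o) for o in outputs) / len(outputs)


results = ablate(["The report is ready."], toy_steps, toy_score)
drops = {name: value - results["none"] for name, value in results.items()}
```

Keeping the steps in an ordered mapping means a step can be disabled by name without touching the others, which is what makes each row of an ablation table directly comparable to the full run.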
Below are the results.
Results Table
| Disabled Step | Mean Alignment | Distribution Alignment | Mean Drop | Dist. Drop |
|---|---|---|---|---|
| None (Full Pipeline) | 0.945 | 0.864 | — | — |
| Filler Insertion | 0.622 | 0.569 | ‑0.323 | ‑0.296 |
| Long Sentence Splitting | 0.751 | 0.720 | ‑0.194 | ‑0.144 |
| Short Sentence Insertion (interjection) | 0.763 | 0.742 | ‑0.182 | ‑0.122 |
| Hedge Injection | 0.808 | 0.740 | ‑0.137 | ‑0.125 |
| Cushion Injection | 0.851 | 0.779 | ‑0.094 | ‑0.085 |
| Self‑Correction Injection | 0.944 | 0.866 | ‑0.001 | +0.001 |
| No Pipeline | 0.003 | 0.000 | ‑0.942 | ‑0.864 |
Key take‑aways:
- Disabling filler insertion caused the biggest single-step collapse (-0.323).
- Disabling long-sentence splitting and short-sentence insertion costs more in total than disabling fillers (-0.194 + -0.182 = -0.376).
- Self‑correction injection had virtually no impact.
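The drops and the ranking can be re-derived from the table values directly. The short script below only redoes that arithmetic; the variable names are mine, and the numbers are copied from the results table.

```python
FULL_MEAN = 0.945  # mean alignment of the full pipeline

# (mean alignment, distribution alignment) with the named step disabled
ABLATED = {
    "Filler Insertion": (0.622, 0.569),
    "Long Sentence Splitting": (0.751, 0.720),
    "Short Sentence Insertion": (0.763, 0.742),
    "Hedge Injection": (0.808, 0.740),
    "Cushion Injection": (0.851, 0.779),
    "Self-Correction Injection": (0.944, 0.866),
}

# Drop = ablated score minus full-pipeline score (negative means the step helped).
mean_drop = {step: round(mean - FULL_MEAN, 3) for step, (mean, _) in ABLATED.items()}
ranked = sorted(mean_drop, key=mean_drop.get)  # most negative drop first

combined = round(
    mean_drop["Long Sentence Splitting"] + mean_drop["Short Sentence Insertion"], 3
)

for step in ranked:
    print(f"{step:28s} {mean_drop[step]:+.3f}")
print(f"{'Splitting + short insertion':28s} {combined:+.3f}")
```

Adding two leave-one-out drops treats the steps as independent. Both steps act on sentence length, so the effect of disabling both at once could differ from the sum; only a joint ablation run would confirm it. Recomputing the distribution drops the same way can differ from the table by 0.001 in the last digit, presumably because the table was built from unrounded scores.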
Deep Dive into the Surprises
1️⃣ Filler Insertion – The Biggest Contributor (‑0.323)
| Metric | Human Text | AI Text | Cohen’s d |
|---|---|---|---|
| Filler Rate (per sentence) | 0.165 | 0.001 | 1.755 (very large) |
Humans regularly use fillers such as “Well,”, “You know,” or “Basically,” (≈ 1 per 6 sentences). AI almost never does, making filler rate the strongest single discriminator.
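For reference, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below uses the standard pooled-variance form; whether the original analysis used exactly this variant is an assumption, and the sample values are made up for illustration.

```python
from math import inf, sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled (sample) standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    diff = mean(group_a) - mean(group_b)
    if pooled_var == 0:
        # No spread in either group: d is undefined unless the means agree.
        return 0.0 if diff == 0 else inf
    return diff / sqrt(pooled_var)


# Hypothetical per-document filler rates, NOT the study's data.
human_rates = [0.10, 0.25, 0.15, 0.20, 0.12]
ai_rates = [0.00, 0.00, 0.01, 0.00, 0.00]
print(f"d = {cohens_d(human_rates, ai_rates):.2f}")
```

Because d is scaled by the spread within each group, a tiny AI filler rate with almost no variance pushes the effect size up sharply, which is consistent with filler rate ranking first.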
The False‑Positive Pitfall
My initial implementation flagged every occurrence of the word “like” (\blike\b) as a filler. That counted the verb in “I like pizza” as a filler and inflated the measured human filler rate to > 0.3, which led me to conclude, incorrectly, that humans overuse fillers.
Fix: Switch to position‑dependent detection:
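# Bad: riddled with false positives
FILLER_PATTERNS = [r"\blike\b", r"\bso\b", r"\bwell\b"]
# Good: match these words only at the start of a sentence, followed by a comma
FILLER_START_PATTERNS = [r"^(?:well|so|like)\s*,"]
# Multi-word or unambiguous fillers are matched anywhere in the sentence.
# All patterns are lowercase, so match case-insensitively or lowercase the input first.
FILLER_ALWAYS = [r"\byou know\b", r"\bi mean\b", r"\bbasically\b"]

Lesson: In quantitative NLP work, eliminate regex false positives before drawing conclusions.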
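To make the difference concrete, here is a small runnable sketch built on the same patterns. The sentence splitter, the helper names, and the definition of filler rate as "fraction of sentences containing at least one filler" are my assumptions; the original code may count differently.

```python
import re

# Naive version: any occurrence of the word counts as a filler.
NAIVE = re.compile(r"\b(?:like|so|well)\b", re.IGNORECASE)

# Position-dependent version: sentence-initial word followed by a comma.
FILLER_START = re.compile(r"^(?:well|so|like)\s*,", re.IGNORECASE)
FILLER_ALWAYS = re.compile(r"\b(?:you know|i mean|basically)\b", re.IGNORECASE)


def split_sentences(text):
    """Very rough splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def has_filler(sentence):
    return bool(FILLER_START.search(sentence) or FILLER_ALWAYS.search(sentence))


def filler_rate(text):
    """Fraction of sentences that contain at least one detected filler."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(has_filler(s) for s in sentences) / len(sentences)


sample = "Well, it works. I like pizza."
naive_hits = [s for s in split_sentences(sample) if NAIVE.search(s)]
strict_hits = [s for s in split_sentences(sample) if has_filler(s)]
```

On this sample the naive pattern flags both sentences, while the position-dependent version flags only the first. The always-on patterns are not free of false positives either: a literal question such as "Do you know the answer?" still matches `you know`, so spot-checking matches by hand remains worthwhile.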
2️⃣ Self‑Correction Injection – A Failure (‑0.001)
Self-correction markers (“wait, I mean…”, “sorry, what I meant was…”) barely appear in human business communication (a rate of 0.19 % per sentence; metric weight = 0.097). With only 500 samples, the confidence interval is wide ([0.001, 0.004]), so any effect is buried in noise.
Result: The step was removed from the final pipeline.
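To get a feel for why such a rare event is hard to measure, a Wilson score interval for a proportion is one common choice. The article does not say which interval method was used or how many sentences the 500 samples contain, so both the method and the counts below are assumptions chosen only for illustration.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95 %)."""
    if n == 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 4 self-corrections in 2000 sentences (rate 0.2 %).
low, high = wilson_interval(4, 2000)
print(f"rate 0.2 %, 95 % interval: [{low:.4f}, {high:.4f}]")
```

With only a handful of observed events, the upper bound sits several times above the lower bound, so a change of the size a single pipeline step could cause is not distinguishable from sampling noise.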
3️⃣ Long‑Sentence Splitting & Short‑Sentence Insertion
| Step | Primary Effect | Mechanism |
|---|---|---|
| Long Sentence Splitting | Reduces Words / Sentence | Cuts average length from 18 → 13 words |
| Short Sentence Insertion | Increases Sentence‑Length CV | Adds brief interjections (“Hmm.”, “Got it.”) |
AI tends to produce uniformly long sentences; humans mix short acknowledgments with longer explanations. Together these steps contribute ‑0.376, surpassing the filler contribution.
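Both metrics in the table are easy to compute from sentence lengths. The sketch below assumes a rough punctuation-based sentence splitter, whitespace tokenization, and a CV defined as population standard deviation over mean; the original implementation may differ on each point.

```python
import re
from statistics import mean, pstdev


def sentence_lengths(text):
    """Word counts per sentence, using a rough punctuation-based splitter."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def words_per_sentence(text):
    lengths = sentence_lengths(text)
    return mean(lengths) if lengths else 0.0


def sentence_length_cv(text):
    """Coefficient of variation: standard deviation divided by the mean."""
    lengths = sentence_lengths(text)
    if not lengths or mean(lengths) == 0:
        return 0.0
    return pstdev(lengths) / mean(lengths)


uniform = "One two three. Four five six."
mixed = "Hmm. One two three four five."
print(sentence_length_cv(uniform), sentence_length_cv(mixed))
```

The two samples have the same average length (3 words per sentence), yet only the second has a non-zero CV. That is why splitting long sentences and inserting short ones target different metrics even though both change sentence length.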
Metric Weights (Derived from Discriminative Power)
| Metric | Cohen’s d | Weight |
|---|---|---|
| Filler Rate | 1.755 | 1.88 |
| Words / Sentence | 1.356 | 1.45 |
| Sentence‑Length CV | 1.086 | 1.16 |
| Hedge Rate | 0.818 | 0.87 |
| Cushion Rate | 0.506 | 0.54 |
| Self‑Correction Rate | 0.091 | 0.10 |
An effect size of d > 0.8 is conventionally considered large. Four metrics clear that bar, with hedge rate (0.818) only just above it, so filler rate, words/sentence, and sentence-length CV dominate human/AI discrimination.
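The article does not spell out how the weights are derived from Cohen's d. The published values are reproduced by dividing each d by the mean d across the six metrics, so the weights average to 1. Treat that formula as an inference from the numbers, not a confirmed description of the method.

```python
COHENS_D = {
    "Filler Rate": 1.755,
    "Words / Sentence": 1.356,
    "Sentence-Length CV": 1.086,
    "Hedge Rate": 0.818,
    "Cushion Rate": 0.506,
    "Self-Correction Rate": 0.091,
}

# Assumed rule: weight = d / mean(d), rounded to two decimals.
mean_d = sum(COHENS_D.values()) / len(COHENS_D)
weights = {metric: round(d / mean_d, 2) for metric, d in COHENS_D.items()}

for metric, weight in weights.items():
    print(f"{metric:22s} d={COHENS_D[metric]:.3f}  weight={weight:.2f}")
```

Under this rule the self-correction weight is about 0.097 before rounding, which matches the value quoted in the self-correction section and rounds to the 0.10 shown in the table.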
Limitations & Future Work
- Context‑dependence: The current pipeline injects fillers and hedges at a fixed probability. In reality, filler usage varies by topic (more in casual conversation, less in technical writing). This mismatch caused two metrics to fail the KS test.
- Automated vs. Human Evaluation: The DPO benchmark rewards superficial feature matching (e.g., presence of fillers or typos) but does not guarantee that a human reader feels the text is human‑written. Human evaluation remains essential.
- Sample Size: With only 500 test samples, rare phenomena (e.g., self‑corrections) are hard to assess reliably.
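For readers unfamiliar with the KS test mentioned above, the two-sample Kolmogorov-Smirnov statistic is the largest gap between two empirical cumulative distributions. The pure-Python sketch below computes that statistic and compares it against the usual large-sample critical value; the article does not say which metrics failed or how the test was configured, so the data here is hypothetical.

```python
from bisect import bisect_right
from math import sqrt


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gaps = (
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
    return max(gaps)


def ks_rejects(sample_a, sample_b, c_alpha=1.358):
    """Large-sample approximation; c_alpha=1.358 corresponds to alpha=0.05."""
    n, m = len(sample_a), len(sample_b)
    critical = c_alpha * sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical


# Hypothetical per-document filler rates, NOT the study's data.
human = [0.00, 0.05, 0.10, 0.20, 0.35, 0.40, 0.02, 0.30, 0.15, 0.08]
fixed_probability = [0.15, 0.17, 0.16, 0.18, 0.14, 0.16, 0.17, 0.15, 0.16, 0.18]
print(ks_statistic(human, fixed_probability))
```

The example shows how two samples with similar means can still differ in shape: a fixed injection probability clusters rates tightly around one value, while human rates spread out by context. The KS statistic responds to that shape difference, not just to the mean.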
Bottom‑Line Ranking of Steps
| Rank | Step | Contribution (Mean Drop) | Take‑away |
|---|---|---|---|
| 1 | Filler Insertion | ‑0.323 | Most critical – watch for false positives |
| 2 | Long Sentence Splitting | ‑0.194 | Aligns words‑per‑sentence to human level |
| 3 | Short Sentence Insertion | ‑0.182 | Introduces natural sentence‑length variation |
| 4 | Hedge Injection | ‑0.137 | Adds ambiguity, modest impact |
| 5 | Cushion Injection | ‑0.094 | Inserts polite prefacing (“Sure,”, “Of course,”) |
| 6 | Self‑Correction Injection | ‑0.001 | Effectively zero – removed from final design |
Resources
- Code & Data:
- Full research article: (link to next article)
Status: Formally published as a preprint
Title: HumanPersonaBase: A Language-Agnostic Framework for Human-Like AI Communication