[Paper] Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies

Published: April 15, 2026 at 01:31 PM EDT

Source: arXiv - 2604.14111v1

Overview

The paper Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies investigates how the “voice” of a text differs when it is written by people versus large language models (LLMs). By applying a classic linguistic feature set to millions of sentences, the authors show which stylistic cues survive (or disappear) across different models, genres, prompts, and decoding methods. These insights are directly useful for anyone building or defending LLM‑driven products.

Key Contributions

Large‑scale stylistic audit: Analyzed 11 LLMs (including chat‑based variants) across 8 genres (e.g., news, fiction, academic) and 4 decoding strategies.
Feature‑level interpretability: Used Douglas Biber’s lexicogrammatical and functional feature taxonomy (≈ 100 linguistic markers) to quantify style, rather than black‑box embeddings.
Robust differentiators: Identified a handful of linguistic features (e.g., nominal density, clause complexity, discourse markers) that consistently separate human from machine text, regardless of prompting tricks.
Genre dominates style: Demonstrated that genre exerts a stronger influence on the feature distribution than the source (human vs. LLM).
Model‑centric clustering: Found that chat‑oriented models (e.g., ChatGPT, Claude) form tight clusters in the stylistic space, while older “completion” models are more dispersed.
Decoding impact hierarchy: Showed that model choice matters more than decoding strategy (e.g., temperature, top‑p/nucleus, or typical sampling), though certain strategies can amplify or mute specific stylistic cues.
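
To make the decoding terms concrete, here is a minimal sketch of how greedy decoding, temperature scaling, and top‑p (nucleus) filtering act on a single next‑token distribution. The logits, the 5‑token vocabulary, and all function names are illustrative assumptions; this is not the paper's generation code.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Greedy decoding: always pick the most probable token index."""
    return max(range(len(probs)), key=lambda i: probs[i])


def top_p_filter(probs, p=0.9):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample(dist, rng):
    """Draw one token index from a {index: probability} mapping."""
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]


# Toy next-token logits for a hypothetical 5-token vocabulary (invented values).
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
rng = random.Random(0)

probs_t1 = softmax(logits, temperature=1.0)
probs_t07 = softmax(logits, temperature=0.7)
nucleus = top_p_filter(probs_t1, p=0.9)

print("greedy pick:", greedy(probs_t1))
print("top token prob at T=1.0:", round(probs_t1[0], 3))
print("top token prob at T=0.7:", round(probs_t07[0], 3))
print("tokens kept by top-p 0.9:", sorted(nucleus))
print("one nucleus sample:", sample(nucleus, rng))
```

Lowering the temperature concentrates probability on the top token, while top‑p removes the low‑probability tail before sampling. Both change which words are likely to appear, which is one plausible reason decoding can shift lexical cues without reshaping syntax.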

Methodology

Data collection – The authors gathered human‑written corpora for eight well‑defined genres (news, editorial, academic, fiction, etc.) and generated parallel texts with 11 publicly available LLMs. For each model they applied four decoding settings: greedy decoding, temperature sampling (0.7), top‑p (nucleus) sampling (0.9), and typical sampling.
Feature extraction – Using the Biber 1991 framework, they computed ~100 lexical, grammatical, and discourse‑level features (e.g., noun‑phrase density, verb tense variety, connective usage). This approach yields interpretable numbers rather than opaque vector embeddings.
Statistical analysis – Feature vectors were normalized and visualized with PCA and t‑SNE to inspect clustering. ANOVA and mixed‑effects models quantified the relative contribution of source (human vs. LLM), genre, model, and decoding to stylistic variance.
Robustness checks – Prompt engineering experiments (e.g., “write like a human”) and few‑shot continuations were run to test whether LLMs could deliberately mimic human style.
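
The feature‑extraction and normalization steps can be illustrated with a small sketch. It computes a few crude, Biber‑style rates per 1,000 tokens and then standardizes them across documents. The connective list, the suffix heuristic for nominalizations, the function names, and the two sample texts are all assumptions made for illustration; a real extractor would rely on part‑of‑speech tagging and a much larger feature inventory.

```python
import re
from statistics import mean, pstdev

# Small illustrative connective list (an assumption, not a full inventory).
CONNECTIVES = {"however", "therefore", "moreover", "thus", "furthermore", "nevertheless"}
# Suffix heuristic for nominalizations; crude, e.g. it also matches "city".
NOMINAL_SUFFIXES = ("tion", "tions", "ment", "ments", "ness", "nesses", "ity", "ities")


def tokenize(text):
    """Lowercase word tokenizer; a real pipeline would use a POS tagger."""
    return re.findall(r"[a-z']+", text.lower())


def extract_features(text):
    """Return a few interpretable features, with counts scaled per 1,000 tokens."""
    tokens = tokenize(text)
    n = len(tokens)
    if n == 0:
        raise ValueError("text contains no word tokens")
    per_k = 1000.0 / n
    return {
        "connectives_per_1k": sum(t in CONNECTIVES for t in tokens) * per_k,
        "nominalizations_per_1k": sum(t.endswith(NOMINAL_SUFFIXES) for t in tokens) * per_k,
        "mean_word_length": mean(len(t) for t in tokens),
        "type_token_ratio": len(set(tokens)) / n,
    }


def zscore_columns(rows):
    """Standardize each feature across documents (mean 0, unit variance)."""
    out = [dict() for _ in rows]
    for name in rows[0]:
        values = [r[name] for r in rows]
        mu, sigma = mean(values), pstdev(values)
        for r_out, v in zip(out, values):
            r_out[name] = 0.0 if sigma == 0 else (v - mu) / sigma
    return out


# Two invented sample texts, used only to exercise the functions.
docs = [
    "However, I think the plan works. Therefore we should go ahead, and thus we did.",
    "The implementation of the optimization requires evaluation of the configuration.",
]
features = [extract_features(d) for d in docs]
standardized = zscore_columns(features)

for raw in features:
    print({k: round(v, 2) for k, v in raw.items()})
```

The standardized rows are what a projection such as PCA or t‑SNE would take as input; because each column is a named linguistic rate, any separation found downstream can be traced back to a specific feature.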

Results & Findings

Factor	Effect on Stylistic Features	Notable Observations
Genre	Largest variance contributor (≈ 45 % of total variance)	Same model produces very different styles when switching from news to fiction.
Model	Second‑largest effect (≈ 30 %)	Chat‑based models cluster tightly; older models show more spread.
Decoding strategy	Modest effect (≈ 10 %)	Temperature and top‑p can slightly increase lexical diversity but rarely change high‑level syntactic patterns.
Prompt nudging	Minimal impact on core differentiators	Even when asked to “write like a human,” LLMs retain higher nominal density and fewer discourse markers.
Key differentiators	Consistent across conditions	Higher noun‑phrase density, lower use of discourse connectives, and reduced clause embedding depth in LLM output.

In short, style is driven more by what you ask the model to write (genre) and which model you use than by how you sample the text.
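
One simple way to read "share of variance" is eta‑squared: the between‑group sum of squares for a factor divided by the total sum of squares. The sketch below computes it for one feature in a balanced toy design. The effect sizes and factor levels are invented for illustration, and the one‑way calculation is a simplification of the ANOVA and mixed‑effects analysis the summary describes.

```python
from collections import defaultdict
from statistics import mean


def eta_squared(records, factor, value_key="value"):
    """Share of total variance in `value_key` explained by grouping on `factor`
    (between-group sum of squares divided by total sum of squares)."""
    values = [r[value_key] for r in records]
    grand = mean(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    groups = defaultdict(list)
    for r in records:
        groups[r[factor]].append(r[value_key])
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total if ss_total else 0.0


# Invented additive effects for one stylistic feature (not the paper's data).
genre_effect = {"news": 3.0, "fiction": -3.0}
model_effect = {"model_a": 1.5, "model_b": -1.5}
decoding_effect = {"greedy": 0.5, "top_p": -0.5}

# Balanced 2 x 2 x 2 design: one observation per combination of levels.
records = [
    {"genre": g, "model": m, "decoding": d, "value": ge + me + de}
    for g, ge in genre_effect.items()
    for m, me in model_effect.items()
    for d, de in decoding_effect.items()
]

shares = {f: eta_squared(records, f) for f in ("genre", "model", "decoding")}
for factor, share in shares.items():
    print(f"{factor}: {share:.1%} of variance")
```

In this balanced, purely additive toy design the three shares sum to one. Real data with interactions, residual noise, or unbalanced cells would leave some variance unattributed, which is consistent with the reported percentages not adding up to 100 %.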

Practical Implications

Content moderation & detection: Security teams can focus on a small set of robust linguistic markers (e.g., connective frequency) to flag synthetic text, even when adversaries tweak prompts or sampling.
Prompt engineering: Knowing that genre dominates style, developers can steer LLMs by explicitly setting the genre context (e.g., “Write a news article about X”) rather than fiddling with temperature.
Model selection for tone‑sensitive applications: If a product requires a “human‑like” discourse flow (e.g., tutoring bots), the reported effect sizes suggest that the choice of model (for example, a chat‑optimized one) is likely to matter more than tuning decoding parameters.
Fine‑tuning & style transfer: The identified feature set could inform a training objective or reward signal for style‑controlled fine‑tuning, helping developers push a model toward a target genre’s stylistic fingerprint.
Compliance & academic integrity tools: Institutions can integrate lightweight Biber‑feature extractors into plagiarism‑check pipelines to detect AI‑generated essays without heavy neural classifiers.
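
As a rough illustration of what a lightweight, feature‑based detector might look like, the sketch below fits a nearest‑centroid classifier on standardized feature vectors. The two features, the training vectors, and the function names are hypothetical; the direction of the toy data (fewer connectives and higher nominal density for LLM text) follows the differentiators summarized above, but the numbers themselves are invented.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit(labeled):
    """Map each label to the centroid of its training vectors."""
    return {label: centroid(vs) for label, vs in labeled.items()}


def predict(centroids, vector):
    """Return the label whose centroid is closest to `vector`."""
    return min(centroids, key=lambda label: distance(centroids[label], vector))


# Invented, already-standardized vectors: [connective rate, nominal density].
training = {
    "human": [[0.9, -0.8], [1.1, -1.0], [0.7, -0.6]],
    "llm": [[-1.0, 0.9], [-0.8, 1.1], [-0.9, 0.7]],
}
centroids = fit(training)

print(predict(centroids, [0.8, -0.5]))
print(predict(centroids, [-0.7, 1.0]))
```

A classifier this simple is easy to audit because each decision reduces to distances along named features. Since genre is reported as the largest source of variation, a practical version would probably need separate centroids per genre and validation on held‑out text before being trusted.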

Limitations & Future Work

Feature set coverage: Biber’s taxonomy, while comprehensive, was designed for English prose; it may miss genre‑specific cues in code, multilingual, or highly informal domains (e.g., social media memes).
Model diversity: The study focused on 11 publicly released models; emerging open‑source LLMs with different training regimes could exhibit new stylistic patterns.
Dynamic prompting: Only static prompts were evaluated; interactive, multi‑turn prompting might allow models to adapt style more fluidly.
Real‑world noise: Human corpora were curated and relatively clean; noisy user‑generated content could blur the genre‑style relationship.

Future research directions include extending the feature analysis to multilingual corpora, exploring adaptive prompting strategies that can deliberately shift stylistic dimensions, and integrating these interpretable markers into real‑time detection APIs.

Authors

Swati Rallapalli
Shannon Gallagher
Ronald Yurko
Tyler Brooks
Chuck Loughin
Michele Sezgin
Violet Turri

Paper Information

arXiv ID: 2604.14111v1
Categories: cs.CL
Published: April 15, 2026
