[Paper] Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies

Published: April 15, 2026 at 01:31 PM EDT
Source: arXiv - 2604.14111v1

Overview

The paper Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies investigates how the “voice” of a text differs when it is written by people versus large language models (LLMs). By applying a classic linguistic feature set to millions of sentences, the authors show which stylistic cues persist (or disappear) across different models, genres, prompts, and decoding methods. These findings are directly useful to anyone building or defending LLM‑driven products.

Key Contributions

  • Large‑scale stylistic audit: Analyzed 11 LLMs (including chat‑based variants) across 8 genres (e.g., news, fiction, academic) and 4 decoding strategies.
  • Feature‑level interpretability: Used Douglas Biber’s lexicogrammatical and functional feature taxonomy (≈ 100 linguistic markers) to quantify style, rather than black‑box embeddings.
  • Robust differentiators: Identified a handful of linguistic features (e.g., nominal density, clause complexity, discourse markers) that consistently separate human from machine text, regardless of prompting tricks.
  • Genre dominates style: Demonstrated that genre exerts a stronger influence on the feature distribution than the source (human vs. LLM).
  • Model‑centric clustering: Found that chat‑oriented models (e.g., ChatGPT, Claude) form tight clusters in the stylistic space, while older “completion” models are more dispersed.
  • Decoding impact hierarchy: Showed that model choice matters more than decoding strategy (greedy, temperature, top‑p, typical sampling), though certain strategies can amplify or mute specific stylistic cues.
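
For readers less familiar with decoding strategies, the following is a minimal sketch of how greedy decoding, temperature scaling, and top‑p filtering act on a single next‑token distribution. The toy logits and all function names are illustrative assumptions and are not taken from the paper; typical sampling is omitted for brevity.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a temperature below 1 sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

def greedy(logits):
    """Greedy decoding: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Toy logits over a four-token vocabulary (made up for illustration).
LOGITS = [2.0, 1.0, 0.5, -1.0]

if __name__ == "__main__":
    rng = random.Random(0)
    print("greedy:", greedy(LOGITS))
    print("temperature 0.7:", sample(softmax(LOGITS, 0.7), rng))
    print("top-p 0.9:", sample(top_p_filter(softmax(LOGITS), 0.9), rng))
```

These strategies only reshape or truncate the distribution that the model already produces, which is one plausible reading of why the study finds model choice to outweigh decoding choice.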

Methodology

  1. Data collection – The authors gathered human‑written corpora for eight well‑defined genres (news, editorial, academic, fiction, etc.) and generated parallel texts with 11 publicly available LLMs. For each model they applied four decoding settings: greedy, temperature 0.7, top‑p 0.9, and typical sampling.
  2. Feature extraction – Using the Biber 1991 framework, they computed ~100 lexical, grammatical, and discourse‑level features (e.g., noun‑phrase density, verb tense variety, connective usage). This approach yields interpretable numbers rather than opaque vector embeddings.
  3. Statistical analysis – Feature vectors were normalized and visualized with PCA and t‑SNE to inspect clustering. ANOVA and mixed‑effects models quantified the relative contribution of source (human vs. LLM), genre, model, and decoding to stylistic variance.
  4. Robustness checks – Prompt engineering experiments (e.g., “write like a human”) and few‑shot continuations were run to test whether LLMs could deliberately mimic human style.
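
To make steps 2 and 3 concrete, here is a rough sketch of computing a few interpretable surface features per document and standardizing them across documents. It does not implement Biber's feature set: the three features are crude proxies, and the connective list and all names are hypothetical choices made for illustration only.

```python
import re
import statistics

# Hypothetical, deliberately tiny connective list; not Biber's inventory.
CONNECTIVES = {"however", "therefore", "moreover", "thus", "furthermore", "nevertheless"}

def surface_features(text):
    """Return a dict of simple, human-readable style features for one document."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        raise ValueError("text has no words")
    n = len(words)
    return {
        "connectives_per_1k": 1000 * sum(w in CONNECTIVES for w in words) / n,
        "mean_sentence_len": n / len(sentences),
        "type_token_ratio": len(set(words)) / n,
    }

def zscore_columns(rows):
    """Standardize each feature across documents (mean 0, sample std 1)."""
    out = [dict() for _ in rows]
    for key in rows[0]:
        col = [r[key] for r in rows]
        mu = statistics.mean(col)
        sd = statistics.stdev(col) if len(col) > 1 else 0.0
        for target, value in zip(out, col):
            # A constant feature carries no information, so map it to 0.
            target[key] = (value - mu) / sd if sd > 0 else 0.0
    return out

if __name__ == "__main__":
    docs = [
        "However, it rained. We stayed home.",
        "The cat sat. The cat slept. The cat ate.",
    ]
    raw = [surface_features(d) for d in docs]
    for row in zscore_columns(raw):
        print(row)
```

Because each column is a named quantity, a standardized vector like this can be inspected feature by feature, which is the interpretability advantage the authors emphasize over embeddings.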

Results & Findings

| Factor | Effect on stylistic features | Notable observations |
|---|---|---|
| Genre | Largest variance contributor (≈ 45 % of total variance) | Same model produces very different styles when switching from news to fiction. |
| Model | Second‑largest effect (≈ 30 %) | Chat‑based models cluster tightly; older models show more spread. |
| Decoding strategy | Modest effect (≈ 10 %) | Temperature and top‑p can slightly increase lexical diversity but rarely change high‑level syntactic patterns. |
| Prompt nudging | Minimal impact on core differentiators | Even when asked to “write like a human,” LLMs retain higher nominal density and fewer discourse markers. |
| Key differentiators | Consistent across conditions | Higher noun‑phrase density, lower use of discourse connectives, and reduced clause embedding depth in LLM output. |

In short, style is driven more by what you ask the model to write (genre) and which model you use than by how you sample the text.
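
As a rough illustration of what a "share of variance" means here, the sketch below computes one‑way eta‑squared (between‑group sum of squares divided by total sum of squares) for a single feature. This is a simplification: the paper reports ANOVA and mixed‑effects models, and the numbers below are toy values invented for the example, not results from the study.

```python
from collections import defaultdict

def eta_squared(values, labels):
    """Fraction of a feature's variance explained by one grouping factor."""
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    return ss_between / ss_total if ss_total > 0 else 0.0

if __name__ == "__main__":
    # Toy feature values for four generated texts (invented for illustration).
    feature = [1.0, 2.0, 5.0, 6.0]
    genre = ["news", "news", "fiction", "fiction"]
    decoding = ["greedy", "top_p", "greedy", "top_p"]
    print("genre share:", eta_squared(feature, genre))
    print("decoding share:", eta_squared(feature, decoding))
```

In this toy setup the genre grouping accounts for most of the spread and the decoding grouping for very little, mirroring the direction (not the magnitudes) of the reported findings.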

Practical Implications

  • Content moderation & detection: Security teams can focus on a small set of robust linguistic markers (e.g., connective frequency) to flag synthetic text, even when adversaries tweak prompts or sampling.
  • Prompt engineering: Knowing that genre dominates style, developers can steer LLMs by explicitly setting the genre context (e.g., “Write a news article about X”) rather than fiddling with temperature.
  • Model selection for tone‑sensitive applications: If a product requires a “human‑like” discourse flow (e.g., tutoring bots), the results suggest that choosing the right model (for instance, a chat‑optimized one) matters more than tuning decoding parameters.
  • Fine‑tuning & style transfer: The identified features could serve as targets in a training objective or reward signal for style‑controlled fine‑tuning, helping developers push a model toward a target genre’s stylistic fingerprint.
  • Compliance & academic integrity tools: Institutions can integrate lightweight Biber‑feature extractors into plagiarism‑check pipelines to detect AI‑generated essays without heavy neural classifiers.
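
A lightweight, marker‑based check of the kind described above might look like the following sketch. It compares one feature value (for example, connectives per 1,000 words) against a reference sample of human texts. The reference numbers and the threshold of -2.0 are hypothetical placeholders, not values from the paper, and a real detector would need calibration per genre given how strongly genre shapes these features.

```python
import statistics

def deviation_from_reference(value, reference_values):
    """Z-score of a feature value relative to a human reference sample."""
    mu = statistics.mean(reference_values)
    sd = statistics.stdev(reference_values)
    if sd == 0:
        raise ValueError("reference values have zero spread")
    return (value - mu) / sd

def flag_low_marker(value, reference_values, threshold=-2.0):
    """Flag a text whose marker rate falls far below the human reference.

    The default threshold is an arbitrary illustration, not a tuned value.
    """
    return deviation_from_reference(value, reference_values) <= threshold

if __name__ == "__main__":
    # Hypothetical connective rates (per 1,000 words) from human texts in one genre.
    human_reference = [10.0, 12.0, 14.0]
    for rate in (6.0, 11.0):
        print(rate, flag_low_marker(rate, human_reference))
```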

Limitations & Future Work

  • Feature set coverage: Biber’s taxonomy, while comprehensive, was designed for English prose; it may miss domain‑specific cues in code, multilingual text, or highly informal writing (e.g., social media memes).
  • Model diversity: The study focused on 11 publicly released models; emerging open‑source LLMs with different training regimes could exhibit new stylistic patterns.
  • Dynamic prompting: Only static prompts were evaluated; interactive, multi‑turn prompting might allow models to adapt style more fluidly.
  • Real‑world noise: Human corpora were curated and relatively clean; noisy user‑generated content could blur the genre‑style relationship.

Future research directions include extending the feature analysis to multilingual corpora, exploring adaptive prompting strategies that can deliberately shift stylistic dimensions, and integrating these interpretable markers into real‑time detection APIs.

Authors

  • Swati Rallapalli
  • Shannon Gallagher
  • Ronald Yurko
  • Tyler Brooks
  • Chuck Loughin
  • Michele Sezgin
  • Violet Turri

Paper Information

  • arXiv ID: 2604.14111v1
  • Categories: cs.CL
  • Published: April 15, 2026