[Paper] A stylometric analysis of speaker attribution from speech transcripts
Source: arXiv - 2512.13667v1
Overview
The paper introduces StyloSpeaker, a stylometric system that attributes transcribed speech to its original speaker using only textual cues. By treating speech transcripts like written documents, the authors show that classic authorship‑attribution techniques can complement (or even replace) acoustic speaker recognition when voices are masked, synthesized, or otherwise unreliable.
Key Contributions
- Novel task framing: Recasts speaker attribution as a content‑based authorship problem applied to speech transcripts.
- StyloSpeaker model: A transparent, feature‑rich pipeline that aggregates character‑, word‑, token‑, and sentence‑level features alongside higher‑order stylistic metrics drawn from the stylometry literature.
- Dual transcript formats: Experiments on both “prescriptive” (capitalization & punctuation retained) and “normalized” (all formatting stripped) transcripts to assess the impact of orthographic cues.
- Topic‑control analysis: Systematic evaluation under varying degrees of topic similarity between compared transcripts, revealing how content overlap influences attribution accuracy.
- Explainability vs. black‑box: Direct comparison with neural baselines (e.g., BERT‑style classifiers) to highlight the trade‑off between interpretability and raw performance.
- Feature importance insights: Identification of the most discriminative stylometric signals for speaker differentiation (e.g., function‑word usage, sentence length variance).
Methodology
- Data preparation – The authors collected paired speech recordings from known speakers, then generated two transcript versions (a minimal normalization sketch follows this list):
  - Prescriptive: retains typical writing conventions (capital letters, commas, periods).
  - Normalized: strips all such conventions, leaving a plain token stream.
- Feature extraction – For each transcript, StyloSpeaker computes a suite of 200+ stylometric attributes (a feature‑extraction and similarity‑scoring sketch follows this list), including:
  - Character n‑grams (e.g., frequency of “th”, “ing”).
  - Word‑level statistics (type‑token ratio, function‑word frequencies).
  - Token‑level patterns (use of numbers, emojis, filler words).
  - Sentence‑level metrics (average length, punctuation density).
  - Higher‑order style markers (readability scores, lexical richness).
- Similarity scoring – Pairs of transcripts are compared using cosine similarity on the normalized feature vectors; a higher similarity suggests the same speaker.
- Evaluation regimes – The authors vary the topic control:
  - Loose: speakers discuss unrelated subjects.
  - Moderate: overlapping themes but different content.
  - Strong: identical prompts, forcing the model to rely on style rather than topic.
- Baselines – Two neural classifiers (a fine‑tuned BERT model and a simple LSTM) are trained on the same data for a head‑to‑head comparison of performance and interpretability (a hedged fine‑tuning sketch follows this list).
- Feature importance analysis – Using permutation importance and SHAP values, the study surfaces which stylometric cues drive correct attributions (a permutation‑importance sketch follows this list).
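To make the two transcript formats concrete, here is a minimal normalization sketch. The function name and the exact conventions stripped (case, punctuation, extra whitespace) are illustrative assumptions, not the authors' preprocessing code.

```python
import re

def normalize_transcript(prescriptive: str) -> str:
    """Derive a 'normalized' transcript from a 'prescriptive' one.

    Illustrative assumption: normalization means lowercasing, removing
    punctuation (apostrophes kept), and collapsing whitespace.
    """
    text = prescriptive.lower()            # drop capitalization cues
    text = re.sub(r"[^\w\s']", " ", text)  # drop punctuation marks
    return " ".join(text.split())          # collapse runs of whitespace

print(normalize_transcript("Well, I think... no, I KNOW we should go."))
# -> "well i think no i know we should go"
```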
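The next sketch illustrates the feature‑extraction and cosine‑scoring steps under stated assumptions: it computes only a handful of the 200+ attributes (type‑token ratio, a few function‑word and character‑bigram frequencies, mean sentence length), and every name in it is hypothetical rather than taken from the paper's code.

```python
import math
import re
from collections import Counter

# Tiny illustrative function-word list; real stylometric inventories are larger.
FUNCTION_WORDS = ["the", "a", "and", "of", "to", "i", "you", "that", "it", "in"]

def extract_features(transcript):
    """Map a transcript to a small stylometric feature vector (illustrative subset)."""
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    tokens = transcript.lower().split()
    n = max(len(tokens), 1)

    ttr = len(set(tokens)) / n                                 # type-token ratio
    fw_freqs = [tokens.count(w) / n for w in FUNCTION_WORDS]   # function-word rates

    chars = transcript.lower()
    bigrams = Counter(chars[i:i + 2] for i in range(len(chars) - 1))
    total = max(sum(bigrams.values()), 1)
    bi_freqs = [bigrams[b] / total for b in ("th", "in", "er")]  # character bigrams

    mean_sent_len = n / max(len(sentences), 1)                 # sentence-level metric
    return [ttr, mean_sent_len, *fw_freqs, *bi_freqs]

def cosine_similarity(u, v):
    """Cosine between two feature vectors; higher suggests the same speaker."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

score = cosine_similarity(extract_features("Well, I think it is fine."),
                          extract_features("I think that it is fine, you know."))
```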
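For the neural baseline, a fine‑tuned BERT classifier could be set up along these lines with Hugging Face Transformers; the checkpoint, the number of speakers, and the (omitted) training loop are assumptions, not the paper's exact configuration.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical setup: treat speaker attribution as sequence classification,
# with one output label per candidate speaker (4 is an assumed count).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=4)

inputs = tokenizer(["well i think it is fine you know"],
                   padding=True, truncation=True, return_tensors="pt")
logits = model(**inputs).logits  # one score per candidate speaker
# Fine-tuning on labeled transcripts (e.g., with transformers' Trainer) omitted.
```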
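The permutation‑importance part of the analysis can be reproduced in spirit with scikit‑learn; the random‑forest classifier and the synthetic stand‑in data below are assumptions for illustration (the SHAP analysis would additionally require the shap package).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: rows are stylometric feature vectors,
# labels are speaker identities (real features would replace this).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 12))    # e.g., 12 features per transcript
y = rng.integers(0, 4, size=400)  # 4 hypothetical speakers

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one feature column at a time and measure the accuracy drop:
# larger drops indicate more speaker-discriminative features.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print("Most discriminative feature indices:", ranking[:5])
```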
Results & Findings
| Condition | Transcript Type | StyloSpeaker Accuracy | Neural Baseline Accuracy |
|---|---|---|---|
| Loose topic | Prescriptive | 71 % | 73 % |
| Loose topic | Normalized | 78 % | 80 % |
| Moderate topic | Prescriptive | 74 % | 76 % |
| Moderate topic | Normalized | 82 % | 84 % |
| Strong topic | Prescriptive | 86 % | 84 % |
| Strong topic | Normalized | 89 % | 87 % |
Key Takeaways
- Normalization helps – Removing orthographic cues forces the model to lean on deeper stylistic patterns, boosting performance across the board.
- Topic control matters – When speakers answer the same prompt, the gap between stylometric and neural methods narrows, but StyloSpeaker still edges out the black‑box baselines.
- Explainability wins – StyloSpeaker’s top features (function‑word ratios, sentence‑length variance, specific character n‑grams) align with linguistic intuition about individual “writing fingerprints.”
- Neural models are competitive but opaque; they achieve similar scores only when large amounts of labeled data are available.
Practical Implications
- Forensic investigations – Agencies can deploy StyloSpeaker on transcribed ransom calls, covert recordings, or synthetic‑voice threats where acoustic cues are compromised.
- Content‑moderation platforms – Detecting coordinated disinformation campaigns that use text‑to‑speech bots becomes feasible by analyzing the underlying transcript style.
- Legal e‑discovery – Lawyers can quickly flag anonymous documents (e.g., alleged suicide notes) that match known authors without needing voice recordings.
- Developer toolkits – The feature set is lightweight (no GPU‑heavy models) and can be integrated into existing NLP pipelines (e.g., spaCy, scikit‑learn) for real‑time speaker‑attribution services; a minimal integration sketch follows this list.
- Privacy‑preserving analytics – Since the approach works on text alone, it sidesteps the need to store or process raw audio, easing compliance with data‑protection regulations.
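As a sketch of the integration point above, a lightweight stylometric featurizer can be dropped into a standard scikit‑learn pipeline; the three features and the logistic‑regression head are illustrative choices, not the paper's configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def featurize(transcripts):
    """Map raw transcript strings to tiny stylometric vectors (illustrative only)."""
    rows = []
    for t in transcripts:
        tokens = t.lower().split()
        n = max(len(tokens), 1)
        rows.append([
            len(set(tokens)) / n,       # type-token ratio
            sum(map(len, tokens)) / n,  # mean word length
            tokens.count("the") / n,    # one function-word frequency
        ])
    return np.array(rows)

# CPU-only pipeline: feature extraction -> scaling -> linear classifier.
speaker_clf = make_pipeline(
    FunctionTransformer(featurize),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# Usage sketch (hypothetical data):
# speaker_clf.fit(train_transcripts, train_speaker_ids)
# speaker_clf.predict(["well i think it is fine you know"])
```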
Limitations & Future Work
- Dataset size & diversity – Experiments rely on a relatively small, controlled speaker pool; scaling to thousands of speakers with varied dialects remains an open challenge.
- Topic leakage – Even with strong topic control, subtle lexical overlap can inflate similarity scores; future work should explore more robust topic‑invariant representations.
- Cross‑language applicability – The current feature set is English‑centric; adapting StyloSpeaker to multilingual settings will require language‑specific stylometric resources.
- Hybrid models – Combining stylometric features with acoustic embeddings could yield a best‑of‑both‑worlds system, especially for partially masked audio.
- Real‑world deployment studies – Field trials with law‑enforcement or corporate security teams would validate the method’s operational robustness and user acceptance.
Authors
- Cristina Aggazzotti
- Elizabeth Allyn Smith
Paper Information
- arXiv ID: 2512.13667v1
- Categories: cs.CL
- Published: December 15, 2025