[Paper] LLMs Know More Than Words: A Genre Study with Syntax, Metaphor & Phonetics
Source: arXiv - 2512.04957v1
Overview
The paper investigates whether large language models (LLMs) truly understand deeper linguistic cues—syntax trees, metaphor usage, and phonetic patterns—or merely rely on surface‑level word statistics. By building a multilingual genre‑classification benchmark (poetry vs. novel, drama vs. poetry, drama vs. novel) across six European languages, the authors show how explicit linguistic features affect LLM performance and argue for richer linguistic signals during training.
Key Contributions
- A new multilingual genre‑classification dataset derived from Project Gutenberg, covering English, French, German, Italian, Spanish, and Portuguese with thousands of labeled sentences per binary task.
- Three complementary linguistic feature sets (syntactic parse trees, metaphor counts, phonetic/metrical metrics) that can be appended to raw text for model input.
- Systematic experiments comparing vanilla LLM classifiers, LLMs fine‑tuned on raw text, and LLMs augmented with the explicit feature sets.
- Cross‑lingual analysis revealing which linguistic cues matter most for each genre distinction and language.
- Insights into model interpretability, showing that LLMs can implicitly learn some structural patterns but benefit from explicit cues for harder distinctions (e.g., drama vs. poetry).
Methodology
- Dataset construction – The authors scraped public‑domain books from Project Gutenberg, automatically labeled each sentence with its source genre (poetry, drama, novel), and balanced the data for each binary classification task (a labeling‑and‑balancing sketch follows this list).
- Feature extraction (a feature‑extraction sketch follows this list) –
  - Syntax: constituency parses generated with spaCy/StanfordNLP, encoded as bracketed strings.
  - Metaphor: counts of metaphorical expressions identified via a pre‑trained metaphor detector.
  - Phonetics: syllable counts, stress patterns, and rhyme density computed from language‑specific phoneme dictionaries.
- Model variants (the feature‑augmented setup is sketched after this list) –
  - Baseline LLM (e.g., mBERT, XLM‑R) fine‑tuned on raw sentences.
  - Feature‑augmented LLM, where the three feature vectors are concatenated to the token embeddings (or fed through a small adapter).
  - Hybrid: a lightweight classifier (logistic regression) trained solely on the explicit features, for comparison.
- Evaluation – Accuracy, F1, and cross‑language transfer scores are reported for each task, plus ablation studies that drop one feature set at a time (the hybrid baseline and its evaluation are sketched after this list).
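A minimal sketch of the dataset‑construction step, assuming the Gutenberg texts have already been downloaded into genre‑named folders (the `corpus/<genre>/` layout and the regex sentence splitter are illustrative choices; the paper's exact scraping and segmentation pipeline may differ).

```python
# Hypothetical layout: corpus/<genre>/<book>.txt, one plain-text book per file.
import random
import re
from pathlib import Path

GENRES = ("poetry", "drama", "novel")

def load_sentences(root="corpus"):
    """Label every sentence with the genre of the book it came from."""
    data = []
    for genre in GENRES:
        for book in Path(root, genre).glob("*.txt"):
            text = book.read_text(encoding="utf-8", errors="ignore")
            # Naive sentence split; the paper does not specify its segmenter.
            for sent in re.split(r"(?<=[.!?])\s+", text):
                sent = sent.strip()
                if len(sent.split()) >= 4:          # drop fragments
                    data.append((sent, genre))
    return data

def balanced_binary(data, genre_a, genre_b, seed=0):
    """Downsample the larger class so both genres contribute equally."""
    a = [x for x in data if x[1] == genre_a]
    b = [x for x in data if x[1] == genre_b]
    n = min(len(a), len(b))
    rng = random.Random(seed)
    pairs = rng.sample(a, n) + rng.sample(b, n)
    rng.shuffle(pairs)
    return pairs

# e.g. the poetry-vs-novel task:
# task = balanced_binary(load_sentences(), "poetry", "novel")
```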
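A rough sketch of the feature‑extraction step. It substitutes spaCy dependency‑tree depth for the bracketed constituency parses the authors describe, approximates syllable counts with pyphen hyphenation, and leaves the metaphor detector as a stub, since the paper's detector is not named here; all three are illustrative stand‑ins rather than the authors' pipeline.

```python
import pyphen
import spacy

nlp = spacy.load("en_core_web_sm")   # per-language models would be used in practice
hyph = pyphen.Pyphen(lang="en_US")   # hyphenation as a syllable-count proxy

def tree_depth(token):
    """Distance from a token to the root of its dependency tree."""
    depth = 0
    while token.head is not token:
        token = token.head
        depth += 1
    return depth

def metaphor_count(sent):
    """Placeholder: the paper relies on a pre-trained metaphor detector."""
    return 0

def extract_features(sent):
    doc = nlp(sent)
    words = [t for t in doc if t.is_alpha]
    syllables = sum(len(hyph.inserted(t.text).split("-")) for t in words)
    return {
        "n_tokens": len(words),
        "max_depth": max((tree_depth(t) for t in doc), default=0),
        "syllables_per_word": syllables / max(len(words), 1),
        "metaphors": metaphor_count(sent),
    }

# extract_features("The moon was a ghostly galleon tossed upon cloudy seas.")
```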
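A sketch of the feature‑augmented variant, assuming the explicit feature vector is concatenated to the encoder's sentence embedding before a small classification head. The model name (xlm-roberta-base), feature dimension, and head size are illustrative, and the paper's adapter alternative is not reproduced here.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class FeatureAugmentedClassifier(nn.Module):
    def __init__(self, encoder_name="xlm-roberta-base", n_features=4, n_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(hidden + n_features, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, input_ids, attention_mask, features):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # first-token sentence vector
        return self.head(torch.cat([cls, features], dim=-1))

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = FeatureAugmentedClassifier()

batch = tokenizer(["Shall I compare thee to a summer's day?"],
                  return_tensors="pt", padding=True, truncation=True)
feats = torch.tensor([[10.0, 3.0, 1.3, 1.0]])      # e.g. the four values from the feature-extraction sketch
logits = model(batch["input_ids"], batch["attention_mask"], feats)
```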
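For the hybrid baseline and the reported metrics, a scikit-learn sketch, assuming rows of explicit features `X` and binary genre labels `y`; the paper's exact splits and hyperparameters are not specified here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def evaluate_hybrid(X, y):
    """Train a logistic regression on explicit features only and report accuracy/F1."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    return {"accuracy": accuracy_score(y_te, pred), "f1": f1_score(y_te, pred)}
```

Dropping one feature column at a time from `X` and re-running the same routine gives a simple analogue of the paper's ablation studies.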
Results & Findings
| Task (Language) | Baseline LLM | +Syntax | +Metaphor | +Phonetics | Best Combo |
|---|---|---|---|---|---|
| Poetry vs Novel (EN) | 84.2 % | 86.7 % | 85.1 % | 85.8 % | 88.3 % (Syntax + Phonetics) |
| Drama vs Poetry (FR) | 78.5 % | 81.0 % | 79.4 % | 80.2 % | 83.1 % (Syntax) |
| Drama vs Novel (DE) | 80.3 % | 82.5 % | 81.0 % | 81.7 % | 84.0 % (Syntax + Metaphor) |
- LLMs already capture some syntactic regularities from raw text, but explicit parse information consistently lifts performance by 2–4 percentage points.
- Metaphor counts are most helpful for distinguishing drama from poetry, likely because dramatic dialogue tends to be more literal.
- Phonetic metrics boost poetry detection, especially in Romance languages where rhyme and meter are strong genre markers.
- Cross‑lingual transfer works better when at least one explicit feature is present, suggesting that linguistic universals (e.g., syntactic depth) help bridge language gaps.
Practical Implications
- Better genre‑aware content pipelines – Publishers and e‑book platforms can automatically tag new uploads with higher confidence, enabling smarter recommendation engines.
- Enhanced literary analysis tools – Researchers can query large corpora for stylistic patterns (e.g., "find all poems with a specific metrical structure") without hand‑crafting parsers for each language; see the sketch after this list.
- Improved downstream NLP – Tasks like sentiment analysis or summarization often benefit from genre context; feeding explicit syntax/phonetics can make LLM‑based services more robust.
- Multilingual AI products – Companies building cross‑language chatbots or voice assistants can leverage the finding that adding universal linguistic cues reduces the amount of language‑specific data needed.
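As an illustration of the corpus‑query idea above, a small sketch that filters poems by an approximate metrical pattern; the vowel‑group syllable counter and the ten‑syllable "pentameter" proxy are crude stand‑ins for the paper's phoneme‑dictionary features.

```python
import re

def syllables(word):
    """Very rough syllable count: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def line_syllables(line):
    return sum(syllables(w) for w in re.findall(r"[A-Za-z']+", line))

def looks_like_pentameter(poem, tolerance=0.7):
    """Keep poems whose non-empty lines are mostly 9-11 syllables long."""
    lines = [l for l in poem.splitlines() if l.strip()]
    hits = sum(1 for l in lines if 9 <= line_syllables(l) <= 11)
    return bool(lines) and hits / len(lines) >= tolerance

# pentameter_poems = [p for p in corpus if looks_like_pentameter(p)]
```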
Limitations & Future Work
- The study focuses on six Indo‑European languages; low‑resource or typologically distant languages (e.g., agglutinative or tonal languages) remain untested.
- Feature extraction relies on pre‑existing parsers and metaphor detectors, which may introduce bias or errors that propagate to the classifier.
- Only binary genre distinctions were explored; extending to multi‑genre or hybrid texts (e.g., lyrical prose) is an open challenge.
- Future research could investigate end‑to‑end training where the model learns to predict linguistic annotations jointly with the main task, potentially reducing the need for external feature pipelines.
Authors
- Weiye Shi
- Zhaowei Zhang
- Shaoheng Yan
- Yaodong Yang
Paper Information
- arXiv ID: 2512.04957v1
- Categories: cs.CL, cs.AI
- Published: December 4, 2025
- PDF: https://arxiv.org/pdf/2512.04957v1