[Paper] Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?
Source: arXiv - 2512.07777v1
Overview
The paper “Mary, the Cheeseburger‑Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?” asks a deceptively simple question: can today’s large language models (LLMs) tell when a story doesn’t make sense? By pairing coherent and subtly incoherent short narratives, the authors probe whether LLMs’ internal representations and their outward responses align when it comes to spotting narrative breaks.
Key Contributions
- Dataset of paired narratives – 2,000 short stories where each coherent version has a minimally altered incoherent counterpart (e.g., a character acting against an established trait).
- Representation probing – Demonstrates that hidden states of several popular LLMs (GPT‑3.5, Llama‑2, Claude) encode enough signal to discriminate coherent vs. incoherent texts with >80% accuracy.
- Behavioral evaluation – Shows that when asked to rate story coherence, LLMs often fail to separate the two versions, even with varied prompts and chain‑of‑thought reasoning.
- Fine‑grained analysis of incoherence types – Finds that models are more sensitive to setting‑level violations (e.g., “rainy day in the desert”) than to character‑level trait violations (e.g., “vegetarian orders a cheeseburger”).
- Insight into the “representation‑behavior gap” – Highlights that strong internal signals do not automatically translate into reliable, user‑facing judgments.
Methodology
- Story Construction – Human annotators wrote short, self‑contained narratives (≈150 words). For each story, a single sentence was altered to create an incoherent version while keeping the rest identical.
- LLM Probing – Hidden-layer activations were extracted from the final token of each story, and a lightweight linear classifier was trained on a small labeled subset to predict coherence (a minimal sketch follows this list).
- Prompt‑Based Rating – The same LLMs were then asked, via zero‑shot and few‑shot prompts, to rate “How coherent is this story?” on a 1‑5 scale. Variations included direct questions, multiple‑choice formats, and chain‑of‑thought (CoT) reasoning prompts (see the rating sketch after this list).
- Incoherence Typology – Two categories were examined: setting violations (world‑knowledge contradictions) and character‑trait violations (behavioral inconsistencies).
- Evaluation Metrics – Classification accuracy for probing, correlation (Spearman ρ) between model ratings and ground‑truth labels, and statistical significance of differences across incoherence types.
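The probing step can be reproduced in miniature as follows: a minimal sketch assuming an open‑weight model served through Hugging Face transformers and a scikit‑learn logistic‑regression probe. The model name, layer choice, and toy stories are placeholders, not the authors’ exact configuration.

```python
# Minimal probing sketch: extract the final-token hidden state of each story
# and fit a linear probe to predict coherence. Model name, layer choice, and
# the toy stories/labels are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder; swap in the open-weight model under study
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def final_token_state(text: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the last token of `text` at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# Toy paired stories: 1 = coherent, 0 = incoherent (the paper uses its labeled subset).
stories = [
    "Mary is a committed vegetarian. At the diner she ordered a garden salad.",
    "Mary is a committed vegetarian. At the diner she ordered a cheeseburger.",
    "It was a rainy afternoon, so Tom grabbed his umbrella before leaving.",
    "It was a rainy afternoon in the bone-dry desert, so Tom grabbed sunscreen.",
]
labels = [1, 0, 1, 0]

X = torch.stack([final_token_state(s) for s in stories]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy (toy data):", probe.score(X, labels))
# A faithful replication would train on a held-out labeled split and report
# test accuracy, as in the paper's >80% probing results.
```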
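The prompt‑based rating and its Spearman ρ evaluation could look roughly like the sketch below; the prompt wording, openai client usage, and model id are illustrative assumptions, not the paper’s exact protocol.

```python
# Sketch of the prompt-based rating evaluation: ask a chat model for a 1-5
# coherence score per story, then correlate the scores with the gold labels
# using Spearman rho. Prompt wording and model id are illustrative assumptions.
import re
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RATING_PROMPT = (
    "On a scale of 1 (incoherent) to 5 (fully coherent), how coherent is this "
    "story? Answer with a single number.\n\nStory: {story}"
)

def rate_coherence(story: str, model: str = "gpt-3.5-turbo") -> int:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RATING_PROMPT.format(story=story)}],
        temperature=0,
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 3  # fall back to the midpoint

# Toy pair (1 = coherent, 0 = incoherent); the paper evaluates the full dataset.
stories = [
    "Mary is a committed vegetarian. At the diner she ordered a garden salad.",
    "Mary is a committed vegetarian. At the diner she ordered a cheeseburger.",
]
labels = [1, 0]

ratings = [rate_coherence(s) for s in stories]
rho, p_value = spearmanr(ratings, labels)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```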
Results & Findings
| Evaluation | Result | Takeaway |
|---|---|---|
| Probing accuracy (linear classifier on hidden states) | 84% (GPT‑3.5); 86% (Llama‑2) | Hidden states carry a strong coherence signal |
| Rating correlation (prompt‑based, Spearman ρ) | 0.31 on coherent stories vs. 0.12 on incoherent stories (GPT‑3.5) | Low; models often give similar scores to both versions |
| Effect of prompt style | Slight improvement with CoT (↑0.05) | Still insufficient to close the representation‑behavior gap |
| Setting vs. trait violations | Setting violations detected 70% of the time; trait violations only 45% | Indicates reliance on prototypical world knowledge |
Takeaway: LLMs “know” that something is off when you look inside the model, but they rarely express that knowledge when asked directly. Their judgments are biased toward obvious world‑knowledge mismatches and overlook subtler character‑consistency breaks.
Practical Implications
- Content‑generation tools – Automated story‑writing assistants (e.g., AI Dungeon, marketing copy generators) may produce narratives that feel coherent to the model but contain hidden inconsistencies that human readers will spot. Developers should add external consistency checks (e.g., rule‑based trait trackers; a toy sketch follows this list) rather than relying on the LLM’s own rating.
- Fact‑checking & QA pipelines – The representation‑behavior gap suggests that internal embeddings can be repurposed for anomaly detection (e.g., flagging contradictory statements) even if the model’s surface answer is vague.
- Prompt engineering – Simple rating prompts are unreliable; richer, multi‑step reasoning prompts (CoT) improve but do not close the gap. Teams building conversational agents should treat LLM self‑assessment as a soft signal, not a definitive verdict.
- Narrative AI research – The asymmetry between setting and trait violations points to a need for more nuanced world‑modeling (e.g., explicit character state representations) if we want LLMs to understand story logic like humans do.
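To make the “external consistency checks” suggestion above concrete, here is a toy rule‑based trait tracker; the trait lexicon and conflict rules are invented for illustration and are far simpler than a production validator would need.

```python
# Toy rule-based character-trait tracker: register declared traits, then flag
# later actions that contradict them. The trait lexicon and conflict rules are
# invented for illustration; a real validator would need far broader coverage.
from dataclasses import dataclass, field

# Minimal lexicon mapping a trait to action keywords that contradict it.
TRAIT_CONFLICTS = {
    "vegetarian": {"cheeseburger", "steak", "bacon"},
    "teetotaler": {"beer", "whiskey", "wine"},
}

@dataclass
class TraitTracker:
    traits: dict[str, set[str]] = field(default_factory=dict)  # character -> traits

    def declare(self, character: str, trait: str) -> None:
        self.traits.setdefault(character, set()).add(trait)

    def check_action(self, character: str, action: str) -> list[str]:
        """Return warnings if the action conflicts with the character's known traits."""
        warnings = []
        for trait in self.traits.get(character, set()):
            conflicts = TRAIT_CONFLICTS.get(trait, set())
            if any(word in action.lower() for word in conflicts):
                warnings.append(f"{character} is {trait} but the story has: '{action}'")
        return warnings

tracker = TraitTracker()
tracker.declare("Mary", "vegetarian")
print(tracker.check_action("Mary", "Mary ordered a cheeseburger and fries."))
# -> ["Mary is vegetarian but the story has: 'Mary ordered a cheeseburger and fries.'"]
```

A tracker like this would run alongside the LLM’s own coherence rating, catching exactly the character‑level breaks that the paper finds models tend to miss.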
Limitations & Future Work
- Scale of narratives – The study uses short, single‑paragraph stories; longer, multi‑scene narratives may exhibit different coherence dynamics.
- Model diversity – Only a handful of publicly available LLMs were examined; newer instruction‑tuned or retrieval‑augmented models might behave differently.
- Human baseline – The paper does not report a direct human‑vs‑model comparison on the rating task, leaving open how far the gap is from expert judgment.
- Future directions – The authors suggest integrating explicit narrative schemas, memory modules for character traits, and training objectives that directly penalize incoherent generations.
Bottom line: While LLMs embed strong signals that a story is off‑kilter, they often fail to surface that insight when asked. For developers building AI‑driven storytelling or consistency‑checking tools, this means supplementing LLM outputs with dedicated coherence validators and being cautious about trusting the model’s self‑assessment.
Authors
- Karin de Langis
- Püren Öncel
- Ryan Peters
- Andrew Elfenbein
- Laura Kristen Allen
- Andreas Schramm
- Dongyeop Kang
Paper Information
- arXiv ID: 2512.07777v1
- Categories: cs.CL
- Published: December 8, 2025