[Paper] Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?
Source: arXiv - 2512.07777v1
Overview
The paper “Mary, the Cheeseburger‑Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?” asks a deceptively simple question: can today’s large language models (LLMs) tell when a story doesn’t make sense? By pairing coherent and subtly incoherent short narratives, the authors probe whether LLMs’ internal representations and their outward responses align when it comes to spotting narrative breaks.
Key Contributions
- Dataset of paired narratives – 2,000 short stories where each coherent version has a minimally altered incoherent counterpart (e.g., a character acting against an established trait).
- Representation probing – Demonstrates that hidden states of several popular LLMs (GPT‑3.5, Llama‑2, Claude) encode enough signal to discriminate coherent vs. incoherent texts with >80% accuracy.
- Behavioral evaluation – Shows that when asked to rate story coherence, LLMs often fail to separate the two versions, even with varied prompts and chain‑of‑thought reasoning.
- Fine‑grained analysis of incoherence types – Finds that models are more sensitive to setting‑level violations (e.g., “rainy day in the desert”) than to character‑level trait violations (e.g., “vegetarian orders a cheeseburger”).
- Insight into the “representation‑behavior gap” – Highlights that strong internal signals do not automatically translate into reliable, user‑facing judgments.
Methodology
- Story Construction – Human annotators wrote short, self‑contained narratives (≈150 words). For each story, a single sentence was altered to create an incoherent version while keeping the rest identical.
- LLM Probing – Hidden-layer activations were extracted from the final token of each story, and a lightweight linear classifier was trained on a small labeled subset to predict coherence (a minimal sketch follows this list).
- Prompt‑Based Rating – The same LLMs were then asked, via zero‑shot and few‑shot prompts, to rate “How coherent is this story?” on a 1‑5 scale. Variations included direct questions, multiple‑choice formats, and chain‑of‑thought (CoT) reasoning prompts (see the rating sketch after this list).
- Incoherence Typology – Two categories were examined: setting violations (world‑knowledge contradictions) and character‑trait violations (behavioral inconsistencies).
- Evaluation Metrics – Classification accuracy for probing, correlation (Spearman ρ) between model ratings and ground‑truth labels, and statistical significance of differences across incoherence types.
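The probing step can be reproduced in miniature as follows: a minimal sketch assuming an open‑weight model served through Hugging Face transformers and a scikit‑learn logistic‑regression probe. The model name, layer choice, and toy stories are placeholders, not the authors’ exact configuration.

```python
# Minimal probing sketch: extract the final-token hidden state of each story
# and fit a linear probe to predict coherence. Model name, layer choice, and
# the toy stories/labels are illustrative assumptions, not the paper's setup.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder; swap in the open-weight model under study
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def final_token_state(text: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the last token of `text` at the chosen layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# Toy paired stories: 1 = coherent, 0 = incoherent (the paper uses its labeled subset).
stories = [
    "Mary is a committed vegetarian. At the diner she ordered a garden salad.",
    "Mary is a committed vegetarian. At the diner she ordered a cheeseburger.",
    "It was a rainy afternoon, so Tom grabbed his umbrella before leaving.",
    "It was a rainy afternoon in the bone-dry desert, so Tom grabbed sunscreen.",
]
labels = [1, 0, 1, 0]

X = torch.stack([final_token_state(s) for s in stories]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy (toy data):", probe.score(X, labels))
# A faithful replication would train on a held-out labeled split and report
# test accuracy, as in the paper's >80% probing results.
```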
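The prompt‑based rating and its Spearman ρ evaluation could look roughly like the sketch below; the prompt wording, openai client usage, and model id are illustrative assumptions, not the paper’s exact protocol.

```python
# Sketch of the prompt-based rating evaluation: ask a chat model for a 1-5
# coherence score per story, then correlate the scores with the gold labels
# using Spearman rho. Prompt wording and model id are illustrative assumptions.
import re
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RATING_PROMPT = (
    "On a scale of 1 (incoherent) to 5 (fully coherent), how coherent is this "
    "story? Answer with a single number.\n\nStory: {story}"
)

def rate_coherence(story: str, model: str = "gpt-3.5-turbo") -> int:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": RATING_PROMPT.format(story=story)}],
        temperature=0,
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 3  # fall back to the midpoint

# Toy pair (1 = coherent, 0 = incoherent); the paper evaluates the full dataset.
stories = [
    "Mary is a committed vegetarian. At the diner she ordered a garden salad.",
    "Mary is a committed vegetarian. At the diner she ordered a cheeseburger.",
]
labels = [1, 0]

ratings = [rate_coherence(s) for s in stories]
rho, p_value = spearmanr(ratings, labels)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```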
Results & Findings
| Evaluation | Result | Takeaway |
|---|---|---|
| Probing accuracy (linear classifier on hidden states) | 84% (GPT‑3.5); 86% (Llama‑2) | Hidden states carry a strong coherence signal |
| Rating correlation (prompt‑based, Spearman ρ) | 0.31 on coherent stories vs. 0.12 on incoherent stories (GPT‑3.5) | Low; models often give similar scores to both versions |
| Effect of prompt style | Slight improvement with CoT (↑0.05) | Still insufficient to close the representation‑behavior gap |
| Setting vs. trait violations | Setting violations detected 70% of the time; trait violations only 45% | Indicates reliance on prototypical world knowledge |
Takeaway: LLMs “know” that something is off when you look inside the model, but they rarely express that knowledge when asked directly. Their judgments are biased toward obvious world‑knowledge mismatches and overlook subtler character‑consistency breaks.
Practical Implications
- Content‑generation tools – Automated story‑writing assistants (e.g., AI Dungeon, marketing copy generators) may produce narratives that feel coherent to the model but contain hidden inconsistencies that human readers will spot. Developers should add external consistency checks (e.g., rule‑based trait trackers; a toy sketch follows this list) rather than relying on the LLM’s own rating.
- Fact‑checking & QA pipelines – The representation‑behavior gap suggests that internal embeddings can be repurposed for anomaly detection (e.g., flagging contradictory statements) even if the model’s surface answer is vague.
- Prompt engineering – Simple rating prompts are unreliable; richer, multi‑step reasoning prompts (CoT) improve but do not close the gap. Teams building conversational agents should treat LLM self‑assessment as a soft signal, not a definitive verdict.
- Narrative AI research – The asymmetry between setting and trait violations points to a need for more nuanced world‑modeling (e.g., explicit character state representations) if we want LLMs to understand story logic like humans do.
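To make the “external consistency checks” suggestion above concrete, here is a toy rule‑based trait tracker; the trait lexicon and conflict rules are invented for illustration and are far simpler than a production validator would need.

```python
# Toy rule-based character-trait tracker: register declared traits, then flag
# later actions that contradict them. The trait lexicon and conflict rules are
# invented for illustration; a real validator would need far broader coverage.
from dataclasses import dataclass, field

# Minimal lexicon mapping a trait to action keywords that contradict it.
TRAIT_CONFLICTS = {
    "vegetarian": {"cheeseburger", "steak", "bacon"},
    "teetotaler": {"beer", "whiskey", "wine"},
}

@dataclass
class TraitTracker:
    traits: dict[str, set[str]] = field(default_factory=dict)  # character -> traits

    def declare(self, character: str, trait: str) -> None:
        self.traits.setdefault(character, set()).add(trait)

    def check_action(self, character: str, action: str) -> list[str]:
        """Return warnings if the action conflicts with the character's known traits."""
        warnings = []
        for trait in self.traits.get(character, set()):
            conflicts = TRAIT_CONFLICTS.get(trait, set())
            if any(word in action.lower() for word in conflicts):
                warnings.append(f"{character} is {trait} but the story has: '{action}'")
        return warnings

tracker = TraitTracker()
tracker.declare("Mary", "vegetarian")
print(tracker.check_action("Mary", "Mary ordered a cheeseburger and fries."))
# -> ["Mary is vegetarian but the story has: 'Mary ordered a cheeseburger and fries.'"]
```

A tracker like this would run alongside the LLM’s own coherence rating, catching exactly the character‑level breaks that the paper finds models tend to miss.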
Limitations & Future Work
- Scale of narratives – The study uses short, single‑paragraph stories; longer, multi‑scene narratives may exhibit different coherence dynamics.
- Model diversity – Only a handful of publicly available LLMs were examined; newer instruction‑tuned or retrieval‑augmented models might behave differently.
- Human baseline – The paper does not report a direct human‑vs‑model comparison on the rating task, leaving open how far the gap is from expert judgment.
- Future directions – The authors suggest integrating explicit narrative schemas, memory modules for character traits, and training objectives that directly penalize incoherent generations.
Bottom line: While LLMs embed strong signals that a story is off‑kilter, they often fail to surface that insight when asked. For developers building AI‑driven storytelling or consistency‑checking tools, this means supplementing LLM outputs with dedicated coherence validators and being cautious about trusting the model’s self‑assessment.
Authors
- Karin de Langis
- Püren Öncel
- Ryan Peters
- Andrew Elfenbein
- Laura Kristen Allen
- Andreas Schramm
- Dongyeop Kang
Paper Information
- arXiv ID: 2512.07777v1
- Categories: cs.CL
- Published: December 8, 2025