[Paper] TALES: A Taxonomy and Analysis of Cultural Representations in LLM-generated Stories

Published: November 26, 2025 at 07:07 AM EST
3 min read

Source: arXiv - 2511.21322v1

Overview

The paper introduces TALES, a systematic study of how large language models (LLMs) portray Indian cultural identities in AI‑generated stories. By building a taxonomy of cultural misrepresentations and evaluating several popular models, the authors reveal that most generated narratives contain cultural errors—especially for less‑resourced languages and peri‑urban settings—while the models themselves often retain the underlying cultural knowledge.

Key Contributions

  • TALES‑Tax: A fine‑grained taxonomy of cultural misrepresentations derived from focus groups and surveys with people who have lived experience across India.
  • Large‑scale annotation effort: 2,925 story annotations collected from 108 annotators representing 71 Indian regions and 14 languages.
  • Empirical audit of six LLMs: Quantifies the prevalence of cultural inaccuracies across models, languages, and geographic story settings.
  • TALES‑QA: A curated question‑answer benchmark that isolates cultural knowledge, enabling direct evaluation of foundational models separate from story‑generation pipelines.
  • Insightful paradox: Models often know the correct cultural facts (as shown by TALES‑QA) yet still produce flawed stories, highlighting a gap between knowledge retrieval and generation.

Methodology

  1. Taxonomy creation – Conducted 9 focus‑group sessions and 15 individual surveys with participants from diverse Indian backgrounds. Their feedback was distilled into a hierarchical taxonomy (e.g., attire, food, festivals, social norms, dialectal cues).
  2. Story generation – Prompted six LLMs (including both open‑source and commercial APIs) to write short stories about characters situated in various Indian regions and languages.
  3. Annotation pipeline – Recruited 108 annotators who personally identify with the cultures depicted. Each story was reviewed for the presence of taxonomy‑defined misrepresentations, yielding 2,925 labeled instances.
  4. Quantitative analysis – Measured error rates across models, language resource levels (high vs. low), and story settings (urban, peri‑urban, rural); a minimal sketch of this audit loop follows this list.
  5. Knowledge probing – Converted the taxonomy items into 1,200 multiple‑choice questions (TALES‑QA) and evaluated the same models on pure factual recall, independent of story generation.
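To make steps 2, 4, and 5 more concrete, here is a minimal sketch of how such an audit loop could be organized. This is not the authors' code: the model names, the generate_story stub, and the annotation records are hypothetical placeholders, and in the actual study the misrepresentation labels come from human annotators rather than any automatic check.

```python
# Hypothetical sketch of the audit loop described above; not the authors' code.
# Stories are generated per (model, language, setting), human annotators attach
# taxonomy labels, and error rates are aggregated per group.
from collections import defaultdict

MODELS = ["model_a", "model_b"]           # placeholders for the six audited LLMs
LANGUAGES = ["Hindi", "Marathi"]          # subset of the 14 languages
SETTINGS = ["urban", "peri-urban", "rural"]

def generate_story(model: str, language: str, setting: str) -> str:
    """Stub standing in for an actual LLM call."""
    return f"[{model}] A short story in {language}, set in a {setting} area."

# 1) Generate stories for every (model, language, setting) combination.
stories = [
    {"model": m, "language": l, "setting": s, "text": generate_story(m, l, s)}
    for m in MODELS for l in LANGUAGES for s in SETTINGS
]

# 2) Human annotation step (illustrative records): each record lists the
#    taxonomy categories of misrepresentations found in one story, if any.
annotations = [
    {"story_idx": 0, "categories": []},                       # no issues found
    {"story_idx": 1, "categories": ["attire", "festivals"]},  # two issues
]

# 3) Aggregate: share of annotated stories with at least one issue, by setting.
flagged, totals = defaultdict(int), defaultdict(int)
for ann in annotations:
    setting = stories[ann["story_idx"]]["setting"]
    totals[setting] += 1
    flagged[setting] += bool(ann["categories"])

for setting in totals:
    rate = flagged[setting] / totals[setting]
    print(f"{setting}: {rate:.0%} of annotated stories contain an inaccuracy")
```

The same grouping step can be keyed on model or language instead of setting, which is how the per-model and per-language comparisons in the Results section are obtained.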

Results & Findings

  • 88% of generated stories contain at least one cultural inaccuracy.
  • Error frequency is higher for mid‑ and low‑resource Indian languages (e.g., Marathi, Bengali) compared to high‑resource ones (e.g., Hindi, English).
  • Stories set in peri‑urban regions show the greatest misrepresentation rates, suggesting models are biased toward stereotypical urban narratives.
  • On TALES‑QA, many models achieve 70–85% accuracy, indicating they possess the factual cultural knowledge.
  • The discrepancy implies that the generation pipeline (prompt handling, decoding strategies) often fails to surface the correct knowledge.

Practical Implications

  • Product teams building AI‑driven storytelling, chatbots, or virtual assistants for Indian markets should integrate a cultural‑validation layer (e.g., post‑generation checks using TALES‑Tax or TALES‑QA; a minimal sketch follows this list).
  • Prompt engineering: Explicitly specifying cultural details (region, language, customs) can mitigate some errors, but systematic safeguards are still needed.
  • Fine‑tuning & RLHF: Incorporating culturally diverse, high‑quality datasets and reinforcement learning from culturally aware human feedback can help close the knowledge‑generation gap.
  • Localization pipelines: For multilingual products, prioritize higher‑quality data and evaluation for low‑resource languages to avoid perpetuating stereotypes.
  • Compliance & ethics: Companies can use TALES‑Tax as an audit checklist to demonstrate responsible AI practices when deploying LLMs in culturally sensitive contexts.
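As one possible realization of the validation-layer and prompt-engineering suggestions above, the sketch below pairs a culturally explicit prompt template with a post-generation checklist. The checker functions and keyword lists are hypothetical stand-ins; a real deployment would back each check with TALES‑Tax categories or TALES‑QA-style probes rather than simple keyword heuristics.

```python
# Hypothetical post-generation cultural-validation layer; illustrative only.

PROMPT_TEMPLATE = (
    "Write a short story in {language} about a character living in a "
    "{setting} town in {region}. Be specific and accurate about local attire, "
    "food, festivals, and everyday social norms; avoid generic stereotypes."
)

def check_attire(story: str, region: str) -> list[str]:
    """Toy check: flag attire terms implausible for the region."""
    implausible = {"parka", "kimono"}          # placeholder keyword list
    return [w for w in implausible if w in story.lower()]

def check_festivals(story: str, region: str) -> list[str]:
    """Toy check: flag festival mentions not associated with the region."""
    implausible = {"oktoberfest"}              # placeholder keyword list
    return [w for w in implausible if w in story.lower()]

CHECKS = [check_attire, check_festivals]       # one checker per taxonomy branch

def validate_story(story: str, region: str) -> dict:
    """Run all checks and return flagged issues for human review."""
    issues = {c.__name__: c(story, region) for c in CHECKS}
    return {name: hits for name, hits in issues.items() if hits}

prompt = PROMPT_TEMPLATE.format(language="Marathi", setting="peri-urban",
                                region="Pune district")
story = "She wore a parka to the Oktoberfest celebration."  # stand-in LLM output
print(prompt)
print(validate_story(story, region="Pune district"))
```

Stories that trip any check can be routed to regeneration or human review rather than shipped as-is, which is the kind of safeguard the validation-layer bullet above argues for.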

Limitations & Future Work

  • The study focuses exclusively on Indian cultural identities; extending the taxonomy to other regions will be necessary for global applicability.
  • Annotation relied on self‑reported lived experience, which, while valuable, may not capture the full spectrum of intra‑regional variation.
  • Only six models were examined; newer or more specialized LLMs could behave differently.
  • Future research could explore automated detection of cultural misrepresentations, integrate real‑time correction mechanisms, and assess the impact of instruction tuning on reducing these errors.

Authors

  • Kirti Bhagat
  • Shaily Bhatt
  • Athul Velagapudi
  • Aditya Vashistha
  • Shachi Dave
  • Danish Pruthi

Paper Information

  • arXiv ID: 2511.21322v1
  • Categories: cs.HC, cs.AI, cs.CL, cs.CY
  • Published: November 26, 2025