[Paper] Unmasking the Factual-Conceptual Gap in Persian Language Models

Published: February 19, 2026
4 min read
Source: arXiv - 2602.17623v1

Overview

The paper “Unmasking the Factual‑Conceptual Gap in Persian Language Models” investigates a hidden weakness in Persian‑language large language models (LLMs): they can often recall cultural facts but struggle to apply that knowledge when reasoning about social norms, superstitions, and customs. By introducing a new diagnostic benchmark, DivanBench, the authors expose systematic biases and reasoning failures that have direct consequences for any product that relies on culturally aware Persian NLP.

Key Contributions

  • DivanBench: a 315‑question benchmark covering three task formats (pure fact retrieval, paired‑scenario verification, and situational reasoning) focused on Persian superstitions, customs, and context‑dependent social rules.
  • Comprehensive evaluation of seven publicly available Persian LLMs, revealing consistent patterns of error across model sizes and training regimes.
  • Identification of an “acquiescence bias”: models readily accept culturally appropriate actions but systematically fail to reject clearly inappropriate ones.
  • Evidence that continued Persian pre‑training can worsen reasoning ability, amplifying the bias rather than improving cultural understanding.
  • Quantification of a 21‑percentage‑point “factual‑conceptual gap”: the drop in accuracy when moving from recalling cultural facts to applying those facts in realistic scenarios.

Methodology

  1. Benchmark Design – The authors curated culturally rich items (e.g., “Is it acceptable to eat garlic before a wedding?”) and split them into three formats:
    • Factual Retrieval: direct question‑answer pairs requiring a single fact.
    • Paired Scenario Verification: two contrasting statements, one correct, one violating a norm; the model must pick the correct one.
    • Situational Reasoning: multi‑step prompts that ask the model to reason about a scenario using the retrieved fact.
  2. Model Selection – Seven Persian LLMs were tested, ranging from base‑size to instruction‑tuned variants, including models that have undergone additional Persian‑language pre‑training.
  3. Evaluation Protocol – Accuracy was measured for each task type. For the paired and situational tasks, the authors also computed a bias score that captures the tendency to always choose the culturally “positive” option.
  4. Analysis – Performance gaps were broken down by model size, training data volume, and whether the model had been instruction‑tuned, allowing the authors to isolate the impact of continued monolingual pre‑training.
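The evaluation protocol above can be sketched in a few lines. The paper's exact bias-score formula is not given in this summary, so the definition below is an assumption: acquiescence bias is measured as the model's "accept" rate minus the gold "accept" rate, so 0 means calibrated and positive values mean over-acceptance. `PairedItem` and all data are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PairedItem:
    gold: str        # "accept" or "reject" -- the culturally correct judgment
    prediction: str  # the model's judgment on the same scenario

def accuracy(items):
    """Fraction of items judged correctly."""
    return sum(it.prediction == it.gold for it in items) / len(items)

def acquiescence_bias(items):
    """Model's overall "accept" rate minus the gold "accept" rate.
    Positive values indicate a tendency to accept scenarios that
    should be rejected."""
    said_accept = sum(it.prediction == "accept" for it in items) / len(items)
    gold_accept = sum(it.gold == "accept" for it in items) / len(items)
    return said_accept - gold_accept

# Toy run: one over-acceptance error out of four items.
items = [
    PairedItem("accept", "accept"),
    PairedItem("reject", "accept"),   # model accepts a norm violation
    PairedItem("reject", "reject"),
    PairedItem("accept", "accept"),
]
print(accuracy(items))           # 0.75
print(acquiescence_bias(items))  # 0.25
```

A difference-of-rates score like this is one simple way to separate "the model is often right" from "the model is right only when the answer is yes", which is the failure mode the paper isolates.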

Results & Findings

  • Acquiescence Bias: Across all models, accuracy on “positive” (acceptable) scenarios was ~85 % while accuracy on “negative” (unacceptable) scenarios lagged at ~45 %.
  • Pre‑training Paradox: Models that received extra Persian pre‑training showed a ~7 % increase in bias and a ~3 % drop in overall situational‑reasoning accuracy compared with their base counterparts.
  • Factual‑Conceptual Gap: While average factual‑retrieval accuracy hovered around 78 %, situational‑reasoning accuracy fell to 57 %, a 21‑percentage‑point gap that persisted even in the largest models.
  • Instruction‑Tuning Helps Slightly: Instruction‑tuned variants reduced bias by ~5 % but still fell short of closing the factual‑conceptual gap.
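The headline gaps above are simple differences of the reported accuracies (both are read here as percentage points, not relative drops):

```python
# Headline accuracies reported in the paper (percent).
factual_acc = 78.0      # fact-retrieval accuracy
situational_acc = 57.0  # situational-reasoning accuracy
pos_acc, neg_acc = 85.0, 45.0  # acceptable vs. unacceptable scenarios

gap = factual_acc - situational_acc  # factual-conceptual gap
polarity_gap = pos_acc - neg_acc     # acquiescence asymmetry

print(f"factual-conceptual gap: {gap:.0f} points")        # 21 points
print(f"positive/negative asymmetry: {polarity_gap:.0f} points")  # 40 points
```

Note that the 40-point asymmetry between polarities is nearly twice the size of the factual-conceptual gap itself, which is why the authors treat the bias as the dominant failure mode.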

Practical Implications

  • Chatbots & Virtual Assistants – Deploying Persian LLMs in customer‑facing bots without addressing this bias could lead to socially tone‑deaf responses (e.g., endorsing inappropriate customs).
  • Content Moderation – Automated moderation tools that rely on LLM judgments may miss culturally sensitive violations, increasing the risk of platform misuse.
  • Localization Pipelines – Companies translating UI text or generating culturally tailored marketing copy should not assume that a high‑performing Persian LLM automatically understands local etiquette.
  • Model‑as‑a‑Service – Service providers need to expose “cultural‑reasoning” health checks (similar to DivanBench) as part of their SLA to assure enterprise customers.
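A DivanBench-style health check, as suggested in the last bullet, could be exposed as a simple pass/fail gate. Everything here is a hypothetical sketch, not an API from the paper: `cultural_health_check`, its thresholds, and the probe format are assumptions, and the bias score is again defined as accept-rate minus gold accept-rate.

```python
def cultural_health_check(score_fn, probes, max_bias=0.10, min_acc=0.70):
    """Run a small probe set through a model and gate on accuracy and bias.

    score_fn(prompt) -> "accept" or "reject"  (the model under test)
    probes: list of {"prompt": str, "gold": "accept" | "reject"}
    Returns (passed, accuracy, bias).
    """
    preds = [(score_fn(p["prompt"]), p["gold"]) for p in probes]
    acc = sum(pred == gold for pred, gold in preds) / len(preds)
    bias = (sum(pred == "accept" for pred, _ in preds)
            - sum(gold == "accept" for _, gold in preds)) / len(preds)
    return acc >= min_acc and abs(bias) <= max_bias, acc, bias

# A degenerate model that accepts everything fails the gate.
probes = [{"prompt": "...", "gold": g}
          for g in ["accept", "reject", "reject", "accept"]]
passed, acc, bias = cultural_health_check(lambda p: "accept", probes)
print(passed, acc, bias)  # False 0.5 0.5
```

A gate like this is cheap to run per deployment and catches exactly the failure the paper documents: a model that looks accurate on average but never rejects anything.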

Limitations & Future Work

  • Scope of Cultural Domains – DivanBench focuses on superstitions and customs; other cultural dimensions (e.g., religious discourse, regional dialects) remain untested.
  • Benchmark Size – 315 items provide a solid diagnostic signal but may not capture the full variability of real‑world interactions.
  • Model Diversity – Only publicly released Persian LLMs were evaluated; proprietary or multimodal models could behave differently.
  • Future Directions – The authors suggest augmenting training data with contrastive cultural examples, integrating explicit knowledge graphs of Persian customs, and developing fine‑tuning objectives that penalize acquiescence bias.

Bottom line: Scaling up Persian language data alone isn’t enough. To build truly culturally competent AI, developers must go beyond memorizing facts and embed reasoning mechanisms that can differentiate “right” from “wrong” in context‑rich social settings. DivanBench offers a practical yardstick for measuring progress toward that goal.

Authors

  • Alireza Sakhaeirad
  • Ali Ma’manpoosh
  • Arshia Hemmat

Paper Information

  • arXiv ID: 2602.17623v1
  • Categories: cs.CL
  • Published: February 19, 2026
