[Paper] Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs

Published: February 6, 2026 at 01:16 PM EST
4 min read
Source: arXiv - 2602.06920v1

Overview

The paper introduces Halluverse‑M³, a new multilingual benchmark that lets researchers and engineers systematically probe how large language models (LLMs) hallucinate, i.e., generate content that is factually inaccurate or unrelated to the input. By covering four languages (English, Arabic, Hindi, Turkish) and two generation tasks (question answering and dialogue summarisation), the dataset shines a light on gaps that standard English‑only benchmarks miss, helping the community build more reliable, globally usable AI systems.

Key Contributions

  • Multilingual scope – First hallucination benchmark that spans high‑resource (English) and lower‑resource (Arabic, Hindi, Turkish) languages.
  • Multi‑task design – Includes both extractive (QA) and abstractive (dialogue summarisation) generation tasks, exposing different failure modes.
  • Fine‑grained hallucination taxonomy – Separates hallucinations into entity‑level, relation‑level, and sentence‑level categories, enabling precise diagnostics.
  • Controlled hallucination generation – Uses a systematic editing pipeline plus human validation to guarantee that each “hallucinated” output has a known ground‑truth deviation from the source.
  • Comprehensive evaluation – Benchmarks a wide range of open‑source and commercial LLMs on detection accuracy, revealing language‑ and task‑specific performance trends.
  • Open release – Dataset and evaluation scripts are publicly available on Hugging Face, encouraging reproducibility and downstream research.

Methodology

  1. Data collection – The authors start from existing QA and dialogue corpora in the four target languages.
  2. Hallucination injection – For each source instance, they apply a rule‑based editing process that deliberately introduces factual errors at three granularities:
    • Entity: swap or insert wrong named entities.
    • Relation: alter the relationship between correct entities.
    • Sentence: replace whole sentences with unrelated but fluent text.
  3. Human validation – Crowdworkers verify that the edited outputs indeed contain the intended hallucination type and that the rest of the text remains coherent.
  4. Benchmark construction – The final dataset pairs the original input, the hallucinated generation, and a label indicating the hallucination category.
  5. Evaluation protocol – Models are prompted to produce a generation (or are given a pre‑generated one) and then asked to flag whether it contains a hallucination. Detection performance is measured with standard metrics (accuracy, F1) per language, task, and hallucination level.
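The entity‑level step of the injection pipeline (step 2 above) can be sketched as a small Python function. This is a minimal illustration only: the entity pool, the function name, and the label fields are hypothetical, and the paper's actual editing process is rule‑based and human‑validated rather than this simple string swap.

```python
import random

# Hypothetical per-language entity pools; the paper's real pipeline
# draws on richer resources -- this only sketches the entity-level swap.
ENTITY_POOL = {"en": ["Paris", "Berlin", "Madrid"]}

def inject_entity_hallucination(text, lang, seed=0):
    """Swap the first pooled entity found in `text` for a different one,
    returning the edited text plus a ground-truth label record."""
    rng = random.Random(seed)
    present = [e for e in ENTITY_POOL[lang] if e in text]
    if not present:
        return text, None  # nothing to edit
    target = present[0]
    replacement = rng.choice([e for e in ENTITY_POOL[lang] if e != target])
    edited = text.replace(target, replacement, 1)
    return edited, {"level": "entity",
                    "original": target,
                    "hallucinated": replacement}

edited, label = inject_entity_hallucination(
    "The Eiffel Tower is in Paris.", "en")
```

Because the deviation is introduced deliberately, every benchmark instance carries a known ground‑truth label, which is what makes detection accuracy measurable in step 5.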

Results & Findings

Language   Task      Best detection accuracy (overall)
English    QA        ~92%
English    Summ.     ~78%
Arabic     QA        ~85%
Arabic     Summ.     ~70%
Hindi      QA        ~78%
Hindi      Summ.     ~62%
Turkish    QA        ~84%
Turkish    Summ.     ~73%
  • Task difficulty – QA is consistently easier for models to verify than dialogue summarisation, likely because the answer span is more constrained.
  • Hallucination granularity – Sentence‑level hallucinations are the hardest to detect across all languages; entity‑level errors are caught relatively well.
  • Language disparity – Performance drops noticeably as we move from English to lower‑resource languages, with Hindi showing the steepest decline.
  • Model variance – Even the strongest proprietary models (e.g., GPT‑4‑style) struggle with sentence‑level errors in non‑English settings, highlighting a systemic gap rather than a single model flaw.
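Per‑language, per‑granularity scores like those above come from standard binary detection metrics. A minimal sketch of grouped F1 computation, assuming each prediction record carries a language, a hallucination level, a gold label, and a model verdict (the record schema and function name are illustrative, not the paper's code):

```python
from collections import defaultdict

def grouped_f1(records):
    """Binary F1 for hallucination detection, grouped by (language, level).
    Each record: {"lang": ..., "level": ..., "gold": bool, "pred": bool}."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in records:
        c = counts[(r["lang"], r["level"])]
        if r["pred"] and r["gold"]:
            c["tp"] += 1          # correctly flagged hallucination
        elif r["pred"] and not r["gold"]:
            c["fp"] += 1          # clean text wrongly flagged
        elif r["gold"] and not r["pred"]:
            c["fn"] += 1          # hallucination missed
    scores = {}
    for key, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        rec = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        scores[key] = 2 * p * rec / (p + rec) if p + rec else 0.0
    return scores
```

Slicing the scores this way is what surfaces the language and granularity disparities reported above.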

Practical Implications

  • Better QA assistants – Developers building multilingual chatbots can use Halluverse‑M³ to fine‑tune hallucination detectors, reducing the risk of providing users with incorrect answers.
  • Content moderation pipelines – The fine‑grained taxonomy helps prioritize which types of errors need stricter checks (e.g., sentence‑level fabrications in summarisation tools).
  • Model selection for global products – Benchmarks reveal that a model that excels in English may underperform dramatically in Hindi or Arabic; teams can make more informed trade‑offs or allocate resources for language‑specific fine‑tuning.
  • Dataset‑driven debugging – By exposing where a model fails (entity vs. relation vs. sentence), engineers can target data augmentation or prompt‑engineering strategies to patch specific weaknesses.
  • Open‑source community benefit – Since the dataset is freely available, startups and research labs can benchmark their own LLMs without needing costly proprietary data, accelerating the development of more trustworthy multilingual AI.
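The dataset‑driven debugging workflow above can start with something as simple as tallying missed hallucinations by taxonomy level. A hedged sketch (the record schema and function name are assumptions, not part of the released tooling):

```python
from collections import Counter

def failure_profile(records):
    """Count missed hallucinations (gold positive, model said clean)
    per taxonomy level, to decide where to target augmentation or
    prompt-engineering effort."""
    misses = Counter(r["level"] for r in records
                     if r["gold"] and not r["pred"])
    return misses.most_common()  # worst level first
```

If sentence‑level misses dominate, for example, that points toward augmenting training data with whole‑sentence fabrications rather than entity swaps.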

Limitations & Future Work

  • Scope of languages – Only four languages are covered; many low‑resource languages (e.g., Swahili, Bengali) remain untested.
  • Task variety – The benchmark focuses on QA and dialogue summarisation; other generation scenarios such as code synthesis or creative writing are not addressed.
  • Synthetic hallucinations – Although human‑validated, the hallucinations are artificially injected, which may not capture all real‑world error patterns that arise during unconstrained generation.
  • Detection‑only focus – The study evaluates detection accuracy but does not explore mitigation techniques (e.g., post‑editing or self‑correction).
  • Model prompting consistency – Different models were queried with slightly different prompts, which could affect fairness; a unified prompting framework would improve comparability.

Halluverse‑M³ opens a much‑needed window into multilingual hallucination behavior. By providing a rigorously constructed, openly shared benchmark, it equips developers with the tools to diagnose, detect, and eventually curb factual errors in the next generation of globally‑deployed LLMs.

Authors

  • Samir Abdaljalil
  • Parichit Sharma
  • Erchin Serpedin
  • Hasan Kurban

Paper Information

  • arXiv ID: 2602.06920v1
  • Categories: cs.CL, cs.AI
  • Published: February 6, 2026