[Paper] Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs

Published: February 6, 2026 at 01:16 PM EST
4 min read
Source: arXiv - 2602.06920v1

Overview

The paper introduces Halluverse‑M³, a new multilingual benchmark that lets researchers and engineers systematically probe how large language models (LLMs) hallucinate, i.e., generate content that is factually inaccurate or unrelated to the input. By covering four languages (English, Arabic, Hindi, Turkish) and two generation tasks (question answering and dialogue summarisation), the dataset shines a light on gaps that standard English‑only benchmarks miss, helping the community build more reliable, globally usable AI systems.

Key Contributions

  • Multilingual scope – First hallucination benchmark that spans high‑resource (English) and lower‑resource (Arabic, Hindi, Turkish) languages.
  • Multi‑task design – Includes both extractive (QA) and abstractive (dialogue summarisation) generation tasks, exposing different failure modes.
  • Fine‑grained hallucination taxonomy – Separates hallucinations into entity‑level, relation‑level, and sentence‑level categories, enabling precise diagnostics.
  • Controlled hallucination generation – Uses a systematic editing pipeline plus human validation to guarantee that each “hallucinated” output has a known ground‑truth deviation from the source.
  • Comprehensive evaluation – Benchmarks a wide range of open‑source and commercial LLMs on detection accuracy, revealing language‑ and task‑specific performance trends.
  • Open release – Dataset and evaluation scripts are publicly available on Hugging Face, encouraging reproducibility and downstream research.

Methodology

  1. Data collection – The authors start from existing QA and dialogue corpora in the four target languages.
  2. Hallucination injection – For each source instance, they apply a rule‑based editing process that deliberately introduces factual errors at three granularities:
    • Entity: swap or insert wrong named entities.
    • Relation: alter the relationship between correct entities.
    • Sentence: replace whole sentences with unrelated but fluent text.
  3. Human validation – Crowdworkers verify that the edited outputs indeed contain the intended hallucination type and that the rest of the text remains coherent.
  4. Benchmark construction – The final dataset pairs the original input, the hallucinated generation, and a label indicating the hallucination category.
  5. Evaluation protocol – Models are prompted to produce a generation (or are given a pre‑generated one) and then asked to flag whether it contains a hallucination. Detection performance is measured with standard metrics (accuracy, F1) per language, task, and hallucination level.
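The entity‑level step of the injection pipeline (step 2 above) can be sketched as a small Python function. This is a minimal illustration only: the entity pool, the function name, and the label fields are hypothetical, and the paper's actual editing process is rule‑based and human‑validated rather than this simple string swap.

```python
import random

# Hypothetical per-language entity pools; the paper's real pipeline
# draws on richer resources -- this only sketches the entity-level swap.
ENTITY_POOL = {"en": ["Paris", "Berlin", "Madrid"]}

def inject_entity_hallucination(text, lang, seed=0):
    """Swap the first pooled entity found in `text` for a different one,
    returning the edited text plus a ground-truth label record."""
    rng = random.Random(seed)
    present = [e for e in ENTITY_POOL[lang] if e in text]
    if not present:
        return text, None  # nothing to edit
    target = present[0]
    replacement = rng.choice([e for e in ENTITY_POOL[lang] if e != target])
    edited = text.replace(target, replacement, 1)
    return edited, {"level": "entity",
                    "original": target,
                    "hallucinated": replacement}

edited, label = inject_entity_hallucination(
    "The Eiffel Tower is in Paris.", "en")
```

Because the deviation is introduced deliberately, every benchmark instance carries a known ground‑truth label, which is what makes detection accuracy measurable in step 5.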

Results & Findings

Language   Task      Best detection accuracy (overall)
English    QA        ~92%
English    Summ.     ~78%
Arabic     QA        ~85%
Arabic     Summ.     ~70%
Hindi      QA        ~78%
Hindi      Summ.     ~62%
Turkish    QA        ~84%
Turkish    Summ.     ~73%
  • Task difficulty – QA is consistently easier for models to verify than dialogue summarisation, likely because the answer span is more constrained.
  • Hallucination granularity – Sentence‑level hallucinations are the hardest to detect across all languages; entity‑level errors are caught relatively well.
  • Language disparity – Performance drops noticeably as we move from English to lower‑resource languages, with Hindi showing the steepest decline.
  • Model variance – Even the strongest proprietary models (e.g., GPT‑4‑style) struggle with sentence‑level errors in non‑English settings, highlighting a systemic gap rather than a single model flaw.
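Per‑language, per‑granularity scores like those above come from standard binary detection metrics. A minimal sketch of grouped F1 computation, assuming each prediction record carries a language, a hallucination level, a gold label, and a model verdict (the record schema and function name are illustrative, not the paper's code):

```python
from collections import defaultdict

def grouped_f1(records):
    """Binary F1 for hallucination detection, grouped by (language, level).
    Each record: {"lang": ..., "level": ..., "gold": bool, "pred": bool}."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in records:
        c = counts[(r["lang"], r["level"])]
        if r["pred"] and r["gold"]:
            c["tp"] += 1          # correctly flagged hallucination
        elif r["pred"] and not r["gold"]:
            c["fp"] += 1          # clean text wrongly flagged
        elif r["gold"] and not r["pred"]:
            c["fn"] += 1          # hallucination missed
    scores = {}
    for key, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        rec = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        scores[key] = 2 * p * rec / (p + rec) if p + rec else 0.0
    return scores
```

Slicing the scores this way is what surfaces the language and granularity disparities reported above.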

Practical Implications

  • Better QA assistants – Developers building multilingual chatbots can use Halluverse‑M³ to fine‑tune hallucination detectors, reducing the risk of providing users with incorrect answers.
  • Content moderation pipelines – The fine‑grained taxonomy helps prioritize which types of errors need stricter checks (e.g., sentence‑level fabrications in summarisation tools).
  • Model selection for global products – Benchmarks reveal that a model that excels in English may underperform dramatically in Hindi or Arabic; teams can make more informed trade‑offs or allocate resources for language‑specific fine‑tuning.
  • Dataset‑driven debugging – By exposing where a model fails (entity vs. relation vs. sentence), engineers can target data augmentation or prompt‑engineering strategies to patch specific weaknesses.
  • Open‑source community benefit – Since the dataset is freely available, startups and research labs can benchmark their own LLMs without needing costly proprietary data, accelerating the development of more trustworthy multilingual AI.
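The dataset‑driven debugging workflow above can start with something as simple as tallying missed hallucinations by taxonomy level. A hedged sketch (the record schema and function name are assumptions, not part of the released tooling):

```python
from collections import Counter

def failure_profile(records):
    """Count missed hallucinations (gold positive, model said clean)
    per taxonomy level, to decide where to target augmentation or
    prompt-engineering effort."""
    misses = Counter(r["level"] for r in records
                     if r["gold"] and not r["pred"])
    return misses.most_common()  # worst level first
```

If sentence‑level misses dominate, for example, that points toward augmenting training data with whole‑sentence fabrications rather than entity swaps.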

Limitations & Future Work

  • Scope of languages – Only four languages are covered; many low‑resource languages (e.g., Swahili, Bengali) remain untested.
  • Task variety – The benchmark focuses on QA and dialogue summarisation; other generation scenarios such as code synthesis or creative writing are not addressed.
  • Synthetic hallucinations – Although human‑validated, the hallucinations are artificially injected, which may not capture all real‑world error patterns that arise during unconstrained generation.
  • Detection‑only focus – The study evaluates detection accuracy but does not explore mitigation techniques (e.g., post‑editing or self‑correction).
  • Model prompting consistency – Different models were queried with slightly different prompts, which could affect fairness; a unified prompting framework would improve comparability.

Halluverse‑M³ opens a much‑needed window into multilingual hallucination behavior. By providing a rigorously constructed, openly shared benchmark, it equips developers with the tools to diagnose, detect, and eventually curb factual errors in the next generation of globally‑deployed LLMs.

Authors

  • Samir Abdaljalil
  • Parichit Sharma
  • Erchin Serpedin
  • Hasan Kurban

Paper Information

  • arXiv ID: 2602.06920v1
  • Categories: cs.CL, cs.AI
  • Published: February 6, 2026