[Paper] Multilingual Large Language Models do not comprehend all natural languages to equal degrees

Published: February 23, 2026

Source: arXiv - 2602.20065v1

Overview

A recent study probes how well three popular multilingual large language models (LLMs) actually understand a diverse set of natural languages. By testing the models on a language‑comprehension benchmark covering 12 typologically distinct languages, the authors reveal that performance varies widely—and surprisingly, English is not the strongest language for any of the models.

Key Contributions

  • Cross‑lingual evaluation of three state‑of‑the‑art multilingual LLMs on a unified comprehension task spanning Indo‑European, Afro‑Asiatic, Turkic, Sino‑Tibetan, and Japonic families.
  • Human baseline comparison, showing that all models lag behind native speakers, but the gap differs per language.
  • Counter‑intuitive finding that several Romance languages (including lower‑resource ones) consistently outperform English.
  • Systematic analysis of factors influencing performance: tokenization granularity, linguistic distance from English/Spanish, volume and provenance of training data, and the WEIRD (Western, Educated, Industrialized, Rich, Democratic) vs. non‑WEIRD data split.
  • Open‑source benchmark and prompting scripts to enable reproducibility and future extensions.

Methodology

  1. Task selection – The authors used a language‑comprehension benchmark that asks models to answer multiple‑choice questions based on short passages (e.g., “Which sentence best continues the story?”). The task is language‑agnostic and measures pure understanding rather than generation.
  2. Model suite – Three widely used multilingual LLMs were evaluated:
    • LLaMA‑2‑13B‑Chat (open‑source)
    • Mistral‑7B‑Instruct (open‑source)
    • GPT‑4‑Turbo (closed‑source, accessed via API)
  3. Prompt engineering – A zero‑shot prompting template was crafted in each target language, keeping the wording identical across languages to avoid bias.
  4. Languages – Twelve languages were chosen to represent five language families and a spectrum of resource levels: English, Spanish, French, Italian, Portuguese, Arabic, Turkish, Mandarin, Japanese, Amharic, Kurdish, and Basque.
  5. Human baseline – Native speakers answered the same questions, providing an upper‑bound for performance.
  6. Analysis – Accuracy scores were correlated with metadata such as token‑vocab size per language, amount of pre‑training data (estimated from public corpora), and linguistic distance metrics (Levenshtein distance, typological features).
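The evaluation loop described in steps 1–5 can be sketched as follows. The prompt template, option labels, and scoring function are illustrative assumptions, not the authors' released scripts:

```python
# Sketch of a zero-shot multiple-choice evaluation loop like the one
# described above. Prompt wording and scoring are hypothetical.

def build_prompt(passage: str, question: str, options: list[str]) -> str:
    """Render a zero-shot multiple-choice prompt (identical wording per language)."""
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return f"{passage}\n\n{question}\n{labeled}\nAnswer:"

def score(predictions: list[str], gold: list[str]) -> float:
    """Accuracy over answer letters ('A', 'B', ...)."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

prompt = build_prompt(
    "Maria opened the door.",
    "Which sentence best continues the story?",
    ["She stepped inside.", "The moon is made of cheese."],
)
accuracy = score(["A", "B", "A"], ["A", "A", "A"])  # 2 of 3 correct
```

Because the task is multiple‑choice, accuracy is directly comparable across languages and against the human baseline, which is what makes the cross‑lingual analysis possible.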

Results & Findings

| Language   | GPT‑4‑Turbo | LLaMA‑2‑13B‑Chat | Mistral‑7B‑Instruct | Human Baseline |
|------------|-------------|------------------|---------------------|----------------|
| English    | 78 %        | 71 %             | 69 %                | 96 %           |
| Spanish    | 82 %        | 75 %             | 73 %                | 97 %           |
| French     | 80 %        | 73 %             | 71 %                | 96 %           |
| Italian    | 79 %        | 72 %             | 70 %                | 95 %           |
| Portuguese | 78 %        | 71 %             | 69 %                | 95 %           |
| Arabic     | 65 %        | 58 %             | 56 %                | 92 %           |
| Turkish    | 63 %        | 55 %             | 53 %                | 90 %           |
| Mandarin   | 60 %        | 52 %             | 50 %                | 93 %           |
| Japanese   | 58 %        | 51 %             | 49 %                | 94 %           |
| Amharic    | 52 %        | 44 %             | 42 %                | 88 %           |
| Kurdish    | 55 %        | 47 %             | 45 %                | 89 %           |
| Basque     | 57 %        | 49 %             | 47 %                | 90 %           |

Key take‑aways

  • Romance languages consistently match or beat English across all three models, with Spanish leading the pack.
  • Performance correlates strongly with token‑vocab coverage: languages with richer subword tokenization (e.g., Spanish) achieve higher accuracy.
  • Training‑data volume matters, but the relationship is not linear; a modest amount of high‑quality data (as in many Romance languages) can outweigh larger but noisier corpora.
  • Linguistic distance from English/Spanish explains part of the variance—languages that share morphology or word order with the model’s dominant training languages fare better.
  • All models lag behind humans, confirming that current multilingual LLMs are still far from true comprehension.
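The correlation claim above can be made concrete with a rank correlation between a tokenizer‑coverage proxy and per‑model accuracy. The tokens‑per‑word figures below are illustrative placeholders, not the paper's measurements; the accuracy values are the GPT‑4‑Turbo scores from the table:

```python
# Minimal sketch of the correlation analysis: Spearman's rho between a
# tokenizer-coverage proxy (tokens per word; lower = better coverage)
# and benchmark accuracy. Coverage numbers are illustrative.

def rank(values: list[float]) -> list[float]:
    """1-based ranks; ties broken by input order (fine for distinct values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho = Pearson correlation computed on the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical tokens-per-word for es, en, tr, zh, am vs. GPT-4-Turbo accuracy
tokens_per_word = [1.3, 1.2, 2.1, 2.4, 3.0]
accuracy = [82.0, 78.0, 63.0, 60.0, 52.0]
rho = spearman(tokens_per_word, accuracy)  # strongly negative
```

A strongly negative rho here would mean that languages whose words fragment into more subword tokens score lower, matching the paper's tokenization‑granularity finding.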

Practical Implications

  1. Product localization – Companies relying on LLMs for multilingual chatbots or content generation should not assume English‑level quality for all languages. Romance‑language markets may already be ready for near‑production use, while Arabic, Mandarin, or Amharic may need additional post‑processing or human‑in‑the‑loop safeguards.
  2. Prompt design – Tokenization‑aware prompting (e.g., using language‑specific tokenizers or adding explicit delimiters) can boost performance for low‑resource languages.
  3. Data collection strategy – Investing in curated, high‑quality corpora for under‑represented languages yields outsized gains compared to simply scaling raw web data.
  4. Evaluation pipelines – The benchmark and scripts released by the authors can be integrated into CI/CD for LLM‑powered services, ensuring that updates do not degrade performance in non‑English locales.
  5. Policy & fairness – The findings highlight a hidden bias: “WEIRD” data dominance translates into uneven user experiences. Organizations aiming for inclusive AI should prioritize balanced multilingual training sets.
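Point 4 above, wiring the benchmark into a CI/CD pipeline, could look like the following locale regression gate. The threshold values and nightly scores are illustrative placeholders:

```python
# Sketch of a per-locale regression gate for CI: fail the build if any
# language's benchmark accuracy drops below its agreed floor.
# Thresholds and nightly scores are hypothetical.

THRESHOLDS = {"en": 0.75, "es": 0.78, "ar": 0.60, "am": 0.45}

def failing_locales(scores: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Return locales whose accuracy fell below the configured floor."""
    return sorted(
        lang for lang, floor in thresholds.items()
        if scores.get(lang, 0.0) < floor
    )

nightly = {"en": 0.78, "es": 0.82, "ar": 0.58, "am": 0.52}
regressions = failing_locales(nightly, THRESHOLDS)  # ['ar'] — Arabic slipped
```

Setting per‑language floors, rather than a single global threshold, reflects the paper's point that quality is uneven across locales and should be tracked per locale.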

Limitations & Future Work

  • Model scope – Only three models were examined; newer open‑source multilingual LLMs (e.g., Gemma, LLaVA‑Multilingual) may exhibit different patterns.
  • Task narrowness – The comprehension benchmark focuses on multiple‑choice reading comprehension; other tasks (code generation, reasoning, dialog) might reveal distinct language‑specific strengths or weaknesses.
  • Training‑data estimates – Publicly available statistics on per‑language token counts are approximate, limiting the precision of the data‑size analysis.
  • Human baseline variability – The human participants were not uniformly matched for education or exposure to the test format, which could slightly inflate the human‑model gap.
  • Future directions suggested by the authors include expanding the language set to include more low‑resource and typologically extreme languages (e.g., polysynthetic languages), testing retrieval‑augmented LLMs, and exploring fine‑tuning strategies that explicitly address tokenization and linguistic distance.

Authors

  • Natalia Moskvina
  • Raquel Montero
  • Masaya Yoshida
  • Ferdy Hubers
  • Paolo Morosi
  • Walid Irhaymi
  • Jin Yan
  • Tamara Serrano
  • Elena Pagliarini
  • Fritz Günther
  • Evelina Leivada

Paper Information

  • arXiv ID: 2602.20065v1
  • Categories: cs.CL, cs.AI
  • Published: February 23, 2026