[Paper] Multilingual Large Language Models do not comprehend all natural languages to equal degrees

Published: February 23, 2026

Source: arXiv - 2602.20065v1

Overview

A recent study probes how well three popular multilingual large language models (LLMs) actually understand a diverse set of natural languages. By testing the models on a language‑comprehension benchmark covering 12 typologically distinct languages, the authors reveal that performance varies widely—and surprisingly, English is not the strongest language for any of the models.

Key Contributions

  • Cross‑lingual evaluation of three state‑of‑the‑art multilingual LLMs on a unified comprehension task spanning Indo‑European, Afro‑Asiatic, Turkic, Sino‑Tibetan, and Japonic families.
  • Human baseline comparison, showing that all models lag behind native speakers, but the gap differs per language.
  • Counter‑intuitive finding that several Romance languages (including lower‑resource ones) consistently outperform English.
  • Systematic analysis of factors influencing performance: tokenization granularity, linguistic distance from English/Spanish, volume and provenance of training data, and the WEIRD (Western, Educated, Industrialized, Rich, Democratic) vs. non‑WEIRD data split.
  • Open‑source benchmark and prompting scripts to enable reproducibility and future extensions.

Methodology

  1. Task selection – The authors used a language‑comprehension benchmark that asks models to answer multiple‑choice questions based on short passages (e.g., “Which sentence best continues the story?”). The task is language‑agnostic and measures pure understanding rather than generation.
  2. Model suite – Three widely used multilingual LLMs were evaluated:
    • LLaMA‑2‑13B‑Chat (open‑source)
    • Mistral‑7B‑Instruct (open‑source)
    • GPT‑4‑Turbo (closed‑source, accessed via API)
  3. Prompt engineering – A zero‑shot prompting template was crafted in each target language, keeping the wording identical across languages to avoid bias.
  4. Languages – Twelve languages were chosen to represent five language families and a spectrum of resource levels: English, Spanish, French, Italian, Portuguese, Arabic, Turkish, Mandarin, Japanese, Amharic, Kurdish, and Basque.
  5. Human baseline – Native speakers answered the same questions, providing an upper‑bound for performance.
  6. Analysis – Accuracy scores were correlated with metadata such as token‑vocab size per language, amount of pre‑training data (estimated from public corpora), and linguistic distance metrics (Levenshtein distance, typological features).
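The evaluation loop described in steps 1–5 can be sketched as follows. The prompt template, option labels, and scoring function are illustrative assumptions, not the authors' released scripts:

```python
# Sketch of a zero-shot multiple-choice evaluation loop like the one
# described above. Prompt wording and scoring are hypothetical.

def build_prompt(passage: str, question: str, options: list[str]) -> str:
    """Render a zero-shot multiple-choice prompt (identical wording per language)."""
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return f"{passage}\n\n{question}\n{labeled}\nAnswer:"

def score(predictions: list[str], gold: list[str]) -> float:
    """Accuracy over answer letters ('A', 'B', ...)."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

prompt = build_prompt(
    "Maria opened the door.",
    "Which sentence best continues the story?",
    ["She stepped inside.", "The moon is made of cheese."],
)
accuracy = score(["A", "B", "A"], ["A", "A", "A"])  # 2 of 3 correct
```

Because the task is multiple‑choice, accuracy is directly comparable across languages and against the human baseline, which is what makes the cross‑lingual analysis possible.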

Results & Findings

| Language   | GPT‑4‑Turbo | LLaMA‑2‑13B‑Chat | Mistral‑7B‑Instruct | Human Baseline |
|------------|-------------|------------------|---------------------|----------------|
| English    | 78 %        | 71 %             | 69 %                | 96 %           |
| Spanish    | 82 %        | 75 %             | 73 %                | 97 %           |
| French     | 80 %        | 73 %             | 71 %                | 96 %           |
| Italian    | 79 %        | 72 %             | 70 %                | 95 %           |
| Portuguese | 78 %        | 71 %             | 69 %                | 95 %           |
| Arabic     | 65 %        | 58 %             | 56 %                | 92 %           |
| Turkish    | 63 %        | 55 %             | 53 %                | 90 %           |
| Mandarin   | 60 %        | 52 %             | 50 %                | 93 %           |
| Japanese   | 58 %        | 51 %             | 49 %                | 94 %           |
| Amharic    | 52 %        | 44 %             | 42 %                | 88 %           |
| Kurdish    | 55 %        | 47 %             | 45 %                | 89 %           |
| Basque     | 57 %        | 49 %             | 47 %                | 90 %           |

Key take‑aways

  • Romance languages consistently match or beat English across all three models, with Spanish leading the pack.
  • Performance correlates strongly with token‑vocab coverage: languages with richer subword tokenization (e.g., Spanish) achieve higher accuracy.
  • Training‑data volume matters, but the relationship is not linear; a modest amount of high‑quality data (as in many Romance languages) can outweigh larger but noisier corpora.
  • Linguistic distance from English/Spanish explains part of the variance—languages that share morphology or word order with the model’s dominant training languages fare better.
  • All models lag behind humans, confirming that current multilingual LLMs are still far from true comprehension.
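The correlation claim above can be made concrete with a rank correlation between a tokenizer‑coverage proxy and per‑model accuracy. The tokens‑per‑word figures below are illustrative placeholders, not the paper's measurements; the accuracy values are the GPT‑4‑Turbo scores from the table:

```python
# Minimal sketch of the correlation analysis: Spearman's rho between a
# tokenizer-coverage proxy (tokens per word; lower = better coverage)
# and benchmark accuracy. Coverage numbers are illustrative.

def rank(values: list[float]) -> list[float]:
    """1-based ranks; ties broken by input order (fine for distinct values)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho = Pearson correlation computed on the ranks."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical tokens-per-word for es, en, tr, zh, am vs. GPT-4-Turbo accuracy
tokens_per_word = [1.3, 1.2, 2.1, 2.4, 3.0]
accuracy = [82.0, 78.0, 63.0, 60.0, 52.0]
rho = spearman(tokens_per_word, accuracy)  # strongly negative
```

A strongly negative rho here would mean that languages whose words fragment into more subword tokens score lower, matching the paper's tokenization‑granularity finding.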

Practical Implications

  1. Product localization – Companies relying on LLMs for multilingual chatbots or content generation should not assume English‑level quality for all languages. Romance‑language markets may already be ready for near‑production use, while Arabic, Mandarin, or Amharic may need additional post‑processing or human‑in‑the‑loop safeguards.
  2. Prompt design – Tokenization‑aware prompting (e.g., using language‑specific tokenizers or adding explicit delimiters) can boost performance for low‑resource languages.
  3. Data collection strategy – Investing in curated, high‑quality corpora for under‑represented languages yields outsized gains compared to simply scaling raw web data.
  4. Evaluation pipelines – The benchmark and scripts released by the authors can be integrated into CI/CD for LLM‑powered services, ensuring that updates do not degrade performance in non‑English locales.
  5. Policy & fairness – The findings highlight a hidden bias: “WEIRD” data dominance translates into uneven user experiences. Organizations aiming for inclusive AI should prioritize balanced multilingual training sets.
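Point 4 above, wiring the benchmark into a CI/CD pipeline, could look like the following locale regression gate. The threshold values and nightly scores are illustrative placeholders:

```python
# Sketch of a per-locale regression gate for CI: fail the build if any
# language's benchmark accuracy drops below its agreed floor.
# Thresholds and nightly scores are hypothetical.

THRESHOLDS = {"en": 0.75, "es": 0.78, "ar": 0.60, "am": 0.45}

def failing_locales(scores: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Return locales whose accuracy fell below the configured floor."""
    return sorted(
        lang for lang, floor in thresholds.items()
        if scores.get(lang, 0.0) < floor
    )

nightly = {"en": 0.78, "es": 0.82, "ar": 0.58, "am": 0.52}
regressions = failing_locales(nightly, THRESHOLDS)  # ['ar'] — Arabic slipped
```

Setting per‑language floors, rather than a single global threshold, reflects the paper's point that quality is uneven across locales and should be tracked per locale.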

Limitations & Future Work

  • Model scope – Only three models were examined; newer open‑source multilingual LLMs (e.g., Gemma, LLaVA‑Multilingual) may exhibit different patterns.
  • Task narrowness – The comprehension benchmark focuses on multiple‑choice reading comprehension; other tasks (code generation, reasoning, dialog) might reveal distinct language‑specific strengths or weaknesses.
  • Training‑data estimates – Publicly available statistics on per‑language token counts are approximate, limiting the precision of the data‑size analysis.
  • Human baseline variability – The human participants were not uniformly matched for education or exposure to the test format, which could slightly inflate the human‑model gap.
  • Future directions suggested by the authors include expanding the language set to include more low‑resource and typologically extreme languages (e.g., polysynthetic languages), testing retrieval‑augmented LLMs, and exploring fine‑tuning strategies that explicitly address tokenization and linguistic distance.

Authors

  • Natalia Moskvina
  • Raquel Montero
  • Masaya Yoshida
  • Ferdy Hubers
  • Paolo Morosi
  • Walid Irhaymi
  • Jin Yan
  • Tamara Serrano
  • Elena Pagliarini
  • Fritz Günther
  • Evelina Leivada

Paper Information

  • arXiv ID: 2602.20065v1
  • Categories: cs.CL, cs.AI
  • Published: February 23, 2026