[Paper] Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions

Published: February 5, 2026 at 12:44 PM EST
4 min read
Source: arXiv


Overview

This paper asks a surprisingly concrete question: Do multilingual large language models (LLMs) give the same value‑laden answers regardless of the language they’re asked in? By probing dozens of LLMs with human‑translated, culturally‑neutral multiple‑choice questions in eight European languages, the authors uncover when models behave like true “polyglots” (consistent across languages) and when they act like a collection of monolingual models with divergent value systems.

Key Contributions

  • MEVS dataset – a publicly released Multilingual European Value Survey containing human‑translated, aligned MCQs in English, French, German, Spanish, Italian, Dutch, Portuguese, and Polish.
  • Large‑scale multilingual evaluation – over 30 multilingual LLMs (varying in size, architecture, and alignment strategy) are tested on a controlled subset of MEVS.
  • Systematic prompt engineering – the study varies answer order, bullet symbols, and trailing characters to isolate prompt‑sensitivity effects.
  • Consistency metrics – introduces quantitative measures for intra‑model (same model, different languages) and inter‑model (different models, same language) agreement on value‑laden MCQs.
  • Empirical insight – shows that instruction‑tuned, larger models are generally more consistent, yet language‑specific divergences still appear on a non‑trivial subset of questions.

Methodology

  1. Corpus Construction

    • Selected a set of value‑oriented questions from the European Values Survey.
    • Employed professional translators to produce parallel versions in eight languages, avoiding the noise of automatic translation.
  2. Model Suite

    • Included open‑source LLM families (e.g., LLaMA, Mistral, BLOOM) and commercial APIs (e.g., GPT‑4, Claude).
    • Covered three size brackets: small (≈1–3 B parameters), medium (≈7–13 B), and large (≥30 B).
  3. Prompt Design

    • Each MCQ was presented with four answer options (A–D).
    • For each language, the authors generated multiple prompt variants:
      • Answer order: original vs. shuffled.
      • Symbol type: “A)”, “①”, “-”.
      • Tail character: period, question mark, or none.
  4. Evaluation Procedure

    • Ran each model on every prompt variant and recorded the selected option.
    • Computed intra‑model consistency (same model, different languages) and inter‑model consistency (different models, same language).
    • Performed statistical analyses to identify questions with high vs. low agreement.
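The two agreement measures from the evaluation procedure could be computed along these lines. This is a minimal sketch with made‑up data structures (the paper's exact metric definitions are not reproduced in this summary); it treats consistency as the fraction of pairwise matches on shared questions:

```python
from itertools import combinations

def intra_model_consistency(answers: dict[str, dict[str, str]]) -> float:
    """Pairwise agreement of one model's answers across languages.

    `answers` maps language -> {question_id: chosen_option}.
    Returns the fraction of (language pair, question) combinations
    on which the model selected the same option.
    """
    agree = total = 0
    for lang_a, lang_b in combinations(sorted(answers), 2):
        # Only compare questions answered in both languages.
        for q in answers[lang_a].keys() & answers[lang_b].keys():
            agree += answers[lang_a][q] == answers[lang_b][q]
            total += 1
    return agree / total if total else 0.0

def inter_model_consistency(answers: dict[str, dict[str, dict[str, str]]],
                            language: str) -> float:
    """Pairwise agreement between different models within one language.

    `answers` maps model -> language -> {question_id: chosen_option}.
    """
    agree = total = 0
    for m_a, m_b in combinations(sorted(answers), 2):
        shared = answers[m_a][language].keys() & answers[m_b][language].keys()
        for q in shared:
            agree += answers[m_a][language][q] == answers[m_b][language][q]
            total += 1
    return agree / total if total else 0.0
```

For example, a model answering `{"q1": "A", "q2": "B"}` in English but `{"q1": "A", "q2": "C"}` in French scores an intra‑model consistency of 0.5.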

Results & Findings

| Aspect | What the numbers say |
|---|---|
| Overall consistency | Instruction‑tuned, larger models achieve ~85 % intra‑model agreement across languages, versus ~60 % for non‑tuned or smaller models. |
| Question‑level variance | ~30 % of MCQs yield perfect agreement (all models pick the same answer in every language); the remaining questions split roughly 55 %/45 % or even 70 %/30 % across answer choices. |
| Language‑specific drift | Even the most consistent models show systematic shifts on certain items (e.g., “Should the state intervene in the economy?”), where French‑language prompts lean more toward a “government role” than English ones. |
| Prompt robustness | Shuffling answer order or changing bullet symbols rarely changes the selected answer (<5 % impact), but adding or removing a trailing period can flip responses on borderline questions. |
| Effect of fine‑tuning | Preference‑fine‑tuned models (e.g., RLHF‑aligned) exhibit selective language effects: they stay consistent on factual items but diverge on normative questions. |

In short, multilingual LLMs are not perfect polyglots. Their “values” can be subtly nudged by the language of the prompt, especially on culturally loaded topics.

Practical Implications

  • Product Localization – Companies deploying LLM‑powered chatbots or decision‑support tools should not assume that a model’s ethical stance stays constant across locales. A policy that feels “neutral” in English might be interpreted differently in German or Italian.
  • Compliance & Auditing – Regulators evaluating AI for bias or value alignment need multilingual test suites (like MEVS) to catch language‑specific deviations before certification.
  • Prompt Engineering – Minor punctuation choices can affect outcomes on sensitive questions; standardizing prompt templates per language can improve reliability.
  • Model Selection – For applications where value consistency matters (e.g., HR screening, content moderation), opting for larger, instruction‑tuned models reduces but does not eliminate language‑drift risk.
  • Fine‑tuning Strategies – The selective effect of preference fine‑tuning suggests that targeted multilingual alignment (e.g., value‑preserving RLHF across languages) could be a fruitful research direction for industry.

Limitations & Future Work

  • Scope of Languages – The study focuses on eight European languages; results may differ for non‑Indo‑European languages with distinct cultural frames.
  • Question Set Size – Only a subset of the full MEVS questionnaire was used; broader coverage could reveal additional patterns.
  • Model Diversity – While 30+ models were tested, the rapidly evolving landscape (e.g., emerging multimodal LLMs) may exhibit different behaviors.
  • Human Baseline – The paper does not compare model variance to human respondents across languages, leaving open the question of whether observed drifts are larger or smaller than natural cultural variation.
  • Fine‑tuning Granularity – Future work could explore language‑aware RLHF pipelines that explicitly penalize cross‑lingual value divergence.

Bottom line: Multilingual LLMs are getting better at staying on message across languages, but they’re not yet the “one‑model‑fits‑all” polyglots we might hope for. Developers building globally‑deployed AI should test and, if needed, fine‑tune models in each target language to ensure consistent, value‑aligned behavior.

Authors

  • Léo Labat
  • Etienne Ollion
  • François Yvon

Paper Information

  • arXiv ID: 2602.05932v1
  • Categories: cs.CL
  • Published: February 5, 2026