[Paper] Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss
Source: arXiv - 2604.12911v1
Overview
The paper Round‑Trip Translation Reveals What Frontier Multilingual Benchmarks Miss examines a hidden flaw in today’s multilingual evaluation suites: they largely test a model’s reasoning or factual recall rather than its ability to understand and generate text across languages. The authors introduce round‑trip translation as a lightweight, language‑agnostic probe, show that it measures multilingual competence more faithfully, and release a new benchmark, Lost in Translation (LiT), to stress‑test large language models (LLMs) on real‑world translation tasks.
Key Contributions
- Critical analysis of existing multilingual benchmarks – Demonstrates that popular multilingual reasoning and knowledge tests (e.g., math, factual QA) do not reflect genuine multilingual proficiency.
- Round‑trip translation (RTT) as an evaluation metric – Proposes translating a sentence to another language and back, then measuring semantic drift without any human‑written references.
- Empirical validation – Shows that RTT scores correlate strongly (ρ = 0.94) with human quality ratings on the LMArena multilingual benchmark, a closer match than traditional reasoning‑style tests achieve.
- Lost in Translation (LiT) benchmark – Releases a diverse, large‑scale RTT dataset covering dozens of widely spoken languages, designed to expose subtle multilingual generation failures.
- Open‑source tooling – Provides scripts and evaluation pipelines that can be plugged into any multilingual LLM workflow.
Methodology
- Dataset Construction – Collected natural sentences from web sources in 30+ languages, ensuring a mix of domains (news, social media, technical docs).
- Round‑Trip Process – For each source sentence, the model first translates it to a target language (chosen randomly from the set) and then translates the result back to the original language using the same model.
- Semantic Gap Measurement – The original and back‑translated sentences are compared with a multilingual semantic similarity model (e.g., LASER, multilingual SBERT). The similarity score serves as the RTT metric.
- Correlation Study – Benchmarked several state‑of‑the‑art multilingual LLMs (e.g., GPT‑4‑Turbo, Claude‑2, LLaMA‑2‑70B) on both traditional multilingual reasoning suites and the RTT pipeline, then compared the RTT scores against human quality ratings from LMArena.
- Benchmark Release – The LiT suite bundles the source sentences, target language pairs, and evaluation scripts, enabling reproducible RTT testing.
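The round‑trip and similarity steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' released tooling: `translate` is a hypothetical callable wrapping whatever model is under test, and `bow_cosine` is a toy bag‑of‑words stand‑in for the multilingual encoder (LASER, multilingual SBERT) that the methodology describes.

```python
import math
import random
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical interfaces (assumptions, not part of the paper's code):
# Translate: (text, source_lang, target_lang) -> translated text
# Similarity: (text_a, text_b) -> score in [0, 1]
Translate = Callable[[str, str, str], str]
Similarity = Callable[[str, str], float]


def bow_cosine(a: str, b: str) -> float:
    """Toy stand-in for a multilingual encoder: bag-of-words cosine similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def round_trip_score(
    text: str,
    source_lang: str,
    languages: Sequence[str],
    translate: Translate,
    similarity: Similarity = bow_cosine,
    rng: Optional[random.Random] = None,
) -> float:
    """Translate text to a randomly chosen pivot language and back, then score the drift."""
    rng = rng or random.Random(0)
    pivots = [lang for lang in languages if lang != source_lang]
    if not pivots:
        raise ValueError("need at least one language other than the source")
    pivot = rng.choice(pivots)
    forward = translate(text, source_lang, pivot)  # source -> pivot
    back = translate(forward, pivot, source_lang)  # pivot -> source, same model
    return similarity(text, back)
```

In practice `similarity` would be swapped for an embedding‑based scorer, since word overlap penalizes valid paraphrases that a semantic encoder would accept. No reference translation is needed at any point: the original sentence serves as its own reference.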
Results & Findings
| Model | Traditional multilingual benchmark (avg. accuracy) | RTT similarity (average) | Correlation with LMArena human scores |
|---|---|---|---|
| GPT‑4‑Turbo | 78 % | 0.86 | 0.71 |
| Claude‑2 | 74 % | 0.84 | 0.68 |
| LLaMA‑2‑70B | 62 % | 0.71 | 0.94 |
- Reasoning‑style benchmarks favor “thinking” variants (models tuned for chain‑of‑thought), but those variants often underperform on RTT, indicating a mismatch between what the benchmarks reward and true multilingual ability.
- RTT scores align almost perfectly with human judgments (ρ = 0.94), confirming that semantic drift after a round‑trip is a reliable proxy for multilingual generation quality.
- LiT proves challenging: even the strongest models lose 10‑15 % of semantic similarity on low‑resource language pairs (e.g., Swahili ↔ Vietnamese), highlighting gaps that current training pipelines overlook.
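To make the correlation claim concrete, the sketch below computes a rank correlation between per‑model RTT scores and human ratings. The summary does not say whether ρ denotes Spearman or Pearson correlation; this example assumes Spearman, and all function names are illustrative rather than taken from the LiT tooling.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(rtt_scores, human_ratings)` with one entry per model gives a single coefficient for the whole model set. With only a handful of models the estimate is noisy, so a value like this is best read alongside the number of models compared.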
Practical Implications
- Model developers can adopt RTT as a quick sanity check during fine‑tuning, catching multilingual regressions before costly human evaluations.
- Product teams building multilingual chatbots or documentation generators gain a language‑agnostic metric to monitor translation fidelity across updates.
- Benchmark designers are encouraged to complement reasoning‑heavy tasks with RTT‑style tests, ensuring that “multilingual” claims are grounded in actual cross‑lingual generation performance.
- The open‑source community can use the LiT dataset to benchmark emerging multilingual LLMs (e.g., Mistral‑Multilingual, Gemini‑Pro) without needing expensive human annotation pipelines.
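As one way to use RTT as a sanity check during fine‑tuning, the sketch below compares mean RTT similarity per language pair between a baseline and a candidate checkpoint and flags pairs that regress. The function name, the dictionary layout, and the `max_drop` threshold of 0.02 are all assumptions chosen for illustration; the paper's tooling may organize this differently.

```python
from statistics import mean
from typing import Dict, List


def rtt_regressions(
    baseline: Dict[str, List[float]],
    candidate: Dict[str, List[float]],
    max_drop: float = 0.02,  # hypothetical tolerance; tune per project
) -> Dict[str, float]:
    """Return language pairs whose mean RTT similarity fell by more than max_drop.

    Both inputs map a language-pair label (e.g. "sw-vi") to the list of
    per-sentence RTT similarity scores collected for that pair.
    """
    flagged = {}
    for pair, base_scores in baseline.items():
        new_scores = candidate.get(pair)
        if not base_scores or not new_scores:
            continue  # pair missing or empty in one run; nothing to compare
        drop = mean(base_scores) - mean(new_scores)
        if drop > max_drop:
            flagged[pair] = drop
    return flagged
```

A check like this could run in continuous integration after each fine‑tuning job. An empty result means no pair regressed beyond the tolerance, while a non‑empty one points human evaluation at the specific language pairs that degraded.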
Limitations & Future Work
- Dependence on a semantic similarity model: RTT quality hinges on the robustness of the underlying multilingual encoder; biases in that encoder could affect scores.
- Asymmetric errors: a round trip may mask one‑directional failures (e.g., a model could translate well into a target language but poorly back into the source). The authors suggest adding one‑way translation checks in future iterations.
- Coverage gaps: While LiT spans many high‑resource languages, low‑resource and script‑diverse languages (e.g., Amharic, Khmer) remain under‑represented. Expanding the dataset will be essential for a truly global evaluation.
Bottom line: By shifting the focus from abstract reasoning tasks to concrete round‑trip translation performance, this work offers a pragmatic, scalable yardstick for multilingual LLMs—one that aligns closely with what developers and end‑users actually experience when their models converse across languages.
Authors
- Ronald Skorobogat
- Ameya Prabhu
- Matthias Bethge
Paper Information
- arXiv ID: 2604.12911v1
- Categories: cs.CL, cs.AI
- Published: April 14, 2026