[Paper] Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Published: February 25, 2026 at 01:58 PM EST
5 min read
Source: arXiv

Overview

Multilingual evaluation of large language models (LLMs) has been hampered by low‑quality translations of benchmark datasets, which often introduce semantic drift and strip away task‑specific context. Yukhymenko et al. propose a fully automated pipeline that produces high‑fidelity translations while preserving the original structure of the tasks. By integrating two test‑time compute scaling techniques, Universal Self‑Improvement (USI) and a new multi‑round ranking method called T‑RANK, the authors achieve translations that are demonstrably better than existing resources, enabling more trustworthy multilingual LLM assessments.

Key Contributions

  • End‑to‑end automated translation framework for benchmarks and datasets, eliminating the need for manual post‑editing.
  • Universal Self‑Improvement (USI) adaptation for translation: a test‑time scaling technique that iteratively refines outputs without retraining the model.
  • T‑RANK, a novel multi‑round ranking algorithm that selects the most semantically faithful translation from a pool of candidates.
  • Large‑scale multilingual rollout: translation of popular benchmarks into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek).
  • Comprehensive evaluation using both reference‑based metrics (BLEU, COMET) and LLM‑as‑a‑judge assessments, showing consistent gains over prior translation resources.
  • Open‑source release of both the pipeline code and the newly translated benchmark suites.

Methodology

  1. Dataset Ingestion – Original English benchmarks are parsed to extract prompts, inputs, and expected outputs while preserving task metadata (e.g., multiple‑choice options, code snippets).
  2. Candidate Generation – A strong multilingual LLM (e.g., GPT‑4‑Turbo) generates N translation candidates per item under a high compute budget.
  3. Universal Self‑Improvement (USI) – At test time, the model re‑evaluates each candidate with a larger context window and higher temperature sampling, producing refined versions without fine‑tuning.
  4. T‑RANK Multi‑Round Ranking
    • Round 1: A lightweight scorer (trained on a small parallel corpus) filters out low‑quality candidates.
    • Round 2: The remaining candidates are re‑scored using a larger LLM that judges semantic fidelity, style, and task‑preserving properties.
    • Final Selection: The top‑ranked translation is kept; the rest are discarded.
  5. Post‑Processing & Validation – Simple rule‑based checks ensure format compliance (e.g., JSON schema, code syntax). The pipeline then outputs a ready‑to‑use localized benchmark.
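The rule‑based validation in step 5 can be as simple as checking that a translated item still parses and retains its required fields. A minimal sketch, assuming a multiple‑choice item stored as JSON (the field names `prompt`, `options`, and `answer` are illustrative, not the paper's actual schema):

```python
import json

REQUIRED_KEYS = {"prompt", "options", "answer"}  # illustrative schema

def passes_format_checks(raw: str) -> bool:
    """Return True if a translated benchmark item is structurally intact."""
    try:
        item = json.loads(raw)  # must still be valid JSON after translation
    except json.JSONDecodeError:
        return False
    if not REQUIRED_KEYS <= item.keys():  # no fields lost in translation
        return False
    # Multiple-choice items must keep the gold answer among the options.
    return item["answer"] in item["options"]
```

Items that fail such checks can be sent back through candidate generation rather than silently shipped in a broken format.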

All steps are orchestrated via a modular Python library, making it easy to plug in different LLM back‑ends or ranking models.
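The generate‑refine‑rank loop above can be sketched per item as follows. This is a hedged illustration, not the released library: the callables `generate_candidates`, `usi_refine`, `light_score`, and `judge_score` stand in for the multilingual LLM, the USI refinement step, and the two T‑RANK scorers, whose real interfaces the summary does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    text: str
    score: float = 0.0

def translate_item(source: str,
                   generate_candidates: Callable[[str, int], List[str]],
                   usi_refine: Callable[[str, str], str],
                   light_score: Callable[[str, str], float],
                   judge_score: Callable[[str, str], float],
                   n_candidates: int = 8,
                   keep_after_round1: int = 3) -> str:
    # Step 2 -- candidate generation: sample N translations from the LLM.
    pool = [Candidate(t) for t in generate_candidates(source, n_candidates)]

    # Step 3 -- USI: refine each candidate at test time, no fine-tuning.
    pool = [Candidate(usi_refine(source, c.text)) for c in pool]

    # Step 4, round 1 -- a lightweight scorer filters low-quality candidates.
    for c in pool:
        c.score = light_score(source, c.text)
    pool = sorted(pool, key=lambda c: c.score, reverse=True)[:keep_after_round1]

    # Step 4, round 2 -- a stronger LLM judge re-scores the survivors.
    for c in pool:
        c.score = judge_score(source, c.text)

    # Final selection: keep only the top-ranked translation.
    return max(pool, key=lambda c: c.score).text
```

Because every component is passed in as a callable, swapping in a different LLM back‑end or ranking model only changes the arguments, which matches the modular design described above.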

Results & Findings

| Language | BLEU ↑ | COMET ↑ | LLM‑as‑Judge Preference |
|---|---|---|---|
| Ukrainian | 38.2 → 44.7 | 0.71 → 0.84 | 68 % vs. 32 % (baseline) |
| Turkish | 35.6 → 42.1 | 0.68 → 0.80 | 71 % vs. 29 % |
| Greek | 36.9 → 43.3 | 0.70 → 0.82 | 66 % vs. 34 % |
| … (others) | similar gains | similar gains | consistent majority preference |

  • Semantic drift reduced: Human evaluators reported a 45 % drop in meaning‑altering errors compared to the previous state‑of‑the‑art translations.
  • Task structure preserved: For code‑generation and multiple‑choice tasks, the pipeline maintained exact answer formats >98 % of the time, whereas baseline translations broke the format in ~7 % of cases.
  • Downstream impact: When evaluating a multilingual LLM on the newly translated benchmarks, performance gaps between English and target languages narrowed by an average of 12 percentage points, indicating a more faithful measurement of the model’s true capabilities.

Practical Implications

  • More reliable multilingual benchmarking: Developers can now compare LLMs across languages without worrying that translation artifacts are inflating or deflating scores.
  • Rapid localization of new datasets: The framework can be hooked into CI/CD pipelines to auto‑translate emerging benchmarks (e.g., new coding challenges, safety tests) as soon as they are released.
  • Cost‑effective scaling: By leveraging test‑time compute scaling (USI) instead of full model fine‑tuning, organizations can achieve high‑quality translations with modest GPU budgets.
  • Improved product QA: Companies building multilingual AI assistants can use the translated benchmarks to stress‑test language‑specific edge cases before launch.
  • Open‑source community boost: The released codebase invites contributions (e.g., adding support for additional languages or domain‑specific vocabularies), fostering a shared ecosystem for multilingual evaluation.

Limitations & Future Work

  • Language coverage: The current release focuses on eight Eastern/Southern European languages; low‑resource languages with scarce parallel data may still suffer from quality gaps.
  • Compute overhead: USI and multi‑round ranking increase inference time per item, which could be prohibitive for extremely large corpora without batching optimizations.
  • Domain specificity: Benchmarks with highly technical jargon (e.g., medical or legal) were not explicitly tested; future work should evaluate domain‑adapted ranking models.
  • Human‑in‑the‑loop refinement: While fully automated, a lightweight human verification step could catch rare edge‑case errors; integrating such a step is an open research direction.

The authors plan to extend the framework to cover more languages, explore adaptive compute budgeting (spending more cycles only on ambiguous items), and open up a leaderboard for community‑submitted translation quality metrics.

Authors

  • Hanna Yukhymenko
  • Anton Alexandrov
  • Martin Vechev

Paper Information

  • arXiv ID: 2602.22207v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: February 25, 2026