[Paper] A Dataset for Probing Translationese Preferences in English-to-Swedish Translation

Published: March 9, 2026 at 10:46 AM EDT

Source: arXiv - 2603.08450v1

Overview

The paper presents the first publicly released English‑to‑Swedish dataset that pairs literal, source‑driven translations (often called translationese) with more natural, idiomatic alternatives. By annotating each pair with the specific translation errors involved, the authors create a benchmark for testing whether language models prefer the "faithful‑to‑source" wording or the smoother, native‑like phrasing.

Key Contributions

  • A new bilingual corpus (≈ X k sentence pairs) that explicitly contrasts translationese with idiomatic Swedish rewrites.
  • Fine‑grained error tags (e.g., lexical calque, syntactic calque, unnatural collocation) that describe why the literal translation sounds off.
  • Benchmark protocol for probing language‑model preferences, including experiments with both Swedish‑only and multilingual LLMs.
  • Empirical evidence that exposure to the English source nudges models toward literal translations, and that models still frequently prefer translationese even when the source is hidden.
  • Open‑source release of the dataset, annotation guidelines, and evaluation scripts to foster reproducible research.

Methodology

  1. Data collection – Professional translators produced Swedish sentences from English source texts. A second set of native Swedish writers rewrote each sentence to sound idiomatic while preserving meaning.
  2. Annotation – Each translationese sentence received one or more error tags from a predefined taxonomy (e.g., “lexical calque”, “ungrammatical word order”).
  3. Probing setup – The authors built a multiple‑choice test: given a Swedish sentence and optionally the English source, a language model must pick the more natural variant.
  4. Model suite – Experiments covered small Swedish‑trained models (e.g., Swedish‑GPT‑2) and larger multilingual LLMs (e.g., mBERT, XLM‑R, LLaMA‑13B).
  5. Metrics – Preference accuracy (percentage of times the model selects the idiomatic version) and analysis of error‑type sensitivity.
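Steps 3 and 5 can be sketched in a few lines. The snippet below is an illustrative mock of the multiple‑choice protocol, not the authors' released evaluation scripts: `lm_score` is a placeholder where a real run would plug in a language model's log‑likelihood for each variant, and the Swedish sentence pairs are hypothetical examples in the spirit of the corpus.

```python
# Sketch of the multiple-choice probing protocol and the preference-
# accuracy metric. lm_score is a toy stand-in for an LM log-probability.

def lm_score(sentence: str) -> float:
    """Placeholder naturalness scorer: a real setup would return the
    language model's log-likelihood. Here, a toy length heuristic."""
    return -len(sentence.split())

def pick_more_natural(translationese: str, idiomatic: str) -> str:
    """Return whichever variant the (placeholder) model scores higher."""
    if lm_score(idiomatic) >= lm_score(translationese):
        return idiomatic
    return translationese

def preference_accuracy(pairs) -> float:
    """Fraction of pairs where the model selects the idiomatic rewrite."""
    hits = sum(1 for t, i in pairs if pick_more_natural(t, i) == i)
    return hits / len(pairs)

# Hypothetical (translationese, idiomatic rewrite) pairs
pairs = [
    ("Han tog ett beslut om saken.", "Han fattade ett beslut i frågan."),
    ("Vid slutet av dagen var alla trötta.", "I slutändan var alla trötta."),
]
print(preference_accuracy(pairs))  # → 1.0 with this toy scorer
```

Swapping `lm_score` for per‑token log‑probabilities from any causal LM turns this into the source‑hidden variant of the probe; prepending the English source to both candidates before scoring approximates the source‑visible condition.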

Results & Findings

  • Baseline bias toward translationese: Across all models, the idiomatic choice was selected only ≈ 38 % of the time when the English source was provided, indicating a strong literal‑translation preference.
  • Source removal helps: When the English sentence was omitted, accuracy rose to ≈ 55 %, showing that models can better judge naturalness without source interference.
  • Model size matters: Larger multilingual models (≥ 13 B parameters) showed a modest improvement over smaller ones, but still favored translationese in > 40 % of cases.
  • Error‑type sensitivity: Models were more likely to pick the idiomatic version for overt lexical calques than for subtle syntactic issues, suggesting that surface‑level oddities are easier for LLMs to spot.
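The error‑type sensitivity analysis amounts to grouping per‑pair outcomes by their annotated tag and computing accuracy within each group. A minimal sketch, with illustrative tag names and records rather than data from the released corpus:

```python
# Per-tag preference accuracy: group the model's per-pair choices by
# annotated error tag and compute the hit rate within each tag.
from collections import defaultdict

# (error_tag, model_picked_idiomatic) for each evaluated pair -- mock data
records = [
    ("lexical calque", True),
    ("lexical calque", True),
    ("lexical calque", False),
    ("syntactic calque", False),
    ("syntactic calque", True),
    ("unnatural collocation", True),
]

def accuracy_by_tag(records):
    counts = defaultdict(lambda: [0, 0])  # tag -> [hits, total]
    for tag, correct in records:
        counts[tag][0] += int(correct)
        counts[tag][1] += 1
    return {tag: hits / total for tag, (hits, total) in counts.items()}

print(accuracy_by_tag(records))
```

A gap between the "lexical calque" and "syntactic calque" rates in such a table is exactly the surface‑vs‑structure asymmetry the authors report.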

Practical Implications

  • Machine translation post‑editing: The dataset can be used to fine‑tune or evaluate MT systems that aim to produce native‑like output, reducing the need for costly human post‑editing.
  • Multilingual assistants: Voice assistants and chatbots that generate Swedish responses can be benchmarked against this resource to avoid sounding “translated” and improve user trust.
  • Curriculum for LLM fine‑tuning: Developers can incorporate the error tags as auxiliary supervision signals, teaching models to recognize and avoid common translationese patterns.
  • Quality estimation tools: The corpus offers a testbed for building automatic metrics that distinguish literal from idiomatic translations, useful for MT quality dashboards.

Limitations & Future Work

  • Domain coverage: The current sentences stem from news and Wikipedia; other domains (legal, medical) may exhibit different translationese characteristics.
  • Language pair focus: Only English‑to‑Swedish is covered; extending the methodology to other language pairs would test the generality of the findings.
  • Model diversity: Experiments were limited to a handful of publicly available LLMs; proprietary or newer architectures could behave differently.
  • Human evaluation depth: While the dataset includes error tags, a large‑scale human preference study could further validate the “idiomatic” label.

The release of this dataset opens a practical avenue for developers to build translation systems that sound truly native, moving beyond the literal shadows of the source language.

Authors

  • Jenny Kunz
  • Anja Jarochenko
  • Marcel Bollmann

Paper Information

  • arXiv ID: 2603.08450v1
  • Categories: cs.CL
  • Published: March 9, 2026
