[Paper] Pretrained Multilingual Transformers Reveal Quantitative Distance Between Human Languages
Source: arXiv - 2603.17912v1
Overview
A new study shows that the attention patterns inside large multilingual transformers can be turned into a quantitative yardstick for measuring how far apart human languages are. By treating attention maps as probability distributions and comparing them with optimal‑transport math, the authors create an “Attention Transport Distance” (ATD) that mirrors classic linguistic groupings while also being useful for improving low‑resource machine translation.
Key Contributions
- Attention Transport Distance (ATD): A tokenization‑agnostic metric that derives language distance directly from the attention matrices of pretrained multilingual models.
- Empirical validation: ATD reproduces well‑known language families (e.g., Romance, Slavic) and captures geographic/contact effects that traditional typological tables miss.
- Practical boost for MT: Using ATD as a regularizer during fine‑tuning yields measurable gains on low‑resource translation pairs.
- Open‑source toolkit: The authors release code for extracting attention, computing ATD, and visualizing language‑distance graphs, enabling reproducible research and rapid prototyping.
Methodology
- Model selection: The authors start from publicly available multilingual Transformers (e.g., mBART, mT5) that have already been trained on massive multilingual corpora.
- Attention extraction: For a given source‑target language pair, they feed a set of parallel sentences through the model and collect the attention weight matrices from every head and layer.
- Distribution view: Each attention matrix is normalized to sum to 1, turning it into a discrete probability distribution over token positions.
- Optimal‑transport comparison: The geometric divergence between two languages’ attention distributions is measured with the Wasserstein distance (a.k.a. Earth Mover’s Distance). This yields a single scalar—ATD—that reflects how the model “shifts” attention when translating between the two languages.
- Aggregation: ATD scores are averaged across heads, layers, and sentence batches to obtain a stable language‑pair distance.
- Evaluation pipeline: The resulting distance matrix is fed into clustering and dimensionality‑reduction tools (e.g., hierarchical clustering, t‑SNE) to compare against known linguistic families and to test downstream MT performance.
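The core computation can be sketched in a few lines. This is a deliberately simplified, hypothetical implementation: it flattens each attention map into a 1-D distribution over token positions and uses SciPy's one-dimensional Wasserstein distance, whereas the paper's optimal-transport formulation and aggregation details may differ. The function names and inputs are assumptions for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def attention_to_distribution(attn):
    """Flatten an attention matrix and normalize it to sum to 1,
    giving a discrete distribution over token positions."""
    p = np.asarray(attn, dtype=float).ravel()
    return p / p.sum()

def atd(attn_maps_a, attn_maps_b):
    """Average Wasserstein distance between paired attention maps.

    attn_maps_a / attn_maps_b: lists of (seq, seq) attention matrices,
    one per head/layer, collected from parallel sentences in the two
    languages being compared. Returns a single scalar distance.
    """
    assert len(attn_maps_a) == len(attn_maps_b)
    dists = []
    for A, B in zip(attn_maps_a, attn_maps_b):
        p = attention_to_distribution(A)
        q = attention_to_distribution(B)
        positions = np.arange(p.size)  # support: flattened token positions
        dists.append(wasserstein_distance(positions, positions, p, q))
    # Aggregate across heads/layers/batches by simple averaging.
    return float(np.mean(dists))
```

In practice the maps would come from hooks on a pretrained model's attention layers; here the aggregation is a plain mean, which matches the averaging step described above.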
Results & Findings
- Clustering aligns with typology: Hierarchical clustering of ATD distances groups languages into clusters that closely match established families such as Indo‑European, Afro‑Asiatic, and Austronesian as described in the standard linguistic literature.
- Geographic signal: Languages that are geographically close but belong to different families (e.g., Turkish and Kurdish) show smaller ATD than distant members of the same family, indicating the metric captures contact‑induced convergence.
- Low‑resource MT gains: Adding an ATD‑based regularization term during fine‑tuning improves BLEU scores by 1.2–2.5 points on several low‑resource language pairs (e.g., Swahili↔English, Nepali↔Hindi).
- Robustness to tokenization: Because ATD works on the raw attention matrices, the metric remains stable across different subword vocabularies and even when languages use different scripts.
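The clustering result can be illustrated with a toy example. The distance values below are invented for demonstration (the paper's actual ATD matrix is not reproduced here); the point is that standard hierarchical clustering over such a matrix recovers family-like groupings, here a Romance vs. Slavic split.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric ATD matrix for four languages:
# Spanish, Italian (Romance) and Russian, Polish (Slavic).
langs = ["es", "it", "ru", "pl"]
atd_matrix = np.array([
    [0.00, 0.10, 0.60, 0.62],
    [0.10, 0.00, 0.58, 0.61],
    [0.60, 0.58, 0.00, 0.12],
    [0.62, 0.61, 0.12, 0.00],
])

condensed = squareform(atd_matrix)           # condensed form SciPy expects
Z = linkage(condensed, method="average")     # agglomerative clustering
clusters = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(langs, clusters)))            # es/it vs. ru/pl groups
```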
Practical Implications
- Better language selection for transfer learning: Developers can use ATD to pick the most “similar” high‑resource language when building a new translation system, reducing the need for costly data collection.
- Curriculum design for multilingual models: ATD can guide the ordering of language exposure during multilingual pre‑training, potentially leading to more balanced representations across languages.
- Diagnostic tool for bias: By quantifying how far a model’s internal geometry drifts from a target language, ATD can flag under‑represented languages that may suffer from poorer quality or higher error rates.
- Cross‑lingual retrieval & clustering: ATD can be repurposed for tasks like multilingual document clustering, language‑aware search, or even sociolinguistic studies that need a scalable similarity measure.
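The transfer-learning use case reduces to a nearest-neighbor lookup over a precomputed ATD matrix. The sketch below uses made-up distances and language codes; only the selection logic is the point.

```python
import numpy as np

# Candidate high-resource transfer languages.
candidates = ["en", "de", "hi", "sw"]

# Hypothetical ATD values from a new low-resource target language
# (say, Nepali) to each candidate.
atd_to_candidates = np.array([0.55, 0.50, 0.15, 0.70])

# Pick the candidate with the smallest attention-transport distance.
best = candidates[int(np.argmin(atd_to_candidates))]
print(best)
```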
Limitations & Future Work
- Dependence on pretrained models: ATD inherits any biases present in the underlying multilingual transformer (e.g., over‑representation of English‑centric data).
- Computational cost: Extracting and processing attention matrices for many language pairs is memory‑intensive; the authors suggest sampling strategies but full‑scale deployment still requires significant resources.
- Scope of languages: The experiments focus mainly on languages covered by the pretraining corpora; truly low‑resource or under‑documented languages may lack sufficient attention data for reliable ATD estimates.
- Future directions: Extending ATD to other model families (e.g., encoder‑only models), integrating phonological or morphological features, and exploring dynamic, context‑dependent distance measures are highlighted as promising next steps.
Authors
- Yue Zhao
- Jiatao Gu
- Paloma Jeretič
- Weijie Su
Paper Information
- arXiv ID: 2603.17912v1
- Categories: cs.CL, stat.ML
- Published: March 18, 2026