[Paper] Pretrained Multilingual Transformers Reveal Quantitative Distance Between Human Languages
Source: arXiv - 2603.17912v1
Overview
A new study shows that the attention patterns inside large multilingual transformers can be turned into a quantitative yardstick for measuring how far apart human languages are. By treating attention maps as probability distributions and comparing them with optimal‑transport math, the authors create an “Attention Transport Distance” (ATD) that mirrors classic linguistic groupings while also being useful for improving low‑resource machine translation.
Key Contributions
- Attention Transport Distance (ATD): A tokenization‑agnostic metric that derives language distance directly from the attention matrices of pretrained multilingual models.
- Empirical validation: ATD reproduces well‑known language families (e.g., Romance, Slavic) and captures geographic/contact effects that traditional typological tables miss.
- Practical boost for MT: Using ATD as a regularizer during fine‑tuning yields measurable gains on low‑resource translation pairs.
- Open‑source toolkit: The authors release code for extracting attention, computing ATD, and visualizing language‑distance graphs, enabling reproducible research and rapid prototyping.
Methodology
- Model selection: The authors start from publicly available multilingual Transformers (e.g., mBART, mT5) that have already been trained on massive multilingual corpora.
- Attention extraction: For a given source‑target language pair, they feed a set of parallel sentences through the model and collect the attention weight matrices from every head and layer.
- Distribution view: Each attention matrix is normalized to sum to 1, turning it into a discrete probability distribution over token positions.
- Optimal‑transport comparison: The geometric divergence between two languages’ attention distributions is measured with the Wasserstein distance (a.k.a. Earth Mover’s Distance). This yields a single scalar—ATD—that reflects how the model “shifts” attention when translating between the two languages.
- Aggregation: ATD scores are averaged across heads, layers, and sentence batches to obtain a stable language‑pair distance.
- Evaluation pipeline: The resulting distance matrix is fed into clustering and dimensionality‑reduction tools (e.g., hierarchical clustering, t‑SNE) to compare against known linguistic families and to test downstream MT performance.
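The core computation can be sketched in a few lines. This is a deliberately simplified, hypothetical implementation: it flattens each attention map into a 1-D distribution over token positions and uses SciPy's one-dimensional Wasserstein distance, whereas the paper's optimal-transport formulation and aggregation details may differ. The function names and inputs are assumptions for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def attention_to_distribution(attn):
    """Flatten an attention matrix and normalize it to sum to 1,
    giving a discrete distribution over token positions."""
    p = np.asarray(attn, dtype=float).ravel()
    return p / p.sum()

def atd(attn_maps_a, attn_maps_b):
    """Average Wasserstein distance between paired attention maps.

    attn_maps_a / attn_maps_b: lists of (seq, seq) attention matrices,
    one per head/layer, collected from parallel sentences in the two
    languages being compared. Returns a single scalar distance.
    """
    assert len(attn_maps_a) == len(attn_maps_b)
    dists = []
    for A, B in zip(attn_maps_a, attn_maps_b):
        p = attention_to_distribution(A)
        q = attention_to_distribution(B)
        positions = np.arange(p.size)  # support: flattened token positions
        dists.append(wasserstein_distance(positions, positions, p, q))
    # Aggregate across heads/layers/batches by simple averaging.
    return float(np.mean(dists))
```

In practice the maps would come from hooks on a pretrained model's attention layers; here the aggregation is a plain mean, which matches the averaging step described above.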
Results & Findings
- Clustering aligns with typology: Hierarchical clustering of ATD distances groups languages into clusters that closely match established families such as Indo‑European, Afro‑Asiatic, and Austronesian as described in the standard linguistic literature.
- Geographic signal: Languages that are geographically close but belong to different families (e.g., Turkish and Kurdish) show smaller ATD than distant members of the same family, indicating the metric captures contact‑induced convergence.
- Low‑resource MT gains: Adding an ATD‑based regularization term during fine‑tuning improves BLEU scores by 1.2–2.5 points on several low‑resource language pairs (e.g., Swahili↔English, Nepali↔Hindi).
- Robustness to tokenization: Because ATD works on the raw attention matrices, the metric remains stable across different subword vocabularies and even when languages use different scripts.
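The clustering result can be illustrated with a toy example. The distance values below are invented for demonstration (the paper's actual ATD matrix is not reproduced here); the point is that standard hierarchical clustering over such a matrix recovers family-like groupings, here a Romance vs. Slavic split.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric ATD matrix for four languages:
# Spanish, Italian (Romance) and Russian, Polish (Slavic).
langs = ["es", "it", "ru", "pl"]
atd_matrix = np.array([
    [0.00, 0.10, 0.60, 0.62],
    [0.10, 0.00, 0.58, 0.61],
    [0.60, 0.58, 0.00, 0.12],
    [0.62, 0.61, 0.12, 0.00],
])

condensed = squareform(atd_matrix)           # condensed form SciPy expects
Z = linkage(condensed, method="average")     # agglomerative clustering
clusters = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(langs, clusters)))            # es/it vs. ru/pl groups
```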
Practical Implications
- Better language selection for transfer learning: Developers can use ATD to pick the most “similar” high‑resource language when building a new translation system, reducing the need for costly data collection.
- Curriculum design for multilingual models: ATD can guide the ordering of language exposure during multilingual pre‑training, potentially leading to more balanced representations across languages.
- Diagnostic tool for bias: By quantifying how far a model’s internal geometry drifts from a target language, ATD can flag under‑represented languages that may suffer from poorer quality or higher error rates.
- Cross‑lingual retrieval & clustering: ATD can be repurposed for tasks like multilingual document clustering, language‑aware search, or even sociolinguistic studies that need a scalable similarity measure.
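The transfer-learning use case reduces to a nearest-neighbor lookup over a precomputed ATD matrix. The sketch below uses made-up distances and language codes; only the selection logic is the point.

```python
import numpy as np

# Candidate high-resource transfer languages.
candidates = ["en", "de", "hi", "sw"]

# Hypothetical ATD values from a new low-resource target language
# (say, Nepali) to each candidate.
atd_to_candidates = np.array([0.55, 0.50, 0.15, 0.70])

# Pick the candidate with the smallest attention-transport distance.
best = candidates[int(np.argmin(atd_to_candidates))]
print(best)
```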
Limitations & Future Work
- Dependence on pretrained models: ATD inherits any biases present in the underlying multilingual transformer (e.g., over‑representation of English‑centric data).
- Computational cost: Extracting and processing attention matrices for many language pairs is memory‑intensive; the authors suggest sampling strategies but full‑scale deployment still requires significant resources.
- Scope of languages: The experiments focus mainly on languages covered by the pretraining corpora; truly low‑resource or under‑documented languages may lack sufficient attention data for reliable ATD estimates.
- Future directions: Extending ATD to other model families (e.g., encoder‑only models), integrating phonological or morphological features, and exploring dynamic, context‑dependent distance measures are highlighted as promising next steps.
Authors
- Yue Zhao
- Jiatao Gu
- Paloma Jeretič
- Weijie Su
Paper Information
- arXiv ID: 2603.17912v1
- Categories: cs.CL, stat.ML
- Published: March 18, 2026