[Paper] Form and Meaning in Intrinsic Multilingual Evaluations

Published: January 15, 2026 at 11:53 AM EST
5 min read
Source: arXiv - 2601.10580v1

Overview

The paper Form and Meaning in Intrinsic Multilingual Evaluations takes a hard look at the most common “intrinsic” metrics—perplexity, bits‑per‑character, and their cousins—that researchers use to judge conditional language models (CLMs). These numbers are easy to compute and compare within a single language, but the authors show that the assumptions behind them break down when models are evaluated across multiple languages on parallel data. In short, a lower perplexity on a French sentence does not mean the model handles its meaning any better than it handles the same meaning in a higher‑perplexity English translation.

Key Contributions

  • Explicitly surface hidden assumptions behind multilingual perplexity‑based evaluation (e.g., that parallel sentences share identical semantic content).
  • Present a systematic empirical study of six intrinsic metrics across two large multi‑parallel corpora (Europarl and JRC-Acquis), using both monolingual and multilingual CLMs.
  • Demonstrate non‑universality: metric scores are not directly comparable across languages or model families.
  • Link the findings to the “form‑meaning” debate in linguistics, offering a conceptual lens for why information‑theoretic metrics diverge in multilingual settings.
  • Provide practical recommendations for researchers and engineers on when (and when not) to rely on standard intrinsic metrics for multilingual model assessment.

Methodology

  1. Datasets – The authors selected two well‑known multi‑parallel corpora:

    • Europarl (European Parliament proceedings) covering 21 languages.
    • JRC-Acquis (EU legal texts) covering 23 languages.
      These corpora contain sentence‑aligned translations, which makes them an ideal test bed for the “same meaning, different form” hypothesis.
  2. Models – Four model families were evaluated, crossing two training regimes with two architectures:

    • Monolingual CLMs (one model per language).
    • Multilingual CLMs (single model trained on all languages).
    • Both autoregressive (e.g., GPT‑style) and seq2seq (e.g., T5‑style) architectures were included to see if architecture matters.
  3. Metrics – Six intrinsic metrics were computed on the parallel sentences (a minimal computation sketch follows this list):

    • Perplexity (PPL)
    • Bits‑per‑character (BPC)
    • Negative log‑likelihood (NLL)
    • Token‑level cross‑entropy
    • Normalized sequence‑level entropy
    • A recently proposed semantic‑aware perplexity, which weights tokens by multilingual embedding similarity.
  4. Experimental Procedure – For each language pair, the same set of parallel sentences was fed to the relevant model(s). The authors then compared metric values across languages, across model types, and across metrics, looking for systematic patterns or divergences.

  5. Analysis Framework – The results were interpreted through the lens of information theory (bits = information content) and linguistic form‑meaning theory (the idea that surface form and underlying meaning can diverge across languages).
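
To make the relationships between these quantities concrete, here is a minimal sketch in Python, assuming a hypothetical token_logprobs() helper that returns per‑token natural‑log probabilities from whichever CLM is being scored; the paper's actual scoring code is not reproduced here.

```python
import math

def token_logprobs(model, sentence):
    """Hypothetical helper: return one natural-log probability per token
    of `sentence` under `model` (e.g., by summing log-softmax scores of
    the gold tokens). Stand-in for whatever scoring API the model exposes."""
    raise NotImplementedError

def intrinsic_scores(model, sentence):
    logprobs = token_logprobs(model, sentence)   # natural log, one value per token
    nll = -sum(logprobs)                         # total negative log-likelihood (nats)
    n_tokens = len(logprobs)
    n_chars = len(sentence)                      # character count is what BPC normalizes by

    cross_entropy = nll / n_tokens               # token-level cross-entropy (nats per token)
    ppl = math.exp(cross_entropy)                # perplexity
    bpc = nll / (math.log(2) * n_chars)          # bits per character

    return {"nll": nll, "cross_entropy": cross_entropy, "ppl": ppl, "bpc": bpc}
```

The two normalizers are the crux: perplexity divides by the token count (so it inherits every quirk of the tokenizer), while BPC divides by the character count (so it inherits script and orthography differences). Both choices are harmless within one language and become problematic the moment scores are compared across languages.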

Results & Findings

| Metric | Monolingual vs. multilingual (same language) | Cross‑language comparability | Semantic‑aware PPL |
| --- | --- | --- | --- |
| Perplexity | Multilingual models generally show higher PPL (worse) than monolingual ones, but the gap varies wildly per language. | No consistent ordering; e.g., French PPL < German PPL for one model, but the reverse for another. | Correlates better with human semantic similarity scores, but still not fully comparable across languages. |
| BPC | Similar trends to PPL; highly sensitive to tokenization differences. | Inconsistent across scripts (Latin vs. Cyrillic). | Improves alignment but still penalizes languages with richer morphology. |
| NLL / cross‑entropy | Mirrors PPL patterns; differences amplified for low‑resource languages. | Large variance; low‑resource languages often appear “easier” (lower NLL) simply because vocabularies are smaller. | Reduces variance but introduces dependence on multilingual embeddings. |

Key takeaways

  • Metric scores are not language‑agnostic: a low perplexity in one language does not guarantee comparable semantic fidelity in another.
  • Multilingual models do not uniformly dominate monolingual ones; they sometimes produce higher perplexities even when they capture meaning better.
  • Semantic‑aware perplexity narrows the gap but still cannot fully resolve the comparability issue.
  • The form‑meaning mismatch (e.g., agglutinative vs. analytic languages) explains why pure information‑theoretic measures diverge: they capture surface entropy, not underlying meaning equivalence.
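
A toy numeric example (the numbers are invented for illustration, not taken from the paper) makes the last point tangible: two translations can receive the same total surprisal from a model and still get very different perplexities, purely because of segmentation.

```python
import math

# Suppose a model assigns the *same* total information content, 60 nats,
# to an aligned English/Finnish sentence pair, but the tokenizer splits
# the analytic English sentence into 20 subwords and the agglutinative
# Finnish one into only 12.
total_nll_nats = 60.0
tokens_en, tokens_fi = 20, 12

ppl_en = math.exp(total_nll_nats / tokens_en)   # exp(3.0) ~ 20.1
ppl_fi = math.exp(total_nll_nats / tokens_fi)   # exp(5.0) ~ 148.4

# Same meaning, same total surprisal, yet Finnish looks roughly 7x
# "harder" simply because its surface form is packed into fewer tokens.
```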

Practical Implications

  • Model selection: When choosing a multilingual CLM for production (e.g., a translation‑assist tool), don’t rely solely on perplexity or bits‑per‑character as a “one‑size‑fits‑all” score. Complement intrinsic metrics with task‑specific downstream evaluations (BLEU, METEOR, human rating).
  • Benchmark design: Teams building multilingual benchmarks should report per‑language baselines and avoid aggregating perplexity across languages without normalization.
  • Tokenization strategy: The study highlights how tokenization (subword vs. character) can inflate or deflate metric values, especially for morphologically rich languages. Consider language‑specific tokenizers or byte‑level models when comparing across languages.
  • Monitoring production models: For services that serve many languages (e.g., chatbots), tracking a semantic‑aware metric alongside traditional perplexity can give early warnings about meaning drift even when surface‑level scores look healthy (a minimal monitoring sketch follows this list).
  • Research pipelines: The findings encourage the community to develop multilingual intrinsic metrics that factor in cross‑lingual semantic similarity, perhaps leveraging multilingual sentence embeddings (e.g., LASER, MUSE) as a weighting factor.
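
For the monitoring and weighting ideas above, here is a hedged sketch of what tracking a meaning-level signal next to perplexity could look like in a serving stack. The encode() wrapper stands in for some multilingual sentence encoder (a LASER-style model, for instance), and the per-token log-probabilities are assumed to be available from the production model; this illustrates the general idea rather than the paper's exact semantic-aware formulation.

```python
import math
import numpy as np

def encode(sentence: str) -> np.ndarray:
    """Hypothetical wrapper around a multilingual sentence encoder
    returning an L2-normalized embedding vector."""
    raise NotImplementedError

def monitor(source: str, output: str, output_token_logprobs: list[float]) -> dict:
    """Report a surface-level and a meaning-level signal side by side."""
    # Surface-level: standard per-token perplexity of the served output.
    ppl = math.exp(-sum(output_token_logprobs) / len(output_token_logprobs))
    # Meaning-level: cosine similarity between source and output embeddings
    # (plain dot product, since the vectors are assumed L2-normalized).
    semantic_sim = float(np.dot(encode(source), encode(output)))
    return {"ppl": ppl, "semantic_similarity": semantic_sim}
```

The alerting pattern to watch for is the one the post warns about: perplexity that stays flat or even improves while semantic similarity drops, i.e., healthy-looking form with drifting meaning.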

Limitations & Future Work

  • Scope of languages: The experiments focus on European languages with relatively high‑quality parallel corpora; results may differ for low‑resource or non‑Indo‑European languages.
  • Metric set: Only six intrinsic metrics were examined; newer measures (e.g., contrastive loss‑based scores) remain unexplored.
  • Embedding dependence: Semantic‑aware perplexity relies on pre‑trained multilingual embeddings, which themselves inherit biases and may not perfectly capture meaning across domains.
  • Future directions suggested by the authors include:
    • Extending the analysis to non‑parallel multilingual evaluation (e.g., cross‑lingual retrieval).
    • Designing information‑theoretic metrics that explicitly separate form from meaning, perhaps via disentangled representation learning.
    • Conducting human studies to validate which intrinsic scores best predict perceived translation quality across languages.

By surfacing these hidden assumptions and providing concrete evidence of their impact, the paper equips developers and researchers with a more nuanced toolkit for evaluating multilingual language models—moving beyond “lower perplexity is always better” toward a richer, meaning‑aware assessment.

Authors

  • Wessel Poelman
  • Miryam de Lhoneux

Paper Information

  • arXiv ID: 2601.10580v1
  • Categories: cs.CL
  • Published: January 15, 2026