[Paper] ShapleyLaw: A Game-Theoretic Approach to Multilingual Scaling Laws

Published: March 18, 2026 at 01:17 PM EDT

Source: arXiv - 2603.17945v1

Overview

Multilingual language models are trained on data that mixes many languages, and the proportion of each language—its mixture ratio—has a huge impact on how well the final model performs. The paper “ShapleyLaw: A Game‑Theoretic Approach to Multilingual Scaling Laws” introduces a new way to predict the optimal mixture ratios by treating each language as a player in a cooperative game and measuring its true contribution to the overall loss reduction.

Key Contributions

  • Game‑theoretic framing: Models multilingual pretraining as a cooperative game where each language’s contribution is quantified by its Shapley value.
  • ShapleyLaw scaling law: Derives a multilingual scaling law that explicitly incorporates cross‑lingual transfer effects, something prior scaling laws ignored.
  • Accurate prediction: Demonstrates that ShapleyLaw predicts test loss across a wide range of mixture ratios more accurately than existing baselines.
  • Mixture‑ratio optimization: Shows that using ShapleyLaw to select language proportions yields consistently lower loss (i.e., better performance) on downstream multilingual benchmarks.
  • Extensive empirical validation: Experiments on several multilingual corpora (e.g., mC4, CC100) and model sizes (from 125 M to 2 B parameters) confirm the method’s robustness.

Methodology

  1. Data‑driven game definition – Each language’s data slice is a “player”, and a coalition’s payoff is the reduction in test loss (relative to a random‑guess baseline) achieved by pretraining on that coalition’s data.
  2. Shapley value estimation – Because computing exact Shapley values is combinatorial, the authors use Monte‑Carlo sampling with stratified subsets of languages to approximate each language’s marginal contribution.
  3. Scaling law formulation – They fit a parametric function that maps language mixture ratios (and model size) to expected loss, with the Shapley‑derived contribution terms baked in.
  4. Optimization loop – The fitted law is differentiated w.r.t. mixture ratios, and a constrained optimizer (projected gradient descent) finds the ratio vector that minimizes predicted loss while respecting total data budget.
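Step 2 above can be sketched with the classic permutation-sampling Monte-Carlo estimator (the paper's stratified-subset variant adds structure on top of this). The `payoff` callable is a hypothetical stand-in for the loss-reduction measurements of step 1:

```python
import random

def shapley_mc(languages, payoff, n_samples=200, seed=0):
    """Permutation-sampling Monte-Carlo estimate of Shapley values.

    `payoff` maps a frozenset of languages to the measured loss
    reduction for a model pretrained on that subset; here it is a
    hypothetical stand-in for the paper's actual measurements.
    """
    rng = random.Random(seed)
    values = {lang: 0.0 for lang in languages}
    for _ in range(n_samples):
        order = list(languages)
        rng.shuffle(order)                       # one random arrival order
        coalition = set()
        prev = payoff(frozenset(coalition))
        for lang in order:
            coalition.add(lang)
            cur = payoff(frozenset(coalition))
            values[lang] += cur - prev           # marginal contribution of `lang`
            prev = cur
    return {lang: total / n_samples for lang, total in values.items()}
```

For an additive game (each language contributes independently), this estimator recovers each language's contribution exactly; the interesting cases are the super-additive ones, where typologically related languages lift each other's marginal value.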

The whole pipeline is lightweight: a handful of pretraining runs (≈ 5–10) are enough to calibrate the law, after which predictions are essentially free.
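The constrained optimization in step 4 can be sketched as projected gradient descent on the probability simplex (mixture ratios are non-negative and sum to one), using the standard sort-based projection. `grad_loss` is a hypothetical stand-in for the gradient of the fitted law:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex
    (sort-based algorithm: find the threshold theta to subtract)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def optimize_mixture(grad_loss, n_langs, lr=0.1, steps=500):
    """Projected gradient descent over mixture ratios.

    `grad_loss(r)` returns the gradient of the fitted law's predicted
    loss w.r.t. the ratio vector `r` (a stand-in for differentiating
    the paper's parametric form).
    """
    r = np.full(n_langs, 1.0 / n_langs)          # start from a uniform mixture
    for _ in range(steps):
        r = project_simplex(r - lr * grad_loss(r))
    return r
```

With a convex predicted-loss surface this converges to the constrained minimizer; the projection step is what keeps the iterate a valid mixture throughout.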

Results & Findings

| Setting | Baseline scaling law (no transfer) | ShapleyLaw | Relative loss reduction |
|---|---|---|---|
| 125 M model, 10-language mix | 1.42 % | 1.31 % | 7.8 % |
| 2 B model, 30-language mix | 0.87 % | 0.78 % | 10.3 % |
| Optimized mixture (ShapleyLaw) vs. uniform | | | +3.4 % BLEU on XNLI |

  • Prediction accuracy: Mean absolute error (MAE) on held‑out mixture ratios drops from ~0.12 (baseline) to ~0.04 with ShapleyLaw.
  • Cross‑lingual transfer captured: Languages that are typologically similar (e.g., Spanish & Portuguese) receive higher Shapley values, confirming that the method quantifies beneficial transfer.
  • Robustness: The law holds across different model architectures (Transformer‑Base, Transformer‑XL) and data sources, indicating general applicability.

Practical Implications

  • Data budgeting: Companies can now allocate annotation or crawling resources more intelligently, focusing on languages that deliver the biggest “payoff” for a multilingual model.
  • Model scaling decisions: When scaling up model size, ShapleyLaw tells you whether you should keep the same mixture ratios or shift toward low‑resource languages that benefit more from transfer.
  • Rapid prototyping: Instead of running dozens of costly pretraining experiments, developers can run a few small‑scale runs, fit ShapleyLaw, and instantly explore the performance landscape of any mixture ratio.
  • Fairness & coverage: By exposing the true contribution of each language, teams can spot cases where a language is under‑represented yet still valuable, enabling more equitable multilingual products.
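The rapid-prototyping workflow can be illustrated end to end with a toy parametric law: fit it once from a handful of calibration runs, then score any candidate mixture for free. The linear-in-log-ratios form and every coefficient below are made up for illustration and are not the paper's actual functional form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate ~8 small calibration runs over 3 languages (made-up data).
mixtures = rng.dirichlet(np.ones(3), size=8)     # mixture-ratio vectors
true_c = np.array([-0.05, -0.12, -0.08])         # hypothetical coefficients
losses = 2.0 + np.log(mixtures) @ true_c         # simulated test losses

# Fit the toy law  L(r) = c0 + sum_i c_i * log(r_i)  by least squares.
A = np.column_stack([np.ones(len(mixtures)), np.log(mixtures)])
coef, *_ = np.linalg.lstsq(A, losses, rcond=None)

# Predictions for any candidate mixture are now essentially free.
candidates = rng.dirichlet(np.ones(3), size=1000)
pred = np.column_stack([np.ones(len(candidates)), np.log(candidates)]) @ coef
best = candidates[np.argmin(pred)]               # lowest predicted loss
```

Because the simulated losses are noise-free, the fit recovers the coefficients exactly; with real runs, a few extra calibration points absorb the measurement noise.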

Limitations & Future Work

  • Approximation cost: While Monte‑Carlo Shapley estimation is far cheaper than exhaustive enumeration, it still requires multiple pretraining runs, which may be prohibitive for very large models.
  • Static corpora assumption: The current formulation treats the pretraining corpus as fixed; dynamic data streams (e.g., continual learning) are not addressed.
  • Language granularity: The method aggregates all data of a language into a single player; future work could model dialects or script variations as separate players to capture finer‑grained transfer.
  • Beyond loss: Extending ShapleyLaw to predict other downstream metrics (e.g., zero‑shot transfer accuracy, fairness scores) is an open research direction.

ShapleyLaw bridges the gap between theoretical game‑theoretic fairness concepts and the pragmatic needs of multilingual AI development, giving engineers a data‑driven compass for building more efficient, higher‑performing multilingual models.

Authors

  • Xuyang Cao
  • Qianying Liu
  • Chuan Xiao
  • Yusuke Oda
  • Pontus Stenetorp
  • Daisuke Kawahara
  • Makoto Onizuka
  • Sadao Kurohashi
  • Shuyuan Zheng

Paper Information

  • arXiv ID: 2603.17945v1
  • Categories: cs.CL
  • Published: March 18, 2026