[Paper] What Kind of Language is Easy to Language-Model Under Curriculum Learning?
Source: arXiv - 2604.26844v1
Overview
The paper investigates how the learning schedule—specifically, curriculum learning (CL)—shapes the inductive biases of neural language models (LMs) when they are exposed to typologically diverse languages. By feeding models easier sentences first, the authors show that CL can dramatically alter which linguistic patterns the model finds “easy” to learn, offering a fresh angle on why certain language structures are more common across the world’s languages.
Key Contributions
- Introduces curriculum learning as a variable in LM‑based typology research, a factor previously overlooked.
- Demonstrates that CL changes the apparent inductive bias of standard transformer LMs, making them favor different word‑order patterns than when trained on randomly ordered data.
- Provides a reproducible experimental framework (synthetic corpora spanning the full typological space of word‑order configurations) that can be reused for future CL studies.
- Offers empirical evidence that developmental‑style training regimes can bring LM behavior closer to human language acquisition patterns.
Methodology
- Synthetic Language Generation – The authors programmatically create artificial languages that systematically vary in key typological features (e.g., subject‑object‑verb vs. object‑verb‑subject order).
- Curriculum Design – Two training regimes are compared:
  - Random: sentences are shuffled, mimicking the usual LM training pipeline.
  - Curriculum: sentences are sorted by syntactic complexity (short, simple clauses first; longer, nested structures later).
- Model Architecture – Standard transformer‑based language models (similar to GPT‑2 size) are trained from scratch on each synthetic language under both regimes.
- Evaluation – After training, the models are probed for their ability to predict word order and other surface features, allowing the researchers to infer the “preferred” typological patterns each model has internalized.
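The generation-and-ordering pipeline above can be sketched in a few lines. This is a toy illustration, not the authors' actual code: the vocabularies, the `make_corpus`/`curriculum_sort` names, and the use of clause count as the complexity measure are all assumptions made for the example.

```python
import random

# Toy synthetic vocabularies (hypothetical; the paper's lexicon is not specified).
SUBJECTS = ["ka", "mo", "ti"]
VERBS = ["ren", "sul", "pav"]
OBJECTS = ["do", "fi", "lu"]

# Map a word-order label to the output positions of (S, V, O).
ORDERS = {
    "SVO": (0, 1, 2),
    "SOV": (0, 2, 1),
    "VSO": (1, 0, 2),
    "OVS": (2, 1, 0),
}

def make_sentence(order: str, rng: random.Random) -> list:
    """Generate one flat clause in the requested constituent order."""
    s, v, o = rng.choice(SUBJECTS), rng.choice(VERBS), rng.choice(OBJECTS)
    slots = [None, None, None]
    s_pos, v_pos, o_pos = ORDERS[order]
    slots[s_pos], slots[v_pos], slots[o_pos] = s, v, o
    return slots

def make_corpus(order: str, n: int, max_conjuncts: int = 3, seed: int = 0) -> list:
    """Build n sentences of 1..max_conjuncts conjoined clauses;
    more conjuncts stands in for greater syntactic complexity."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(n):
        sent = []
        for i in range(rng.randint(1, max_conjuncts)):
            if i:
                sent.append("et")  # toy conjunction
            sent.extend(make_sentence(order, rng))
        corpus.append(sent)
    return corpus

def curriculum_sort(corpus: list) -> list:
    """Order sentences from simple to complex, as in the CL regime;
    the Random regime would shuffle instead."""
    return sorted(corpus, key=len)
```

Sorting by sentence length here is only a proxy; the paper's stronger results come from ordering by genuine syntactic complexity rather than raw length.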
Results & Findings
- Curriculum learning leads to a shift in bias: Models trained with CL tend to internalize the more globally common word‑order patterns (e.g., SVO) even when the underlying data distribution is balanced across all orders.
- Random training preserves the data distribution: Without a curriculum, models reflect the exact frequencies of the training data, showing no systematic preference for typologically common orders.
- Complexity‑based ordering matters: The benefit is strongest when the curriculum progresses from truly simple syntactic constructions to more complex ones; a naïve “short‑sentence first” schedule yields weaker effects.
- Generalization improves: CL‑trained models achieve lower perplexity on held‑out sentences that combine familiar simple structures in novel ways, suggesting better abstraction of the underlying grammatical rules.
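To make the generalization metric concrete: perplexity is the exponential of the mean per‑token negative log‑likelihood, so lower values mean the model assigns higher probability to held‑out text. A minimal sketch with made‑up numbers (the NLL values below are purely illustrative, not the paper's results):

```python
import math

def perplexity(token_nlls: list) -> float:
    """Perplexity = exp(mean negative log-likelihood per token); lower is better."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token NLLs for two models on the same held-out set:
nll_curriculum = [1.2, 0.9, 1.1, 1.0]  # illustrative CL-trained model
nll_random = [1.6, 1.4, 1.5, 1.3]      # illustrative randomly-ordered baseline

# A CL-trained model generalizing better would show a gap like this:
assert perplexity(nll_curriculum) < perplexity(nll_random)
```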
Practical Implications
- Better multilingual model pre‑training – Incorporating a curriculum that mirrors natural language acquisition (starting with simple utterances) could help large‑scale multilingual LMs acquire more human‑like biases, potentially improving low‑resource language performance.
- Curriculum‑aware fine‑tuning – When adapting a pre‑trained LM to a specific domain or language, ordering the fine‑tuning data from easy to hard may yield faster convergence and more robust generalization.
- Tooling for language technology – Developers building grammar checkers, parsers, or speech‑to‑text systems for typologically rare languages can leverage CL to bias models toward more “natural” structures, reducing the need for massive annotated corpora.
- Insights for AI safety – Understanding how training schedules affect model inductive biases helps anticipate emergent behaviors, a key concern when deploying LMs in high‑stakes applications.
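The curriculum‑aware fine‑tuning idea above reduces to ordering the training stream before batching. A minimal sketch, assuming a hypothetical `difficulty` proxy (token count; real curricula might use parse depth or model loss instead):

```python
def difficulty(example: str) -> int:
    """Hypothetical difficulty proxy: whitespace token count."""
    return len(example.split())

def curriculum_batches(examples: list, batch_size: int) -> list:
    """Group fine-tuning examples into batches ordered easy-to-hard."""
    ordered = sorted(examples, key=difficulty)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

In practice one would feed these batches to the trainer in order, possibly shuffling within each difficulty band to avoid overly correlated gradients.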
Limitations & Future Work
- Synthetic data only – The study relies on artificially generated languages; real‑world noise, lexical irregularities, and sociolinguistic factors are not captured.
- Single architecture – Experiments are limited to standard transformers; it remains open whether CL has similar effects on recurrent or newer architectures (e.g., retrieval‑augmented models).
- Curriculum design space – Only one notion of “simplicity” (syntactic depth) is explored. Future work could test semantic, morphological, or frequency‑based curricula.
- Long‑term learning dynamics – The paper does not examine how curriculum effects evolve with continued training beyond the initial convergence point.
Authors
- Nadine El‑Naggar
- Tatsuki Kuribayashi
- Ted Briscoe
Paper Information
- arXiv ID: 2604.26844v1
- Categories: cs.CL
- Published: April 29, 2026