[Paper] Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

Published: February 19, 2026 at 01:56 PM EST
Source: arXiv - 2602.17653v1

Overview

A new study probes how large language models (LLMs) internalize subtle typological patterns found in human languages, specifically differential argument marking (DAM), a system in which case morphology is applied only to certain arguments, depending on their semantic prominence. By training GPT‑2 on carefully crafted synthetic languages, the authors identify which cross‑linguistic tendencies the models pick up and which they miss, shedding light on the inductive biases of modern LLMs.

Key Contributions

  • Synthetic DAM corpora: Created 18 miniature “languages,” each implementing a distinct DAM system (different combinations of which arguments are overtly marked).
  • Controlled training regime: Fine‑tuned GPT‑2 on each synthetic corpus, with each corpus constructed to vary only the targeted typological feature.
  • Minimal‑pair evaluation: Designed probing sentences that isolate the model’s preference for marking subjects vs. objects and for marking “typical” vs. “atypical” arguments.
  • Typological dissociation: Demonstrated that LLMs reliably learn the markedness direction (favoring overt marking of semantically atypical arguments) but fail to capture the object‑bias (the human tendency to mark objects more often than subjects).
  • Interpretive framework: Argued that different typological regularities may stem from distinct learning pressures, some emerging from the data distribution and others requiring deeper semantic grounding.

Methodology

  1. Design of synthetic languages – The authors built 18 artificial grammars, each encoding a unique DAM rule (e.g., “mark the object when it is animate” vs. “mark the subject when it is low‑prominence”). Vocabulary, word order, and other syntax were held constant to isolate DAM.
  2. Model training – A base GPT‑2 (124M parameters) was fine‑tuned on each corpus for a few epochs, so that the training data for each model contained only the intended DAM patterns.
  3. Probing with minimal pairs – For every trained model, the team generated sentence pairs that differ only in the presence/absence of case marking on the subject or object. The model’s next‑token probabilities were used to infer a preference for one marking pattern over the other.
  4. Statistical analysis – Preference scores were aggregated across corpora and compared to typological statistics from the World Atlas of Language Structures (WALS) to see where the model aligns with human languages.
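
Steps 1 and 3 can be illustrated with a toy generator. This is a minimal sketch, not the authors' code: the lexicon, the `-ka` case suffix, the subject-object-verb order, and the `DamRule` encoding are all assumptions made for illustration, and animacy stands in for semantic prominence. Each sentence marks one argument role only when its animacy matches a trigger value, and a minimal pair is built by toggling that single marker.

```python
import random
from dataclasses import dataclass
from typing import Tuple

# Hypothetical toy lexicon; the paper's actual vocabulary is not given here.
ANIMATE = ["girl", "dog", "farmer"]
INANIMATE = ["rock", "book", "wind"]
VERBS = ["sees", "moves", "hits"]
MARKER = "-ka"  # assumed overt case suffix


@dataclass(frozen=True)
class DamRule:
    role: str              # argument that can be marked: "subject" or "object"
    mark_if_animate: bool  # marker appears when the argument's animacy equals this


def mark(noun: str, animate: bool, role: str, rule: DamRule) -> str:
    """Attach the marker only if this role is targeted and animacy matches."""
    if role == rule.role and animate == rule.mark_if_animate:
        return noun + MARKER
    return noun


def sample_noun(rng: random.Random) -> Tuple[str, bool]:
    """Draw a noun and report whether it is animate."""
    animate = rng.random() < 0.5
    return rng.choice(ANIMATE if animate else INANIMATE), animate


def generate_sentence(rule: DamRule, rng: random.Random) -> str:
    """One subject-object-verb sentence that obeys the rule."""
    subj, subj_anim = sample_noun(rng)
    obj, obj_anim = sample_noun(rng)
    words = [
        mark(subj, subj_anim, "subject", rule),
        mark(obj, obj_anim, "object", rule),
        rng.choice(VERBS),
    ]
    return " ".join(words)


def minimal_pair(rule: DamRule, rng: random.Random) -> Tuple[str, str]:
    """Return (rule-conforming, rule-violating) sentences that differ
    only in the presence of the marker on the targeted argument."""
    good = generate_sentence(rule, rng).split()
    bad = list(good)
    i = 0 if rule.role == "subject" else 1
    if bad[i].endswith(MARKER):
        bad[i] = bad[i][: -len(MARKER)]  # remove a required marker
    else:
        bad[i] = bad[i] + MARKER         # add an unlicensed marker
    return " ".join(good), " ".join(bad)
```

For example, `DamRule(role="object", mark_if_animate=True)` yields a language that marks only animate objects, while `DamRule(role="subject", mark_if_animate=False)` marks only inanimate (low-prominence) subjects. Passing a seeded `random.Random` keeps the generated corpus reproducible.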

Results & Findings

  • Markedness direction aligns with humans: Across all 18 DAM systems, GPT‑2 consistently prefers to mark the semantically atypical argument (e.g., a low‑prominence subject) over the typical one, mirroring the cross‑linguistic tendency observed in natural languages.
  • Object‑bias absent: Unlike human languages, where overt marking more often targets objects, the models show no systematic preference for object marking; their scores hover around chance.
  • Consistency across corpora: The markedness effect holds regardless of whether the language marks subjects, objects, or both, suggesting the bias is robust to surface variations.
  • Implication of separate sources: The split performance hints that the markedness direction may be learned from simple distributional cues, while the object‑bias likely requires richer semantic or pragmatic information that the synthetic training regime does not provide.
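
To make "hovering around chance" concrete, a preference score over minimal pairs can be computed as the fraction of pairs in which the model assigns the higher score to the expected member. The sketch below is an assumption about how such a score might be set up, not the paper's evaluation code: `score` is a hypothetical callable standing in for a trained model, and summing token log-probabilities is one common way to score a whole sentence (the summary does not say whether the authors score the full sentence or only the critical token).

```python
import math
from typing import Callable, Iterable, Sequence, Tuple


def sequence_logprob(token_probs: Sequence[float]) -> float:
    """Sum of log-probabilities of each token given its left context."""
    return sum(math.log(p) for p in token_probs)


def pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (expected, alternative) pairs where `score` ranks the
    expected sentence higher. Ties count as half, so an indifferent
    scorer lands exactly at the 0.5 chance level."""
    wins = 0.0
    total = 0
    for expected, alternative in pairs:
        a, b = score(expected), score(alternative)
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
        total += 1
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return wins / total


# Toy pairs with a hypothetical "-ka" marker on a low-prominence subject.
pairs = [
    ("rock-ka dog sees", "rock dog sees"),
    ("wind-ka girl hits", "wind girl hits"),
]


def indifferent(sentence: str) -> float:
    """Stand-in scorer with no preference at all."""
    return 0.0


def likes_marker(sentence: str) -> float:
    """Stand-in scorer that rewards every overt marker."""
    return float(sentence.count("-ka"))


print(pair_accuracy(pairs, indifferent))   # 0.5
print(pair_accuracy(pairs, likes_marker))  # 1.0
```

In a real evaluation, `score` would wrap the trained language model; an accuracy well above 0.5 would indicate a systematic preference, and values near 0.5 would correspond to the absent object bias described above.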

Practical Implications

  • Better multilingual LM diagnostics: Understanding which typological patterns LLMs naturally acquire can guide developers when fine‑tuning models for low‑resource languages that rely heavily on DAM (e.g., many Turkic or Austronesian languages).
  • Targeted data augmentation: Since the object‑bias does not emerge automatically, practitioners can inject curated examples emphasizing object marking to improve downstream tasks like case‑sensitive parsing or machine translation.
  • Explainability for downstream NLP: Knowing that a model’s DAM preferences are driven by markedness rather than argument role can help debug errors in syntactic generation or morphological inflection systems.
  • Design of synthetic pre‑training curricula: The study showcases a scalable recipe for probing other typological phenomena (e.g., agreement, evidentiality) before committing to large‑scale multilingual pre‑training, saving compute and data collection costs.

Limitations & Future Work

  • Synthetic vs. natural data: The corpora are highly controlled and lack the noise and lexical diversity of real languages, so findings may not transfer directly to fully natural settings.
  • Model size and architecture: Only a single GPT‑2 variant was examined; larger models or encoder‑decoder architectures might exhibit different DAM behaviors.
  • Semantic depth: The current setup does not test whether models truly understand the underlying semantics of “prominence,” leaving open the question of whether deeper grounding would yield the missing object‑bias.
  • Broader typological scope: Future research could extend the methodology to other cross‑linguistic dimensions (e.g., split ergativity, evidential systems) and explore interactions between multiple typological features.

Authors

  • Iskar Deng
  • Nathalia Xu
  • Shane Steinert-Threlkeld

Paper Information

  • arXiv ID: 2602.17653v1
  • Categories: cs.CL
  • Published: February 19, 2026