[Paper] Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study
Source: arXiv - 2512.03976v1
Overview
This paper tackles a practical problem that many developers face: how to make a powerful large language model (LLM) work well for a language that has very little digital text—Tibetan. By fine‑tuning the open‑source Qwen2.5‑3B model in two steps, the authors show that you can dramatically improve both the model’s general language understanding (lower perplexity) and its ability to translate from Chinese to Tibetan.
Key Contributions
- Two‑stage adaptation pipeline – first Continual Pre‑training (CPT) on raw Tibetan corpora, then Supervised Fine‑Tuning (SFT) on translation and other downstream tasks.
- Quantitative baseline for Tibetan – the first systematic evaluation of LLM adaptation dynamics for Tibetan, including perplexity and translation metrics (BLEU, chrF).
- Layer‑wise analysis at scale – inspection of 435 layers in the larger Qwen3‑4B model reveals where language knowledge concentrates (the embeddings and output head) and how task‑specific changes propagate through mid‑to‑late MLP layers.
- Open, reproducible framework – all data preprocessing scripts, training configs, and evaluation code are released, enabling other teams to replicate the workflow for any low‑resource language.
Methodology
- Data collection – the authors gathered ~1.2 GB of Tibetan text from web crawls, religious scriptures, and community forums, then cleaned and tokenized it with a Tibetan‑aware tokenizer (a plausible script‑based filtering heuristic is sketched after this list).
- Continual Pre‑training (CPT) – the base Qwen2.5‑3B model continues its language‑model training on the Tibetan corpus only. This step builds a “Tibetan semantic manifold” without overwriting the multilingual knowledge already encoded in the model.
- Supervised Fine‑Tuning (SFT) – a parallel dataset of Chinese‑to‑Tibetan sentence pairs (≈30 k examples) and a small set of classification/QA tasks in Tibetan are used to teach the model how to produce useful outputs for specific applications.
- Evaluation – perplexity on a held‑out Tibetan test set measures general language modeling ability; BLEU and chrF scores assess translation quality. For deeper insight, the authors probe activations across every layer of a larger 4‑billion‑parameter sibling model.
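The summary above does not spell out the exact cleaning rules, but a common approach for corpora like this is script‑based filtering: keep only lines dominated by characters from the Tibetan Unicode block (U+0F00 to U+0FFF). The sketch below is a minimal illustration of that idea under stated assumptions; the file names, the 0.7 threshold, and the function names are hypothetical, not the authors' preprocessing code.

```python
# One plausible cleaning heuristic (not the authors' exact pipeline): keep only
# lines whose non-whitespace characters are mostly drawn from the Tibetan
# Unicode block (U+0F00-U+0FFF). File names and the 0.7 threshold are illustrative.

def tibetan_ratio(line: str) -> float:
    """Fraction of non-whitespace characters that fall inside the Tibetan block."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if 0x0F00 <= ord(c) <= 0x0FFF)
    return tibetan / len(chars)


def clean_corpus(in_path: str, out_path: str, threshold: float = 0.7) -> None:
    """Stream a raw crawl and keep lines dominated by Tibetan script."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if line and tibetan_ratio(line) >= threshold:
                dst.write(line + "\n")


if __name__ == "__main__":
    clean_corpus("raw_crawl.txt", "tibetan_clean.txt")  # hypothetical file names
```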
The pipeline is deliberately simple: no architectural changes, just careful data curation and staged training, which makes it easy to adopt with existing open‑source tooling (e.g., Hugging Face Transformers, DeepSpeed); a minimal sketch of both stages follows.
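The sketch below shows one way the two-stage recipe could look with Hugging Face Transformers. The base model name comes from the paper; the file names (`tibetan_clean.txt`, `zh_bo_pairs.jsonl`), the prompt template, the hyperparameters, and the choice to compute the SFT loss over the full prompt-plus-target sequence are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the two-stage recipe (CPT then SFT) with Hugging Face
# Transformers. Model name from the paper; file names, prompt template, and
# hyperparameters are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "Qwen/Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)


def train_stage(dataset, output_dir, epochs):
    """One causal-LM training pass; the same model object carries over CPT -> SFT."""
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        num_train_epochs=epochs,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
        save_strategy="epoch",
    )
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=collator).train()


# Stage 1: continual pre-training (CPT) on the cleaned raw Tibetan corpus.
cpt_ds = load_dataset("text", data_files="tibetan_clean.txt")["train"]
cpt_ds = cpt_ds.map(tokenize, batched=True, remove_columns=["text"])
train_stage(cpt_ds, "qwen2.5-3b-tibetan-cpt", epochs=1)


# Stage 2: supervised fine-tuning (SFT) on Chinese->Tibetan pairs, assumed to be
# stored as {"zh": ..., "bo": ...} records in a JSONL file. For brevity the loss
# covers the full sequence; masking the prompt tokens is a common refinement.
def to_prompt(example):
    return {"text": (f"Translate the Chinese sentence into Tibetan.\n"
                     f"Chinese: {example['zh']}\nTibetan: {example['bo']}"
                     f"{tokenizer.eos_token}")}


sft_ds = load_dataset("json", data_files="zh_bo_pairs.jsonl")["train"]
sft_ds = sft_ds.map(to_prompt)
sft_ds = sft_ds.map(tokenize, batched=True, remove_columns=sft_ds.column_names)
train_stage(sft_ds, "qwen2.5-3b-tibetan-sft", epochs=3)
```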
Results & Findings
| Metric | Baseline (Qwen2.5‑3B) | After CPT | After CPT + SFT |
|---|---|---|---|
| Perplexity (Tibetan) | 2.98 | 1.54 | 1.48 |
| BLEU (Zh→Ti) | 0.046 | 0.172 | 0.261 |
| chrF (Zh→Ti) | 2.2 | 4.8 | 6.6 |
- Perplexity drops by ~48 % after CPT, indicating the model now “understands” Tibetan syntax and morphology much better.
- Translation quality improves substantially after the full two‑stage process: BLEU rises more than fivefold (0.046 → 0.261) and chrF triples (2.2 → 6.6), moving output from near‑random to a level that can be useful for draft translations.
- Layer analysis shows that CPT mainly reshapes the embedding matrix and the final language‑model head, while SFT introduces nuanced changes in the middle MLP layers that specialize the model for translation. Importantly, the earlier layers remain relatively stable, suggesting that the multilingual foundation is preserved.
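For context, here is a hedged sketch of how the reported metrics are commonly computed: perplexity as the exponential of the average per‑token negative log‑likelihood, and corpus‑level BLEU/chrF via the sacrebleu package. The checkpoint path, the per‑sentence perplexity loop (no sliding window), and sacrebleu's default tokenization are assumptions; the paper's exact evaluation protocol and score scaling may differ.

```python
# Hedged sketch of the reported metrics: perplexity as exp(mean per-token NLL)
# and corpus-level BLEU / chrF via sacrebleu. Checkpoint path and the simple
# per-sentence loop (no sliding window) are illustrative assumptions.
import math

import sacrebleu
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "qwen2.5-3b-tibetan-sft"  # hypothetical path to a fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(CKPT).eval()


@torch.no_grad()
def perplexity(sentences):
    """exp of the token-weighted average negative log-likelihood."""
    total_nll, total_tokens = 0.0, 0
    for text in sentences:
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.size(1) < 2:                 # nothing to predict for 0/1-token inputs
            continue
        loss = model(ids, labels=ids).loss  # mean NLL for this sentence
        n_predicted = ids.size(1) - 1       # labels are shifted inside the model
        total_nll += loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)


def translation_scores(hypotheses, references):
    """Corpus-level BLEU and chrF (sacrebleu defaults; Tibetan-specific
    tokenization for BLEU is a detail the summary does not specify)."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    return bleu, chrf
```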
Practical Implications
- Rapid localization – Companies looking to add Tibetan (or any low‑resource language) to their chatbots, search, or content‑moderation pipelines can follow this two‑stage recipe instead of training a model from scratch.
- Cost‑effective fine‑tuning – CPT can be run on a single GPU for a few days on modest data, and SFT requires only a few thousand parallel sentences, which many NGOs or community groups can collect.
- Transferable insights – The layer‑wise findings give developers clues about where to “inject” language‑specific knowledge (embeddings) and where to focus task‑specific updates (mid‑to‑late MLP layers), informing future parameter‑efficient adaptation methods like LoRA or adapters; see the sketch after this list.
- Open‑source ecosystem boost – By releasing the scripts and checkpoints, the authors lower the barrier for open‑source LLMs to serve underrepresented languages, aligning with responsible AI and digital inclusion goals.
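As an illustration of that last point, the sketch below maps the layer‑wise findings onto a LoRA configuration with the peft library: low‑rank adapters on the MLP projections that SFT reshaped, plus fully trained embeddings and output head where CPT concentrated its changes. The module names assume a Qwen2‑style architecture, and the ranks, dropout, and choice of fully trained modules are assumptions; the paper's own experiments use staged full fine‑tuning, with LoRA mentioned only as a direction these findings could inform.

```python
# Illustrative mapping of the layer-wise findings onto a LoRA setup with peft:
# low-rank adapters on the MLP projections (where SFT's task-specific changes
# appeared) plus fully trained embeddings and LM head (where CPT concentrated).
# Module names assume a Qwen2-style architecture; ranks and dropout are guesses.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Mid-to-late MLP projections carried the translation-specific changes.
    target_modules=["gate_proj", "up_proj", "down_proj"],
    # Embeddings and the output head held most language-specific knowledge,
    # so train them fully rather than adapting them with low-rank updates.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # shows the reduced trainable footprint
```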
Limitations & Future Work
- Data size & diversity – Even with 1.2 GB of Tibetan text, the corpus is still narrow (mostly religious and formal domains), which may limit performance on colloquial or domain‑specific use cases.
- Evaluation scope – The study focuses on Chinese‑to‑Tibetan translation; broader downstream tasks (e.g., summarization, question answering) remain untested.
- Scalability to even larger models – While a 4‑billion‑parameter model was probed, the actual fine‑tuning experiments were limited to the 3‑billion‑parameter Qwen2.5. Exploring whether the same gains hold for 10B‑ or 70B‑parameter models is an open question.
- Cross‑lingual drift – The authors note a slight increase in perplexity on other languages after CPT, hinting at a trade‑off between specialization and multilingual retention that future work could address with multi‑task continual training.
Bottom line: The paper provides a pragmatic, reproducible roadmap for adapting LLMs to low‑resource languages—turning a theoretical challenge into a set of concrete steps that developers can start using today.
Authors
- Lifeng Chen
- Ryan Lai
- Tianming Liu
Paper Information
- arXiv ID: 2512.03976v1
- Categories: cs.CL
- Published: December 3, 2025