[Paper] Adapting Large Language Models to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study
Source: arXiv - 2512.03976v1
Overview
This paper tackles a practical problem that many developers face: how to make a powerful large language model (LLM) work well for a language that has very little digital text—Tibetan. By fine‑tuning the open‑source Qwen2.5‑3B model in two steps, the authors show that you can dramatically improve both the model’s general language understanding (lower perplexity) and its ability to translate from Chinese to Tibetan.
Key Contributions
- Two‑stage adaptation pipeline – first Continual Pre‑training (CPT) on raw Tibetan corpora, then Supervised Fine‑Tuning (SFT) on translation and other downstream tasks.
- Quantitative baseline for Tibetan – the first systematic evaluation of LLM adaptation dynamics for Tibetan, including perplexity and translation metrics (BLEU, chrF).
- Layer‑wise analysis at scale – inspection of 435 layers in the larger Qwen3‑4B model reveals where language knowledge concentrates (the embeddings and output head) and how task‑specific changes propagate through mid‑to‑late MLP layers.
- Open, reproducible framework – all data preprocessing scripts, training configs, and evaluation code are released, enabling other teams to replicate the workflow for any low‑resource language.
Methodology
- Data collection – the authors gathered ~1.2 GB of Tibetan text from web crawls, religious scriptures, and community forums, then cleaned and tokenized it with a Tibetan‑aware tokenizer (a plausible script‑based filtering heuristic is sketched after this list).
- Continual Pre‑training (CPT) – the base Qwen2.5‑3B model continues its language‑model training on the Tibetan corpus only. This step builds a “Tibetan semantic manifold” without overwriting the multilingual knowledge already encoded in the model.
- Supervised Fine‑Tuning (SFT) – a parallel dataset of Chinese‑to‑Tibetan sentence pairs (≈30 k examples) and a small set of classification/QA tasks in Tibetan are used to teach the model how to produce useful outputs for specific applications.
- Evaluation – perplexity on a held‑out Tibetan test set measures general language modeling ability; BLEU and chrF scores assess translation quality. For deeper insight, the authors probe activations across every layer of a larger 4‑billion‑parameter sibling model.
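The summary above does not spell out the exact cleaning rules, but a common approach for corpora like this is script‑based filtering: keep only lines dominated by characters from the Tibetan Unicode block (U+0F00 to U+0FFF). The sketch below is a minimal illustration of that idea under stated assumptions; the file names, the 0.7 threshold, and the function names are hypothetical, not the authors' preprocessing code.

```python
# One plausible cleaning heuristic (not the authors' exact pipeline): keep only
# lines whose non-whitespace characters are mostly drawn from the Tibetan
# Unicode block (U+0F00-U+0FFF). File names and the 0.7 threshold are illustrative.

def tibetan_ratio(line: str) -> float:
    """Fraction of non-whitespace characters that fall inside the Tibetan block."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for c in chars if 0x0F00 <= ord(c) <= 0x0FFF)
    return tibetan / len(chars)


def clean_corpus(in_path: str, out_path: str, threshold: float = 0.7) -> None:
    """Stream a raw crawl and keep lines dominated by Tibetan script."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if line and tibetan_ratio(line) >= threshold:
                dst.write(line + "\n")


if __name__ == "__main__":
    clean_corpus("raw_crawl.txt", "tibetan_clean.txt")  # hypothetical file names
```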
The pipeline is deliberately simple: no architectural changes, just careful data curation and staged training, which makes it easy to adopt with existing open‑source tooling (e.g., Hugging Face Transformers, DeepSpeed); a minimal sketch of both stages follows.
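The sketch below shows one way the two-stage recipe could look with Hugging Face Transformers. The base model name comes from the paper; the file names (`tibetan_clean.txt`, `zh_bo_pairs.jsonl`), the prompt template, the hyperparameters, and the choice to compute the SFT loss over the full prompt-plus-target sequence are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the two-stage recipe (CPT then SFT) with Hugging Face
# Transformers. Model name from the paper; file names, prompt template, and
# hyperparameters are illustrative assumptions.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "Qwen/Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)


def train_stage(dataset, output_dir, epochs):
    """One causal-LM training pass; the same model object carries over CPT -> SFT."""
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        num_train_epochs=epochs,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
        save_strategy="epoch",
    )
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=collator).train()


# Stage 1: continual pre-training (CPT) on the cleaned raw Tibetan corpus.
cpt_ds = load_dataset("text", data_files="tibetan_clean.txt")["train"]
cpt_ds = cpt_ds.map(tokenize, batched=True, remove_columns=["text"])
train_stage(cpt_ds, "qwen2.5-3b-tibetan-cpt", epochs=1)


# Stage 2: supervised fine-tuning (SFT) on Chinese->Tibetan pairs, assumed to be
# stored as {"zh": ..., "bo": ...} records in a JSONL file. For brevity the loss
# covers the full sequence; masking the prompt tokens is a common refinement.
def to_prompt(example):
    return {"text": (f"Translate the Chinese sentence into Tibetan.\n"
                     f"Chinese: {example['zh']}\nTibetan: {example['bo']}"
                     f"{tokenizer.eos_token}")}


sft_ds = load_dataset("json", data_files="zh_bo_pairs.jsonl")["train"]
sft_ds = sft_ds.map(to_prompt)
sft_ds = sft_ds.map(tokenize, batched=True, remove_columns=sft_ds.column_names)
train_stage(sft_ds, "qwen2.5-3b-tibetan-sft", epochs=3)
```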
Results & Findings
| Metric | Baseline (Qwen2.5‑3B) | After CPT | After CPT + SFT |
|---|---|---|---|
| Perplexity (Tibetan) | 2.98 | 1.54 | 1.48 |
| BLEU (Zh→Ti) | 0.046 | 0.172 | 0.261 |
| chrF (Zh→Ti) | 2.2 | 4.8 | 6.6 |
- Perplexity drops by ~48 % after CPT, indicating the model now “understands” Tibetan syntax and morphology much better.
- Translation quality improves substantially after the full two‑stage process: BLEU rises more than fivefold (0.046 → 0.261) and chrF triples (2.2 → 6.6), moving output from near‑random to a level that can be useful for draft translations.
- Layer analysis shows that CPT mainly reshapes the embedding matrix and the final language‑model head, while SFT introduces nuanced changes in the middle MLP layers that specialize the model for translation. Importantly, the earlier layers remain relatively stable, suggesting that the multilingual foundation is preserved.
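For context, here is a hedged sketch of how the reported metrics are commonly computed: perplexity as the exponential of the average per‑token negative log‑likelihood, and corpus‑level BLEU/chrF via the sacrebleu package. The checkpoint path, the per‑sentence perplexity loop (no sliding window), and sacrebleu's default tokenization are assumptions; the paper's exact evaluation protocol and score scaling may differ.

```python
# Hedged sketch of the reported metrics: perplexity as exp(mean per-token NLL)
# and corpus-level BLEU / chrF via sacrebleu. Checkpoint path and the simple
# per-sentence loop (no sliding window) are illustrative assumptions.
import math

import sacrebleu
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "qwen2.5-3b-tibetan-sft"  # hypothetical path to a fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForCausalLM.from_pretrained(CKPT).eval()


@torch.no_grad()
def perplexity(sentences):
    """exp of the token-weighted average negative log-likelihood."""
    total_nll, total_tokens = 0.0, 0
    for text in sentences:
        ids = tokenizer(text, return_tensors="pt").input_ids
        if ids.size(1) < 2:                 # nothing to predict for 0/1-token inputs
            continue
        loss = model(ids, labels=ids).loss  # mean NLL for this sentence
        n_predicted = ids.size(1) - 1       # labels are shifted inside the model
        total_nll += loss.item() * n_predicted
        total_tokens += n_predicted
    return math.exp(total_nll / total_tokens)


def translation_scores(hypotheses, references):
    """Corpus-level BLEU and chrF (sacrebleu defaults; Tibetan-specific
    tokenization for BLEU is a detail the summary does not specify)."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
    chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
    return bleu, chrf
```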
Practical Implications
- Rapid localization – Companies looking to add Tibetan (or any low‑resource language) to their chatbots, search, or content‑moderation pipelines can follow this two‑stage recipe instead of training a model from scratch.
- Cost‑effective fine‑tuning – CPT can be run on a single GPU for a few days on modest data, and SFT requires only a few thousand parallel sentences, which many NGOs or community groups can collect.
- Transferable insights – The layer‑wise findings give developers clues about where to “inject” language‑specific knowledge (embeddings) and where to focus task‑specific updates (mid‑to‑late MLP layers), informing future parameter‑efficient adaptation methods like LoRA or adapters; see the sketch after this list.
- Open‑source ecosystem boost – By releasing the scripts and checkpoints, the authors lower the barrier for open‑source LLMs to serve underrepresented languages, aligning with responsible AI and digital inclusion goals.
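As an illustration of that last point, the sketch below maps the layer‑wise findings onto a LoRA configuration with the peft library: low‑rank adapters on the MLP projections that SFT reshaped, plus fully trained embeddings and output head where CPT concentrated its changes. The module names assume a Qwen2‑style architecture, and the ranks, dropout, and choice of fully trained modules are assumptions; the paper's own experiments use staged full fine‑tuning, with LoRA mentioned only as a direction these findings could inform.

```python
# Illustrative mapping of the layer-wise findings onto a LoRA setup with peft:
# low-rank adapters on the MLP projections (where SFT's task-specific changes
# appeared) plus fully trained embeddings and LM head (where CPT concentrated).
# Module names assume a Qwen2-style architecture; ranks and dropout are guesses.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Mid-to-late MLP projections carried the translation-specific changes.
    target_modules=["gate_proj", "up_proj", "down_proj"],
    # Embeddings and the output head held most language-specific knowledge,
    # so train them fully rather than adapting them with low-rank updates.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # shows the reduced trainable footprint
```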
Limitations & Future Work
- Data size & diversity – Even with 1.2 GB of Tibetan text, the corpus is still narrow (mostly religious and formal domains), which may limit performance on colloquial or domain‑specific use cases.
- Evaluation scope – The study focuses on Chinese‑to‑Tibetan translation; broader downstream tasks (e.g., summarization, question answering) remain untested.
- Scalability to even larger models – While a 4‑billion‑parameter model was probed, the actual fine‑tuning experiments were limited to the 3‑billion‑parameter Qwen2.5. Exploring whether the same gains hold for 10B‑ or 70B‑parameter models is an open question.
- Cross‑lingual drift – The authors note a slight increase in perplexity on other languages after CPT, hinting at a trade‑off between specialization and multilingual retention that future work could address with multi‑task continual training.
Bottom line: The paper provides a pragmatic, reproducible roadmap for adapting LLMs to low‑resource languages—turning a theoretical challenge into a set of concrete steps that developers can start using today.
Authors
- Lifeng Chen
- Ryan Lai
- Tianming Liu
Paper Information
- arXiv ID: 2512.03976v1
- Categories: cs.CL
- Published: December 3, 2025