[Paper] Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer
Source: arXiv - 2512.14585v1
Overview
A new research effort tackles the long‑standing scarcity of high‑quality Nepali language models by building a GPT‑style generative model that produces fluent Nepali text. By combining a custom Nepali‑only BPE tokenizer, modern training techniques from GPT‑3, and memory‑efficient FlashAttention, the authors show that a relatively modest‑sized model can already generate coherent, news‑style Nepali sentences.
Key Contributions
- Dedicated Nepali BPE tokenizer (16 k vocab) trained exclusively on Nepali corpora, yielding more consistent sub‑word splits than multilingual tokenizers.
- GPT‑2‑based architecture trained with a GPT‑3‑inspired recipe (scaled batch sizes, learning‑rate warm‑up, cosine decay, and minor architectural tweaks).
- Efficient training pipeline using FlashAttention to cut GPU memory usage by ~30 % while keeping training stable.
- Large‑scale Nepali pre‑training data: 10.75 GB cleaned NepBERTa corpus + web‑scraped Nepali news articles (≈12 GB total).
- Empirical results: after just two epochs the model reaches a training loss of 3.168, a validation loss of 3.082, and a validation perplexity of 21.80 on held‑out Nepali text.
Methodology
- Data collection & cleaning – The authors merged the publicly available NepBERTa dataset with a freshly scraped news corpus, then applied language‑specific cleaning (deduplication, script normalization, removal of non‑Devanagari characters).
- Tokenizer design – A Byte‑Pair Encoding tokenizer with a 16 k vocabulary was trained on the combined corpus. Because it is trained on Nepali text alone, it captures common morphemes and agglutinative suffixes more reliably than multilingual tokenizers (a minimal training sketch follows this list).
- Model architecture – A standard GPT‑2 transformer (12 layers, 768 hidden size, 12 attention heads) was adopted, with minor refinements to layer‑norm placement and a slightly larger feed‑forward dimension to better handle Nepali’s rich morphology (a configuration sketch follows this list).
- Training tricks (illustrated in the training‑loop sketch after this list)
  - Learning‑rate schedule: linear warm‑up (10 k steps) → cosine decay.
  - Batch scaling: gradient accumulation to simulate large batch sizes without exceeding GPU memory.
  - FlashAttention: an exact attention kernel that tiles the computation and avoids materializing the full attention matrix, allowing the same model to train on 24 GB GPUs.
- Training regime – The model was trained for two full passes (epochs) over the ~12 GB dataset on a cluster of 8 × A100 GPUs.
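The tokenizer step can be reproduced in outline with the Hugging Face tokenizers library. This is a minimal sketch under assumptions, not the authors' released code: the corpus file name, special tokens, normalizer, and pre‑tokenizer choices are guesses; only the BPE algorithm and the 16 k vocabulary come from the paper.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Byte-Pair Encoding model with a 16k vocabulary, trained on Nepali text only.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()               # normalize Devanagari codepoint forms
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # split on whitespace/punctuation first

trainer = trainers.BpeTrainer(
    vocab_size=16_000,
    special_tokens=["[UNK]", "[PAD]", "<|endoftext|>"],  # assumed special tokens
)
# "nepali_corpus.txt" is a placeholder for the cleaned NepBERTa + news corpus.
tokenizer.train(files=["nepali_corpus.txt"], trainer=trainer)
tokenizer.save("nepali_bpe_16k.json")
```

Training on Nepali alone spends the entire 16 k budget on Devanagari sub‑words, which is why morphemes are split less aggressively than under a shared multilingual vocabulary.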
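The reported architecture (12 layers, 768 hidden units, 12 heads) matches the GPT‑2 “small” shape and can be written down with the Hugging Face transformers configuration below. The context length and exact feed‑forward width are assumptions; the summary only says the feed‑forward dimension was made slightly larger.

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=16_000,  # size of the Nepali BPE vocabulary
    n_layer=12,
    n_embd=768,
    n_head=12,
    n_positions=1024,   # assumed context length; not stated in the summary
    n_inner=3072,       # default 4 * n_embd; the paper uses a slightly larger value
)
model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")  # roughly 100M at this shape
```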
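A minimal PyTorch training loop tying the tricks together: linear warm‑up followed by cosine decay, gradient accumulation for a large effective batch, and routing attention to a FlashAttention kernel. Everything except the 10 k warm‑up steps is an assumption, including the hypothetical train_loader, the total step count, the accumulation factor, and the optimizer hyperparameters; it also assumes a CUDA GPU with half‑precision inputs so the FlashAttention backend is actually eligible.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

WARMUP_STEPS = 10_000   # from the paper
TOTAL_STEPS = 200_000   # assumption
ACCUM_STEPS = 16        # gradient-accumulation factor (assumption)

optimizer = AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)  # assumed hyperparameters

def lr_lambda(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = LambdaLR(optimizer, lr_lambda)

model.train()
for step, batch in enumerate(train_loader):  # train_loader is hypothetical
    # Route scaled_dot_product_attention to PyTorch's FlashAttention kernel.
    # This only takes effect if the model's attention goes through
    # torch.nn.functional.scaled_dot_product_attention (e.g. attn_implementation="sdpa").
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    (loss / ACCUM_STEPS).backward()  # accumulate gradients over ACCUM_STEPS micro-batches

    if (step + 1) % ACCUM_STEPS == 0:  # effective batch = ACCUM_STEPS * micro-batch size
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)
```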
Results & Findings
| Metric | Value |
|---|---|
| Training loss | 3.168 |
| Validation loss | 3.082 |
| Perplexity (validation) | 21.80 |
| Sample output | “काठमाडौंका प्रमुख समाचारहरू अनुसार, सरकारले नयाँ बजेट योजना घोषणा गर्यो…” (a fluent news‑style sentence; roughly: “According to major news from Kathmandu, the government announced a new budget plan…”) |
- Low perplexity indicates the model predicts Nepali tokens with confidence comparable to early GPT‑2 models on English (a quick loss‑to‑perplexity consistency check follows this list).
- Qualitative inspection shows the model respects Nepali syntax, correctly handles post‑positions, and produces appropriate honorifics—areas where prior encoder‑only models struggled.
- Training efficiency: FlashAttention reduced per‑step memory by ~30 % and cut wall‑clock time by ~15 % relative to vanilla attention.
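As that consistency check: validation perplexity is simply the exponentiated validation cross‑entropy loss, and exp(3.082) ≈ 21.8, which matches the 21.80 reported in the table.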
Practical Implications
- Content generation: Media outlets can prototype automated news briefs, summaries, or social‑media posts in Nepali without resorting to English‑to‑Nepali translation pipelines.
- Conversational agents: Chatbots and voice assistants built for Nepal can now rely on a generative backbone that produces natural‑sounding replies, improving user experience.
- Low‑resource fine‑tuning: Because the base model already captures Nepali morphology, downstream tasks (summarization, question answering) can be fine‑tuned with far fewer labeled examples than a multilingual LLM would need (see the hypothetical fine‑tuning sketch after this list).
- Open‑source ecosystem: The tokenizer and training scripts are lightweight enough to run on a single high‑end GPU, encouraging community contributions and domain‑specific extensions (e.g., legal or medical Nepali text).
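As a concrete illustration of the low‑resource fine‑tuning point, a small labeled Nepali dataset could be adapted with the standard Hugging Face Trainer. This is a hypothetical sketch: model, tokenizer (a transformers‑compatible wrapper around the Nepali BPE tokenizer), the dataset, and all hyperparameters are placeholders, since the summary does not describe released checkpoints or fine‑tuning settings.

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Causal-LM collator: labels are the input tokens, shifted inside the model.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="nepali-gpt-summarizer",   # placeholder run name
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=3,
    fp16=True,
)

trainer = Trainer(
    model=model,                          # the pretrained Nepali GPT (placeholder)
    args=args,
    train_dataset=nepali_task_dataset,    # a few thousand tokenized examples (placeholder)
    data_collator=collator,
)
trainer.train()
```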
Limitations & Future Work
- Scale: The model is still a GPT‑2‑size network; larger architectures could push perplexity lower and improve long‑form coherence.
- Data diversity: Training data is dominated by news text; other domains (literature, informal social media) are under‑represented, which may limit style transfer.
- Evaluation breadth: The paper reports loss and perplexity but lacks human‑rated benchmarks for factuality, bias, or toxicity in Nepali.
- Future directions suggested by the authors include scaling to GPT‑3‑level parameter counts, incorporating multilingual code‑switching data (common in Nepal), and releasing a benchmark suite for Nepali generative tasks.
Authors
- Adarsha Shrestha
- Basanta Pokharel
- Binit Shrestha
- Smriti Adhikari
- Dinesh Gothe
Paper Information
- arXiv ID: 2512.14585v1
- Categories: cs.CL, cs.AI
- Published: December 16, 2025
- PDF: https://arxiv.org/pdf/2512.14585v1