[Paper] Towards Nepali-language LLMs: Efficient GPT training with a Nepali BPE tokenizer
Source: arXiv - 2512.14585v1
Overview
A new research effort tackles the long‑standing scarcity of high‑quality Nepali language models by building a GPT‑style generative model that produces fluent Nepali text. By combining a custom Nepali‑only BPE tokenizer, modern training techniques from GPT‑3, and memory‑efficient FlashAttention, the authors show that a relatively modest‑sized model can already generate coherent, news‑style Nepali sentences.
Key Contributions
- Dedicated Nepali BPE tokenizer (16 k vocab) trained exclusively on Nepali corpora, yielding more consistent sub‑word splits than multilingual tokenizers.
- GPT‑2‑based architecture trained with a GPT‑3‑inspired recipe (scaled batch sizes, learning‑rate warm‑up, cosine decay, and minor architectural tweaks).
- Efficient training pipeline using FlashAttention to cut GPU memory usage by ~30 % while keeping training stable.
- Large‑scale Nepali pre‑training data: 10.75 GB cleaned NepBERTa corpus + web‑scraped Nepali news articles (≈12 GB total).
- Empirical results: after just two epochs the model reaches a training loss of 3.168, a validation loss of 3.082, and a validation perplexity of 21.80 on held‑out Nepali text.
Methodology
- Data collection & cleaning – The authors merged the publicly available NepBERTa dataset with a freshly scraped news corpus, then applied language‑specific cleaning (deduplication, script normalization, removal of non‑Devanagari characters).
- Tokenizer design – A Byte‑Pair Encoding tokenizer with a 16 k vocabulary was trained on the combined corpus. Because it is trained on Nepali text alone, it captures common morphemes and agglutinative suffixes more reliably than multilingual tokenizers (a minimal training sketch follows this list).
- Model architecture – A standard GPT‑2 transformer (12 layers, 768 hidden size, 12 attention heads) was adopted, with minor refinements to layer‑norm placement and a slightly larger feed‑forward dimension to better handle Nepali’s rich morphology (a configuration sketch follows this list).
- Training tricks (illustrated in the training‑loop sketch after this list)
  - Learning‑rate schedule: linear warm‑up (10 k steps) → cosine decay.
  - Batch scaling: gradient accumulation to simulate large batch sizes without exceeding GPU memory.
  - FlashAttention: an exact attention kernel that tiles the computation and avoids materializing the full attention matrix, allowing the same model to train on 24 GB GPUs.
- Training regime – The model was trained for two full passes (epochs) over the ~12 GB dataset on a cluster of 8 × A100 GPUs.
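The tokenizer step can be reproduced in outline with the Hugging Face tokenizers library. This is a minimal sketch under assumptions, not the authors' released code: the corpus file name, special tokens, normalizer, and pre‑tokenizer choices are guesses; only the BPE algorithm and the 16 k vocabulary come from the paper.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Byte-Pair Encoding model with a 16k vocabulary, trained on Nepali text only.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()               # normalize Devanagari codepoint forms
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # split on whitespace/punctuation first

trainer = trainers.BpeTrainer(
    vocab_size=16_000,
    special_tokens=["[UNK]", "[PAD]", "<|endoftext|>"],  # assumed special tokens
)
# "nepali_corpus.txt" is a placeholder for the cleaned NepBERTa + news corpus.
tokenizer.train(files=["nepali_corpus.txt"], trainer=trainer)
tokenizer.save("nepali_bpe_16k.json")
```

Training on Nepali alone spends the entire 16 k budget on Devanagari sub‑words, which is why morphemes are split less aggressively than under a shared multilingual vocabulary.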
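The reported architecture (12 layers, 768 hidden units, 12 heads) matches the GPT‑2 “small” shape and can be written down with the Hugging Face transformers configuration below. The context length and exact feed‑forward width are assumptions; the summary only says the feed‑forward dimension was made slightly larger.

```python
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=16_000,  # size of the Nepali BPE vocabulary
    n_layer=12,
    n_embd=768,
    n_head=12,
    n_positions=1024,   # assumed context length; not stated in the summary
    n_inner=3072,       # default 4 * n_embd; the paper uses a slightly larger value
)
model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")  # roughly 100M at this shape
```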
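A minimal PyTorch training loop tying the tricks together: linear warm‑up followed by cosine decay, gradient accumulation for a large effective batch, and routing attention to a FlashAttention kernel. Everything except the 10 k warm‑up steps is an assumption, including the hypothetical train_loader, the total step count, the accumulation factor, and the optimizer hyperparameters; it also assumes a CUDA GPU with half‑precision inputs so the FlashAttention backend is actually eligible.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch >= 2.3

WARMUP_STEPS = 10_000   # from the paper
TOTAL_STEPS = 200_000   # assumption
ACCUM_STEPS = 16        # gradient-accumulation factor (assumption)

optimizer = AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)  # assumed hyperparameters

def lr_lambda(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = LambdaLR(optimizer, lr_lambda)

model.train()
for step, batch in enumerate(train_loader):  # train_loader is hypothetical
    # Route scaled_dot_product_attention to PyTorch's FlashAttention kernel.
    # This only takes effect if the model's attention goes through
    # torch.nn.functional.scaled_dot_product_attention (e.g. attn_implementation="sdpa").
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    (loss / ACCUM_STEPS).backward()  # accumulate gradients over ACCUM_STEPS micro-batches

    if (step + 1) % ACCUM_STEPS == 0:  # effective batch = ACCUM_STEPS * micro-batch size
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)
```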
Results & Findings
| Metric | Value |
|---|---|
| Training loss | 3.168 |
| Validation loss | 3.082 |
| Perplexity (validation) | 21.80 |
| Sample output | “काठमाडौंका प्रमुख समाचारहरू अनुसार, सरकारले नयाँ बजेट योजना घोषणा गर्यो…” (a fluent news‑style sentence; roughly: “According to major news from Kathmandu, the government announced a new budget plan…”) |
- Low perplexity indicates the model predicts Nepali tokens with confidence comparable to early GPT‑2 models on English (a quick loss‑to‑perplexity consistency check follows this list).
- Qualitative inspection shows the model respects Nepali syntax, correctly handles post‑positions, and produces appropriate honorifics—areas where prior encoder‑only models struggled.
- Training efficiency: FlashAttention reduced per‑step memory by ~30 % and cut wall‑clock time by ~15 % relative to vanilla attention.
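As that consistency check: validation perplexity is simply the exponentiated validation cross‑entropy loss, and exp(3.082) ≈ 21.8, which matches the 21.80 reported in the table.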
Practical Implications
- Content generation: Media outlets can prototype automated news briefs, summaries, or social‑media posts in Nepali without resorting to English‑to‑Nepali translation pipelines.
- Conversational agents: Chatbots and voice assistants built for Nepal can now rely on a generative backbone that produces natural‑sounding replies, improving user experience.
- Low‑resource fine‑tuning: Because the base model already captures Nepali morphology, downstream tasks (summarization, question answering) can be fine‑tuned with far fewer labeled examples than a multilingual LLM would need (see the hypothetical fine‑tuning sketch after this list).
- Open‑source ecosystem: The tokenizer and training scripts are lightweight enough to run on a single high‑end GPU, encouraging community contributions and domain‑specific extensions (e.g., legal or medical Nepali text).
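As a concrete illustration of the low‑resource fine‑tuning point, a small labeled Nepali dataset could be adapted with the standard Hugging Face Trainer. This is a hypothetical sketch: model, tokenizer (a transformers‑compatible wrapper around the Nepali BPE tokenizer), the dataset, and all hyperparameters are placeholders, since the summary does not describe released checkpoints or fine‑tuning settings.

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Causal-LM collator: labels are the input tokens, shifted inside the model.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="nepali-gpt-summarizer",   # placeholder run name
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=5e-5,
    num_train_epochs=3,
    fp16=True,
)

trainer = Trainer(
    model=model,                          # the pretrained Nepali GPT (placeholder)
    args=args,
    train_dataset=nepali_task_dataset,    # a few thousand tokenized examples (placeholder)
    data_collator=collator,
)
trainer.train()
```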
Limitations & Future Work
- Scale: The model is still a GPT‑2‑size network; larger architectures could push perplexity lower and improve long‑form coherence.
- Data diversity: Training data is dominated by news text; other domains (literature, informal social media) are under‑represented, which may limit style transfer.
- Evaluation breadth: The paper reports loss and perplexity but lacks human‑rated benchmarks for factuality, bias, or toxicity in Nepali.
- Future directions suggested by the authors include scaling to GPT‑3‑level parameter counts, incorporating multilingual code‑switching data (common in Nepal), and releasing a benchmark suite for Nepali generative tasks.
Authors
- Adarsha Shrestha
- Basanta Pokharel
- Binit Shrestha
- Smriti Adhikari
- Dinesh Gothe
Paper Information
- arXiv ID: 2512.14585v1
- Categories: cs.CL, cs.AI
- Published: December 16, 2025
- PDF: https://arxiv.org/pdf/2512.14585v1