[Paper] Building a Strong Instruction Language Model for a Less-Resourced Language

Published: March 2, 2026 at 05:21 AM EST
4 min read
Source: arXiv - 2603.01691v1

Overview

The authors introduce GaMS3‑12B, a 12‑billion‑parameter generative language model tailored for Slovene—a language that historically receives little attention from open‑source LLMs. By combining multilingual continual pre‑training with supervised fine‑tuning, they demonstrate that a relatively modest‑sized model can rival much larger commercial systems on Slovene tasks, opening the door for more inclusive AI tools.

Key Contributions

  • GaMS3‑12B: The first open‑source 12 B‑parameter model that consistently outperforms the original Gemma‑3 12 B on Slovene benchmarks.
  • Three‑stage continual pre‑training pipeline: Extends a high‑quality English base model (Gemma 3) to a multilingual corpus (Slovene + neighboring South‑Slavic languages) while preserving English capabilities.
  • Two‑stage supervised fine‑tuning (SFT): Leverages >200 k bilingual (English‑Slovene) instruction examples to teach the model how to follow prompts and generate coherent responses.
  • Comprehensive evaluation suite: Uses Slovenian‑LLM‑Eval, English‑to‑Slovene translation, and the Slovene LLM Arena to benchmark against both open‑source and commercial baselines (including GPT‑4o).
  • Open‑source release: The model weights, training scripts, and data processing pipelines are publicly released, encouraging community‑driven improvements.

Methodology

  1. Base Model Selection – The authors start from Gemma 3‑12B, a strong English‑centric LLM with robust zero‑shot abilities.
  2. Continual Pre‑training (3 stages)
    • Stage 1: Ingest ~140 B tokens from a multilingual mix (Slovene, English, Bosnian, Serbian, Croatian). This step teaches the model the target language’s vocabulary, syntax, and orthography.
    • Stage 2 & 3: Gradually increase the proportion of Slovene data while applying a lower learning rate to avoid catastrophic forgetting of English knowledge.
  3. Supervised Fine‑tuning (2 stages)
    • Stage 1 (Instruction Tuning): Train on ~200 k high‑quality English‑Slovene instruction–response pairs (e.g., “Translate X”, “Summarize Y”).
    • Stage 2 (Alignment): Fine‑tune with a small reward‑model‑style dataset to improve helpfulness, factuality, and adherence to user intent.
  4. Evaluation – The model is tested on three fronts: (a) Slovenian‑LLM‑Eval (a suite of classification, QA, and reasoning tasks), (b) English‑to‑Slovene translation (BLEU/ChrF scores), and (c) Slovene LLM Arena (pairwise human preference tests against other models).
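
To make the staged schedule concrete, here is a minimal sketch of how the data mixture and learning rate might evolve across the three continual pre‑training stages. All mixture ratios, per‑stage token budgets, and learning rates below are illustrative assumptions, not values reported in the paper:

```python
# A minimal sketch of the staged curriculum described above: each continual
# pre-training stage raises the Slovene share of the data mix and lowers the
# peak learning rate to limit catastrophic forgetting of English.
# NOTE: all mixture ratios, token budgets, and learning rates here are
# hypothetical illustrations, not values reported in the paper.

STAGES = [
    # (stage name, language mixture (fraction of tokens), peak learning rate)
    ("stage1", {"sl": 0.40, "en": 0.30, "hr": 0.10, "sr": 0.10, "bs": 0.10}, 2e-5),
    ("stage2", {"sl": 0.60, "en": 0.25, "hr": 0.05, "sr": 0.05, "bs": 0.05}, 1e-5),
    ("stage3", {"sl": 0.80, "en": 0.15, "hr": 0.02, "sr": 0.02, "bs": 0.01}, 5e-6),
]

def tokens_per_language(mixture: dict, stage_tokens: float) -> dict:
    """Split one stage's token budget across languages according to its mixture."""
    return {lang: share * stage_tokens for lang, share in mixture.items()}

for name, mixture, lr in STAGES:
    budget = tokens_per_language(mixture, stage_tokens=40e9)  # e.g. ~40 B tokens/stage
    pretty = {lang: f"{tok / 1e9:.1f}B" for lang, tok in budget.items()}
    print(f"{name}: lr={lr:g}, tokens={pretty}")
```

The same dictionary‑driven structure is what makes the pipeline easy to re‑target: swapping in a different language mix is a data change, not a code change.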

The pipeline is deliberately modular, allowing developers to swap in different base models or language mixes without redesigning the whole training loop.

Results & Findings

  • Slovenian‑LLM‑Eval (average score): +12 pts over Gemma 3‑12B; roughly comparable to GPT‑4o (within 5 pts)
  • EN→SL translation (BLEU): 31.4 vs. 27.8 for Gemma 3‑12B; ≈30 against GPT‑4o (slightly below it)
  • Slovene LLM Arena win‑rate: >60 % over Gemma 3; ≈60 % vs. GPT‑4o (GPT‑4o still leads overall)
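
The translation rows above are corpus‑level BLEU/ChrF numbers. As a rough illustration of what a character‑level score like ChrF measures, here is a toy, hand‑rolled character n‑gram F‑score; it is a simplification for intuition only, not the official metric implementation, and will not reproduce the reported scores:

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    """Character n-grams with whitespace removed (as base chrF does by default)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Toy chrF: average F-beta over character n-gram orders 1..max_n (0-100 scale)."""
    f_scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        precision = overlap / sum(h.values())
        recall = overlap / sum(r.values())
        if precision + recall == 0:
            f_scores.append(0.0)
            continue
        f_scores.append((1 + beta**2) * precision * recall
                        / (beta**2 * precision + recall))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0

print(chrf("dober dan, svet", "dober dan svet"))  # high score for near-identical strings
```

In practice one would use a standard library such as sacrebleu for both BLEU and ChrF rather than a hand‑rolled version, so that scores are comparable across papers.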

Key takeaways:

  • The multilingual continual pre‑training dramatically lifts Slovene performance without sacrificing English competence.
  • Even with only 12 B parameters, GaMS3‑12B comes close to the commercial GPT‑4o in head‑to‑head human preference tests on Slovene‑specific prompts, although GPT‑4o still leads overall.
  • The model shows strong zero‑shot translation and instruction‑following abilities, making it a viable drop‑in for Slovene‑centric applications.

Practical Implications

  • Localized AI Services: Companies can now embed a capable Slovene LLM into chatbots, help desks, or content‑generation pipelines without paying for expensive API calls to proprietary models.
  • Cross‑Lingual Tools: The bilingual instruction set makes GaMS3‑12B an effective bridge for translation, summarization, and data‑annotation tasks that involve Slovene and English.
  • Resource‑Efficient Development: At 12 B parameters, the model fits on a single high‑end GPU (or a modest multi‑GPU node), enabling smaller startups or research labs to fine‑tune further for domain‑specific needs (e.g., legal, medical Slovene).
  • Template for Other Low‑Resource Languages: The three‑stage pre‑training + two‑stage SFT recipe can be replicated for languages with similar data scarcity, accelerating the democratization of LLM capabilities.
  • Open‑Source Ecosystem Growth: By releasing the model and scripts, the authors invite community contributions—e.g., adding more Slovene instruction data, integrating LoRA adapters, or building evaluation suites for other languages.
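
Several of the points above mention LoRA adapters as the cheap route to domain adaptation. A quick back‑of‑the‑envelope calculation shows why they are attractive at this scale; the hidden size, layer count, and rank below are hypothetical stand‑ins, not the actual Gemma 3 12B configuration:

```python
# Back-of-the-envelope LoRA sizing: a rank-r adapter on a d_in x d_out weight
# adds A (d_in x r) and B (r x d_out), i.e. r * (d_in + d_out) trainable params.
# NOTE: hidden size, layer count, and rank below are hypothetical, not the
# actual Gemma 3 12B configuration.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter pair."""
    return rank * (d_in + d_out)

hidden = 3840            # assumed hidden size
n_layers = 48            # assumed number of transformer layers
proj_per_layer = 4       # e.g. q/k/v/o attention projections
rank = 16                # assumed LoRA rank

trainable = n_layers * proj_per_layer * lora_params(hidden, hidden, rank)
print(f"~{trainable / 1e6:.1f}M trainable params vs ~12,000M for full fine-tuning")
```

Even under these rough assumptions, the adapter touches well under 1 % of the model's weights, which is what makes single‑GPU domain fine‑tuning (legal, medical Slovene, etc.) realistic for small labs.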

Limitations & Future Work

  • Data Quality & Bias: The multilingual corpus includes web‑scraped text, which may contain noise, outdated terminology, or cultural biases that could surface in generated outputs.
  • Domain Coverage: While general instruction performance is strong, specialized domains (e.g., technical documentation, scientific literature) were not explicitly fine‑tuned, potentially limiting accuracy.
  • Evaluation Scope: Human preference tests were limited to a subset of prompts; broader, longer‑form interactions could reveal different strengths/weaknesses.
  • Scalability: The approach has been validated at 12 B parameters; it remains an open question how well the same pipeline scales to smaller (e.g., 3 B) or larger (e.g., 30 B) models for Slovene.

Future work suggested by the authors includes expanding the instruction dataset with community‑sourced Slovene examples, applying parameter‑efficient fine‑tuning (e.g., LoRA, adapters) for rapid domain adaptation, and extending the methodology to other under‑represented languages in the Balkans and beyond.

Authors

  • Domen Vreš
  • Tjaša Arčon
  • Timotej Petrič
  • Dario Vajda
  • Marko Robnik-Šikonja
  • Iztok Lebar Bajec

Paper Information

  • arXiv ID: 2603.01691v1
  • Categories: cs.CL, cs.LG
  • Published: March 2, 2026