[Paper] BERnaT: Basque Encoders for Representing Natural Textual Diversity
Source: arXiv - 2512.03903v1
Overview
The paper introduces BERnaT, a family of Basque language encoders that are deliberately trained on a mix of standard, historical, and social‑media text. By doing so, the authors demonstrate that language models can become more robust and inclusive, handling dialectal and informal variations without sacrificing performance on traditional benchmarks.
Key Contributions
- Diverse Corpus Construction – Combined three sources (contemporary standard text, historical documents, and social‑media posts) to create a richer Basque training set.
- Three Model Variants – Trained encoder‑only models on (i) only standard data, (ii) only diverse data, and (iii) a combined mix, enabling direct comparison.
- Evaluation Split – Proposed a novel benchmark split that separates NLU tasks into standard and diverse subsets, making it easy to measure how well a model generalizes across linguistic varieties.
- Empirical Evidence – Showed that models exposed to both standard and diverse data consistently outperform those trained on standard data alone, across all task categories.
- Open‑source Release – Made the corpora, pretrained checkpoints, and evaluation scripts publicly available for the community.
Methodology
Data Gathering
- Standard: Contemporary Basque news articles and Wikipedia.
- Historical: Digitized books and newspapers dating back to the 19th century.
- Social Media: Posts from platforms such as Twitter and Reddit, capturing slang, dialects, and code‑switching.
- Preprocessing: All texts were cleaned, deduplicated, and tokenized with a shared subword vocabulary (a minimal pipeline sketch follows).
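The summary ships no code, so the snippet below is only a minimal sketch of such a preprocessing pipeline. The sample documents, the exact‑match deduplication strategy, and the 50k vocabulary size are illustrative assumptions, not details taken from the paper.

```python
import hashlib
import re

# Hypothetical in-memory corpora; in practice each source would be
# streamed from disk. The documents below are placeholders.
standard_docs = ["Euskal Herriko albisteak ..."]   # contemporary news / Wikipedia
historical_docs = ["XIX. mendeko testua ..."]      # digitized 19th-century texts
social_docs = ["aupa!! gaur zer moduz"]            # tweets / forum posts

def clean(text: str) -> str:
    """Collapse runs of whitespace and strip surrounding blanks."""
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(docs):
    """Drop exact duplicates via SHA-256 content hashing."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = deduplicate(clean(d) for d in standard_docs + historical_docs + social_docs)

# One shared subword vocabulary trained over all three sources; the 50k
# size is an assumption -- the summary does not state the actual value.
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train_from_iterator(corpus, vocab_size=50_000)
```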
Model Architecture
- Used a standard Transformer encoder (12 layers, hidden size 768), comparable to BERT‑base.
- Trained three configurations: BERnaT‑Std (standard data only), BERnaT‑Div (diverse data only), and BERnaT‑All (standard + diverse); a configuration sketch follows.
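As an illustration, the stated dimensions map onto a standard HuggingFace BertConfig. Only depth and hidden size come from the paper; the head count, feed‑forward size, and vocabulary size below are BERT‑base defaults and assumptions.

```python
from transformers import BertConfig, BertForMaskedLM

# hidden_size and num_hidden_layers are from the paper; the remaining
# hyperparameters are BERT-base defaults, assumed here.
config = BertConfig(
    vocab_size=50_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
)
model = BertForMaskedLM(config)
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")  # roughly BERT-base scale
```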
Training Regimen
- Masked language modeling (MLM) objective with a 15% token‑masking rate.
- Trained for 1M steps on 8 A100 GPUs, using mixed‑precision training to speed up convergence (a masking sketch follows).
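A minimal PyTorch sketch of 15% MLM masking. The 80/10/10 replacement split is the standard BERT convention, assumed here because the summary states only the overall masking rate.

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """Select 15% of tokens as MLM targets; of those, replace 80% with
    [MASK], 10% with a random token, and leave 10% unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    masked = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~masked] = -100  # positions ignored by the cross-entropy loss

    # 80% of selected positions -> [MASK]
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[to_mask] = mask_token_id

    # Half of the remaining 20% -> random token (i.e., 10% overall)
    to_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, labels.shape)[to_random]

    return input_ids, labels
```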
Evaluation Framework
- Selected a suite of Basque NLU tasks (sentiment analysis, named‑entity recognition, question answering, etc.).
- For each task, created a standard test set (derived from the same source as the standard corpus) and a diverse test set (drawn from historical/social‑media data).
- Reported macro‑F1 or exact‑match scores, depending on the task (see the scoring sketch below).
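A minimal sketch of the per‑split scoring idea, assuming a classification task scored with macro‑F1; the function and data names are hypothetical, not from the paper.

```python
from sklearn.metrics import f1_score

def evaluate_splits(predict, splits):
    """Score one fine-tuned model on each test split separately.

    `predict` is any callable mapping a list of texts to predicted labels;
    `splits` maps a split name to (texts, gold_labels). Both are hypothetical.
    """
    return {name: f1_score(gold, predict(texts), average="macro")
            for name, (texts, gold) in splits.items()}

# e.g. evaluate_splits(clf.predict, {"standard": std_test, "diverse": div_test})
# reports macro-F1 per variety, exposing generalization gaps directly.
```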
Results & Findings
| Model | Standard Test Avg. | Diverse Test Avg. | Δ vs. BERnaT‑Std (diverse test) |
|---|---|---|---|
| BERnaT‑Std | 84.2 % | 68.5 % | – |
| BERnaT‑Div | 81.7 % | 73.9 % | +5.4 pts |
| BERnaT‑All | 85.1 % | 77.2 % | +8.7 pts |
- The combined‑data model improves diverse‑test performance by ~9 points (68.5 % → 77.2 %) while also slightly improving standard‑test accuracy (84.2 % → 85.1 %).
- The gains are consistent across tasks: sentiment analysis on tweets jumps from 66 % to 78 % F1, and historical NER improves from 71 % to 80 % F1.
- No trade‑off is observed for the combined model; it does not overfit to noisy social‑media text, thanks to the balanced training mix.
Practical Implications
- More Inclusive Applications – Chatbots, search, and moderation tools built on BERnaT can understand regional dialects and informal language, reducing user friction for speakers outside the “standard” norm.
- Low‑Resource Transfer – The approach shows that even for a language with limited data, adding diverse, noisy sources yields tangible benefits, suggesting a recipe for other under‑represented languages.
- Robustness to Domain Shift – Deployments that encounter out‑of‑distribution text (e.g., user‑generated content) will likely see fewer failures, lowering maintenance costs.
- Open‑source Toolkit – Developers can fine‑tune the released checkpoints on downstream tasks without collecting and cleaning massive corpora themselves (see the sketch below).
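A minimal fine‑tuning sketch using HuggingFace Transformers. The checkpoint path, label count, and dataset names are placeholders, not the authors' actual release identifiers.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "path/to/bernat-all"  # placeholder; substitute the released checkpoint id

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# `train_ds` / `dev_ds` stand in for HuggingFace datasets with "text"
# and "label" columns for the downstream task of interest:
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="bernat-finetuned", num_train_epochs=3),
#     train_dataset=train_ds.map(tokenize, batched=True),
#     eval_dataset=dev_ds.map(tokenize, batched=True),
#     tokenizer=tokenizer,  # enables dynamic padding via the default collator
# )
# trainer.train()
```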
Limitations & Future Work
- Data Quality Variance – Social‑media text contains spelling errors and code‑switching that may still bias the model toward dominant dialects.
- Scale – Experiments were limited to a BERT‑base‑size model; it remains unclear how the findings scale to larger architectures.
- Evaluation Breadth – The benchmark focuses on a handful of NLU tasks; generative or dialogue‑oriented evaluations are left for future research.
- Cross‑Language Generalization – While promising for Basque, the authors note that replicating the pipeline for typologically different languages (e.g., agglutinative vs. fusional) warrants further study.
Authors
- Ekhi Azurmendi
- Joseba Fernandez de Landa
- Jaione Bengoetxea
- Maite Heredia
- Julen Etxaniz
- Mikel Zubillaga
- Ander Soraluze
- Aitor Soroa
Paper Information
- arXiv ID: 2512.03903v1
- Categories: cs.CL, cs.AI
- Published: December 3, 2025
- PDF: https://arxiv.org/pdf/2512.03903v1