[Paper] Classifying several dialectal Nawatl varieties
Source: arXiv - 2601.02303v1
Overview
The paper tackles a surprisingly under‑explored problem in natural‑language processing: automatically distinguishing among the many dialectal varieties of Nawatl, an indigenous Mexican language spoken by over two million people. By building and evaluating machine‑learning classifiers (including neural networks) on a newly assembled corpus of Nawatl texts, the authors demonstrate that computational methods can reliably identify dialectal differences—opening the door to better language‑technology support for a historically marginalized linguistic community.
Key Contributions
- First large‑scale dialect classification dataset for Nawatl – the authors collected, cleaned, and annotated texts from roughly 30 recognized varieties, handling orthographic variation and scarce resources.
- Benchmark ML & neural‑network models – they compare traditional classifiers (SVM, Random Forest) with modern deep‑learning approaches (CNNs, Bi‑LSTMs, transformer‑based encoders) on the dialect identification task.
- Feature engineering for low‑resource languages – the study evaluates character‑n‑grams, phoneme‑level representations, and subword embeddings (Byte‑Pair Encoding) tailored to Nawatl’s morphophonology.
- Error analysis linking linguistic traits to model confusion – the authors map misclassifications to known linguistic similarities (e.g., shared vowel harmony or lexical borrowing), providing insights for future linguistic work.
- Open‑source release – code, pre‑processed data splits, and trained models are made publicly available, encouraging reproducibility and further research on Nawatl and other under‑resourced languages.
Methodology
- Data collection & preprocessing
  - Texts were gathered from online archives, community newsletters, and transcribed oral recordings.
  - Each document was tagged with its reported dialect (e.g., Huasteca, Sierra Norte, Central Puebla).
  - Orthographic normalization was performed with a rule‑based mapper to reduce spelling noise while preserving dialect‑specific phonetic cues (a minimal mapper sketch follows this list).
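To make the normalization step concrete, here is a minimal sketch of a rule‑based orthographic mapper. The rewrite rules are illustrative assumptions drawn from well‑known alternations between classical and modern Nahuatl spelling conventions (hu/w, qu/k, c/s); the paper's actual rule set is not reproduced here.

```python
import re

# Illustrative rewrite rules (NOT the paper's rule set). Order matters:
# digraphs are rewritten before single characters.
NORMALIZATION_RULES = [
    (re.compile(r"hu(?=[aeio])"), "w"),  # <hua->        vs <wa->
    (re.compile(r"qu(?=[ei])"), "k"),    # <que->        vs <ke->
    (re.compile(r"c(?=[ao])"), "k"),     # <ca-> / <co-> vs <ka-> / <ko->
    (re.compile(r"c(?=[ei])"), "s"),     # <ce-> / <ci-> vs <se-> / <si->
    (re.compile(r"z"), "s"),             # <z>           vs <s>
]

def normalize(text: str) -> str:
    """Map spelling variants to a single canonical form."""
    text = text.lower()
    for pattern, replacement in NORMALIZATION_RULES:
        text = pattern.sub(replacement, text)
    return text

print(normalize("Nahuatl"))  # -> "nawatl"
```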
- Feature extraction
  - Character‑level n‑grams (n = 3-5) to capture orthographic patterns.
  - Subword units via Byte‑Pair Encoding (BPE) to handle agglutinative morphology.
  - Phoneme‑level transcriptions generated with a lightweight grapheme‑to‑phoneme model, allowing the system to learn sound‑based distinctions; the first two feature types are sketched below.
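As a concrete illustration of the first two feature types, the sketch below extracts character n‑gram features with scikit‑learn and trains a small BPE model with the sentencepiece library. The toy sentences, corpus file name, and vocabulary size are placeholders, not values from the paper.

```python
import sentencepiece as spm
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences standing in for the Nawatl corpus (illustrative only).
texts = ["nikneki se amoxtli", "nicnequi ce amoxtli"]

# Character n-grams of length 3-5, as in the paper; TF-IDF weighting and
# the "char_wb" analyzer (n-grams within word boundaries) are assumptions.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(texts)
print(X.shape)  # (2, number of distinct character n-grams)

# BPE subword units; "nawatl_corpus.txt" and vocab_size=8000 are placeholders.
spm.SentencePieceTrainer.train(
    input="nawatl_corpus.txt", model_prefix="nawatl_bpe",
    vocab_size=8000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="nawatl_bpe.model")
print(sp.encode("nikneki", out_type=str))  # subword pieces for one word
```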
- Model suite
  - Baseline: Linear SVM and Random Forest on TF‑IDF vectors.
  - CNN: 1‑D convolution over character embeddings, followed by max‑pooling.
  - Bi‑LSTM: sequential modeling of subword embeddings to capture long‑range dependencies.
  - Transformer encoder: XLM‑R fine‑tuned on the Nawatl corpus, leveraging multilingual pre‑training (a fine‑tuning sketch follows).
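A minimal fine‑tuning step for the strongest model, using the Hugging Face transformers API. The checkpoint size (base vs. large), learning rate, and label count are assumptions; 30 labels are used here to match the "roughly 30 varieties" figure above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_DIALECTS = 30  # assumption, based on the "roughly 30 varieties" figure

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=NUM_DIALECTS)

# Toy batch: one sentence and its dialect index (illustrative only).
texts = ["nikneki se amoxtli"]
labels = torch.tensor([0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
optimizer.zero_grad()
out = model(**batch, labels=labels)  # cross-entropy loss over dialect labels
out.loss.backward()
optimizer.step()
print(float(out.loss))
```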
- Training & evaluation
  - Stratified 5‑fold cross‑validation to respect the imbalanced distribution of dialects.
  - Primary metric: macro‑averaged F1‑score (to weight all dialects equally).
  - Additional analysis: confusion matrices, per‑dialect precision/recall, and ablation studies on feature sets (the cross‑validation loop is sketched below).
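A sketch of the evaluation protocol, pairing the TF‑IDF/SVM baseline with stratified 5‑fold cross‑validation and macro‑F1. The baseline stands in for any model in the suite; texts and labels are placeholders for the corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def cross_validate(texts, labels, n_splits=5, seed=42):
    """Stratified k-fold CV reporting mean macro-F1."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(texts, labels):
        clf = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
            LinearSVC())
        clf.fit(texts[train_idx], labels[train_idx])
        preds = clf.predict(texts[test_idx])
        # Macro-F1 gives each dialect equal weight, so rare varieties
        # count as much as well-resourced ones.
        scores.append(f1_score(labels[test_idx], preds, average="macro"))
    return float(np.mean(scores))
```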
Results & Findings
| Model | Macro‑F1 | Accuracy |
|---|---|---|
| Linear SVM (TF‑IDF) | 0.62 | 68 % |
| Random Forest (char‑ngrams) | 0.65 | 71 % |
| CNN (char‑embeddings) | 0.73 | 78 % |
| Bi‑LSTM (BPE) | 0.77 | 81 % |
| XLM‑R (fine‑tuned) | 0.84 | 88 % |
- The transformer‑based model outperformed all others, confirming that multilingual pre‑training can be transferred even to a language with minimal digital presence.
- Character‑level features alone already yielded respectable performance, highlighting the strong orthographic cues that differentiate dialects.
- Error analysis revealed that the most confused pairs were geographically adjacent varieties (e.g., Huasteca vs. Sierra Norte), which aligns with known linguistic continua.
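The confusion‑pair analysis described above can be reproduced in a few lines. The helper below is a hypothetical sketch, not code from the paper's release: it zeroes the confusion‑matrix diagonal and ranks off‑diagonal cells to surface the most‑confused dialect pairs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def most_confused_pairs(y_true, y_pred, dialects, top_k=5):
    """Return the top_k (count, true dialect, predicted dialect) triples."""
    cm = confusion_matrix(y_true, y_pred, labels=dialects)
    np.fill_diagonal(cm, 0)  # drop correct predictions
    pairs = [(int(cm[i, j]), dialects[i], dialects[j])
             for i in range(len(dialects))
             for j in range(len(dialects)) if i != j]
    return sorted(pairs, reverse=True)[:top_k]

# e.g. most_confused_pairs(y_true, y_pred, ["Huasteca", "Sierra Norte"])
```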
Practical Implications
- Dialect‑aware language tools – Spell‑checkers, predictive keyboards, and speech‑recognition systems can now adapt to the specific variety a user speaks, improving usability for Nawatl speakers.
- Digital preservation – Automated tagging of archival texts by dialect facilitates the organization of cultural heritage collections and supports community‑led revitalization projects.
- Cross‑dialect NLP pipelines – Machine‑translation, sentiment analysis, or information retrieval systems can incorporate dialect identification as a preprocessing step, reducing error propagation.
- Template for other low‑resource languages – The workflow (data gathering, orthographic normalization, subword modeling) provides a reproducible blueprint for developers working on other indigenous or endangered languages with multiple dialects.
Limitations & Future Work
- Data sparsity – Some dialects are represented by only a handful of documents, limiting the model’s ability to generalize; future work should explore data‑augmentation or few‑shot learning techniques.
- Orthographic standardization – While the authors applied a normalization pipeline, the lack of a universally accepted writing system for Nawatl means that some dialect‑specific orthographic signals may have been unintentionally erased.
- Speech modality – The study focuses exclusively on written text; extending the approach to audio (dialect‑aware ASR) would broaden real‑world applicability.
- Explainability – Deeper linguistic probing (e.g., attention analysis) could reveal which phonological or morphological features drive classification, offering feedback to linguists and community members.
By demonstrating that modern NLP methods can reliably differentiate among Nawatl dialects, this research paves the way for more inclusive, culturally aware language technologies—an essential step toward digital equity for indigenous language communities.
Authors
- Juan-José Guzmán-Landa
- Juan-Manuel Torres-Moreno
- Miguel Figueroa-Saavedra
- Carlos-Emiliano González-Gallardo
- Graham Ranger
- Martha Lorena-Avendaño-Garrido
Paper Information
- arXiv ID: 2601.02303v1
- Categories: cs.CL
- Published: January 5, 2026