[Paper] Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

Published: February 17, 2026

Source: arXiv - 2602.15753v1

Overview

This paper explores whether today’s large language models (LLMs) can jump‑start core NLP tasks—lemmatization and part‑of‑speech (POS) tagging—for languages that have almost no digital resources. The authors test GPT‑4‑style models and the open‑weight Mistral family on four historically important but under‑documented languages (Ancient Greek, Classical Armenian, Old Georgian, and Syriac) and find that, even without any fine‑tuning, the models often match or beat a dedicated RNN baseline.

Key Contributions

  • First systematic benchmark for lemmatization and POS‑tagging across four historically under‑resourced languages, with aligned training and out‑of‑domain test sets.
  • Zero‑shot and few‑shot evaluation of both closed‑source (GPT‑4 variants) and open‑source (Mistral) LLMs on these tasks.
  • Empirical evidence that LLMs can serve as strong “annotation assistants” for languages lacking annotated corpora, often surpassing a task‑specific RNN baseline (PIE).
  • Error analysis that pinpoints where morphology complexity and non‑Latin scripts still trip up the models.
  • Open‑source release of the benchmark data and prompting scripts, enabling reproducibility and further research.

Methodology

  1. Data preparation – The authors compiled parallel corpora for each language: a modest “training” slice (used only for few‑shot prompts) and a separate, out‑of‑domain test set to gauge generalisation.
  2. Prompt design – For few‑shot experiments, they supplied the LLM with 5–10 hand‑picked examples of word‑form → lemma / POS pairs, formatted as plain text. Zero‑shot runs received only a concise task description.
  3. Model selection – Experiments covered:
    • GPT‑4‑Turbo and GPT‑4‑Vision (via OpenAI API)
    • Mistral‑7B‑Instruct and a fine‑tuned Mistral‑7B‑Chat variant (open weights)
  4. Evaluation metrics – Lemmatization accuracy (exact match) and POS‑tagging F1 (macro‑averaged) were computed against gold annotations. Results were compared to the PIE RNN baseline, which was trained on the same limited data.
  5. Error categorisation – Mis‑predictions were grouped by morphological phenomena (e.g., inflectional suffixes, clitics) and script‑related issues (Unicode normalization, diacritics).
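The few-shot prompt format in step 2 can be sketched as a small helper. This is a minimal illustration of "word-form → lemma / POS pairs formatted as plain text"; the instruction wording and the Greek exemplars are assumptions for demonstration, not the authors' actual prompts.

```python
# Sketch of a plain-text few-shot prompt for lemmatization + POS tagging.
# The instruction line and example triples are illustrative assumptions.
def build_prompt(examples, target_word):
    """Format (word, lemma, pos) exemplars as plain text, ending with the query."""
    lines = ["Give the lemma and POS tag for the final word."]
    for word, lemma, pos in examples:
        lines.append(f"{word} -> lemma: {lemma}, POS: {pos}")
    lines.append(f"{target_word} -> ")
    return "\n".join(lines)

examples = [
    ("λόγου", "λόγος", "NOUN"),   # Ancient Greek genitive singular
    ("ἔλεγεν", "λέγω", "VERB"),   # imperfect active, 3rd person singular
]
prompt = build_prompt(examples, "ἀνθρώπων")
print(prompt)
```

The completion the model returns for the trailing `-> ` is then parsed back into a lemma/tag pair.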

Results & Findings

| Language | Task | GPT‑4 (few‑shot) | Mistral‑7B (few‑shot) | PIE baseline |
|---|---|---|---|---|
| Ancient Greek | Lemma | 92.1 % | 88.4 % | 84.7 % |
| Ancient Greek | POS | 96.3 % | 94.8 % | 92.1 % |
| Classical Armenian | Lemma | 89.6 % | 90.2 % | 85.3 % |
| Classical Armenian | POS | 95.0 % | 93.7 % | 90.8 % |
| Old Georgian | Lemma | 78.4 % | 80.1 % | 71.5 % |
| Old Georgian | POS | 88.9 % | 86.5 % | 82.2 % |
| Syriac | Lemma | 84.7 % | 81.3 % | 77.0 % |
| Syriac | POS | 90.2 % | 91.5 % | 86.4 % |
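The two metrics behind these numbers, exact-match lemmatization accuracy and macro-averaged POS F1, can be computed as follows. This is a minimal from-scratch sketch on toy data, not the authors' evaluation code.

```python
# Exact-match accuracy and macro-averaged F1, as described in the
# evaluation setup. Toy gold/predicted labels, not data from the paper.
from collections import defaultdict

def lemma_accuracy(gold, pred):
    """Fraction of predicted lemmas that exactly match the gold lemma."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Per-label F1, averaged uniformly over all labels seen in gold or pred."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1s = []
    for lab in set(gold) | set(pred):
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold_pos = ["NOUN", "VERB", "NOUN", "ADJ"]
pred_pos = ["NOUN", "VERB", "ADJ", "ADJ"]
print(macro_f1(gold_pos, pred_pos))
```

Macro averaging weights every tag equally, so rare tags count as much as frequent ones, which is a sensible choice for small historical corpora with skewed tag distributions.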

Take‑aways

  • Few‑shot prompting consistently outperforms the RNN baseline, even when only a handful of examples are supplied.
  • GPT‑4 leads on Ancient Greek, while Mistral narrows the gap (and occasionally pulls ahead) on Armenian, Georgian, and Syriac, whose scripts demand more careful Unicode handling.
  • Zero‑shot performance is markedly lower, confirming that a minimal set of exemplars is crucial for these tasks.
  • The biggest error clusters involve complex inflectional chains (e.g., stacked suffixes in Georgian) and script‑specific tokenisation (Syriac ligatures), indicating where future model improvements should focus.

Practical Implications

  • Rapid corpus bootstrapping – Developers can use an LLM as a first‑pass annotator to generate lemmas and POS tags for digitised manuscripts, saving weeks of manual work.
  • Low‑cost pipeline – Since no fine‑tuning is required, teams can leverage existing API access (or open‑source models) to enrich historical text collections without building language‑specific models from scratch.
  • Tool integration – The prompting scripts can be wrapped into annotation platforms (e.g., INCEpTION, Prodigy) to provide on‑the‑fly suggestions that human annotators can accept or correct, creating a virtuous feedback loop.
  • Cross‑lingual transfer – The success across unrelated language families suggests that LLMs can serve as a universal “linguistic back‑stop” for any low‑resource language, including modern endangered languages lacking digital corpora.
  • Open‑source democratization – By releasing the benchmark and prompts, the authors enable NGOs, digital humanities labs, and small startups to experiment without large data‑collection budgets.

Limitations & Future Work

  • Script handling – Non‑Latin scripts still cause tokenisation mismatches; better Unicode normalization or script‑aware tokenisers could improve results.
  • Morphological depth – Extremely agglutinative or polysynthetic patterns (not covered in the four languages) remain challenging for current LLMs.
  • Zero‑shot gap – The models rely on a few examples; fully zero‑shot performance is insufficient for production use.
  • Evaluation scope – The benchmark focuses on lemmatization and POS; extending to dependency parsing, named‑entity recognition, or semantic role labeling would test the limits of LLMs further.
  • Resource constraints – While open‑source Mistral models are cheaper than GPT‑4, inference latency and memory footprints may still be prohibitive for large‑scale digitisation projects; model distillation or quantisation could be explored.
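The Unicode-normalization issue flagged under "Script handling" is concrete: visually identical strings can differ at the codepoint level (precomposed vs. combining diacritics), which silently breaks exact-match evaluation. A minimal sketch with Python's standard `unicodedata` module:

```python
# Precomposed vs. decomposed forms of the same Greek word differ as
# codepoint sequences; normalizing both sides before comparison fixes it.
import unicodedata

precomposed = "λόγος"                            # ό as a single precomposed codepoint
decomposed = unicodedata.normalize("NFD", precomposed)  # ο + combining acute

assert precomposed != decomposed                 # different codepoints...
assert unicodedata.normalize("NFC", decomposed) == precomposed  # ...same text under NFC

def normalize_match(gold, pred, form="NFC"):
    """Exact match after normalizing both strings to the same Unicode form."""
    return unicodedata.normalize(form, gold) == unicodedata.normalize(form, pred)

print(normalize_match(precomposed, decomposed))  # True
```

Applying a single normalization form (NFC or NFD) to both gold and predicted strings before scoring avoids penalizing a model for codepoint-level differences that no reader can see.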

Bottom line: This study shows that modern LLMs are already powerful enough to act as “smart annotators” for languages that have historically been left out of the NLP map. For developers building pipelines around historical texts or endangered language resources, a few well‑chosen examples can unlock high‑quality lemmatization and POS tagging without the overhead of training bespoke models.

Authors

  • Chahan Vidal‑Gorène
  • Bastien Kindt
  • Florian Cafiero

Paper Information

  • arXiv ID: 2602.15753v1
  • Categories: cs.CL
  • Published: February 17, 2026