[Paper] Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac

Published: February 17, 2026

Source: arXiv - 2602.15753v1

Overview

This paper explores whether today’s large language models (LLMs) can jump‑start core NLP tasks—lemmatization and part‑of‑speech (POS) tagging—for languages that have almost no digital resources. The authors test GPT‑4‑style models and the open‑weight Mistral family on four historically important but under‑documented languages (Ancient Greek, Classical Armenian, Old Georgian, and Syriac) and find that, even without any fine‑tuning, the models often match or beat a dedicated RNN baseline.

Key Contributions

  • First systematic benchmark for lemmatization and POS‑tagging across four historically under‑resourced languages, with aligned training and out‑of‑domain test sets.
  • Zero‑shot and few‑shot evaluation of both closed‑source (GPT‑4 variants) and open‑source (Mistral) LLMs on these tasks.
  • Empirical evidence that LLMs can serve as strong “annotation assistants” for languages lacking annotated corpora, often surpassing a task‑specific RNN baseline (PIE).
  • Error analysis that pinpoints where morphology complexity and non‑Latin scripts still trip up the models.
  • Open‑source release of the benchmark data and prompting scripts, enabling reproducibility and further research.

Methodology

  1. Data preparation – The authors compiled parallel corpora for each language: a modest “training” slice (used only for few‑shot prompts) and a separate, out‑of‑domain test set to gauge generalisation.
  2. Prompt design – For few‑shot experiments, they supplied the LLM with 5–10 hand‑picked examples of word‑form → lemma / POS pairs, formatted as plain text. Zero‑shot runs received only a concise task description.
  3. Model selection – Experiments covered:
    • GPT‑4‑Turbo and GPT‑4‑Vision (via OpenAI API)
    • Mistral‑7B‑Instruct and a fine‑tuned Mistral‑7B‑Chat variant (open weights)
  4. Evaluation metrics – Lemmatization accuracy (exact match) and POS‑tagging F1 (macro‑averaged) were computed against gold annotations. Results were compared to the PIE RNN baseline, which was trained on the same limited data.
  5. Error categorisation – Mis‑predictions were grouped by morphological phenomena (e.g., inflectional suffixes, clitics) and script‑related issues (Unicode normalization, diacritics).
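The few-shot prompt format in step 2 can be sketched as a small helper. This is a minimal illustration of "word-form → lemma / POS pairs formatted as plain text"; the instruction wording and the Greek exemplars are assumptions for demonstration, not the authors' actual prompts.

```python
# Sketch of a plain-text few-shot prompt for lemmatization + POS tagging.
# The instruction line and example triples are illustrative assumptions.
def build_prompt(examples, target_word):
    """Format (word, lemma, pos) exemplars as plain text, ending with the query."""
    lines = ["Give the lemma and POS tag for the final word."]
    for word, lemma, pos in examples:
        lines.append(f"{word} -> lemma: {lemma}, POS: {pos}")
    lines.append(f"{target_word} -> ")
    return "\n".join(lines)

examples = [
    ("λόγου", "λόγος", "NOUN"),   # Ancient Greek genitive singular
    ("ἔλεγεν", "λέγω", "VERB"),   # imperfect active, 3rd person singular
]
prompt = build_prompt(examples, "ἀνθρώπων")
print(prompt)
```

The completion the model returns for the trailing `-> ` is then parsed back into a lemma/tag pair.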

Results & Findings

| Language | Task | GPT‑4 (few‑shot) | Mistral‑7B (few‑shot) | PIE baseline |
|---|---|---|---|---|
| Ancient Greek | Lemma | 92.1 % | 88.4 % | 84.7 % |
| Ancient Greek | POS | 96.3 % | 94.8 % | 92.1 % |
| Classical Armenian | Lemma | 89.6 % | 90.2 % | 85.3 % |
| Classical Armenian | POS | 95.0 % | 93.7 % | 90.8 % |
| Old Georgian | Lemma | 78.4 % | 80.1 % | 71.5 % |
| Old Georgian | POS | 88.9 % | 86.5 % | 82.2 % |
| Syriac | Lemma | 84.7 % | 81.3 % | 77.0 % |
| Syriac | POS | 90.2 % | 91.5 % | 86.4 % |
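The two metrics behind these numbers, exact-match lemmatization accuracy and macro-averaged POS F1, can be computed as follows. This is a minimal from-scratch sketch on toy data, not the authors' evaluation code.

```python
# Exact-match accuracy and macro-averaged F1, as described in the
# evaluation setup. Toy gold/predicted labels, not data from the paper.
from collections import defaultdict

def lemma_accuracy(gold, pred):
    """Fraction of predicted lemmas that exactly match the gold lemma."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Per-label F1, averaged uniformly over all labels seen in gold or pred."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1s = []
    for lab in set(gold) | set(pred):
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold_pos = ["NOUN", "VERB", "NOUN", "ADJ"]
pred_pos = ["NOUN", "VERB", "ADJ", "ADJ"]
print(macro_f1(gold_pos, pred_pos))
```

Macro averaging weights every tag equally, so rare tags count as much as frequent ones, which is a sensible choice for small historical corpora with skewed tag distributions.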

Take‑aways

  • Few‑shot prompting consistently outperforms the RNN baseline, even when only a handful of examples are supplied.
  • GPT‑4 leads on Ancient Greek, while Mistral narrows the gap (and occasionally pulls ahead) on Armenian, Georgian, and Syriac, whose scripts demand more careful Unicode handling.
  • Zero‑shot performance is markedly lower, confirming that a minimal set of exemplars is crucial for these tasks.
  • The biggest error clusters involve complex inflectional chains (e.g., stacked suffixes in Georgian) and script‑specific tokenisation (Syriac ligatures), indicating where future model improvements should focus.

Practical Implications

  • Rapid corpus bootstrapping – Developers can use an LLM as a first‑pass annotator to generate lemmas and POS tags for digitised manuscripts, saving weeks of manual work.
  • Low‑cost pipeline – Since no fine‑tuning is required, teams can leverage existing API access (or open‑source models) to enrich historical text collections without building language‑specific models from scratch.
  • Tool integration – The prompting scripts can be wrapped into annotation platforms (e.g., INCEpTION, Prodigy) to provide on‑the‑fly suggestions that human annotators can accept or correct, creating a virtuous feedback loop.
  • Cross‑lingual transfer – The success across unrelated language families suggests that LLMs can serve as a universal “linguistic back‑stop” for any low‑resource language, including modern endangered languages lacking digital corpora.
  • Open‑source democratization – By releasing the benchmark and prompts, the authors enable NGOs, digital humanities labs, and small startups to experiment without large data‑collection budgets.

Limitations & Future Work

  • Script handling – Non‑Latin scripts still cause tokenisation mismatches; better Unicode normalization or script‑aware tokenisers could improve results.
  • Morphological depth – Extremely agglutinative or polysynthetic patterns (not covered in the four languages) remain challenging for current LLMs.
  • Zero‑shot gap – The models rely on a few examples; fully zero‑shot performance is insufficient for production use.
  • Evaluation scope – The benchmark focuses on lemmatization and POS; extending to dependency parsing, named‑entity recognition, or semantic role labeling would test the limits of LLMs further.
  • Resource constraints – While open‑source Mistral models are cheaper than GPT‑4, inference latency and memory footprints may still be prohibitive for large‑scale digitisation projects; model distillation or quantisation could be explored.
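The Unicode-normalization issue flagged under "Script handling" is concrete: visually identical strings can differ at the codepoint level (precomposed vs. combining diacritics), which silently breaks exact-match evaluation. A minimal sketch with Python's standard `unicodedata` module:

```python
# Precomposed vs. decomposed forms of the same Greek word differ as
# codepoint sequences; normalizing both sides before comparison fixes it.
import unicodedata

precomposed = "λόγος"                            # ό as a single precomposed codepoint
decomposed = unicodedata.normalize("NFD", precomposed)  # ο + combining acute

assert precomposed != decomposed                 # different codepoints...
assert unicodedata.normalize("NFC", decomposed) == precomposed  # ...same text under NFC

def normalize_match(gold, pred, form="NFC"):
    """Exact match after normalizing both strings to the same Unicode form."""
    return unicodedata.normalize(form, gold) == unicodedata.normalize(form, pred)

print(normalize_match(precomposed, decomposed))  # True
```

Applying a single normalization form (NFC or NFD) to both gold and predicted strings before scoring avoids penalizing a model for codepoint-level differences that no reader can see.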

Bottom line: This study shows that modern LLMs are already powerful enough to act as “smart annotators” for languages that have historically been left out of the NLP map. For developers building pipelines around historical texts or endangered language resources, a few well‑chosen examples can unlock high‑quality lemmatization and POS tagging without the overhead of training bespoke models.

Authors

  • Chahan Vidal‑Gorène
  • Bastien Kindt
  • Florian Cafiero

Paper Information

  • arXiv ID: 2602.15753v1
  • Categories: cs.CL
  • Published: February 17, 2026