[Paper] OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report

Published: February 13, 2026, 12:47 PM EST
4 min read
Source: arXiv - 2602.13139v1

Overview

The paper presents OpenLID‑v3, an upgraded language‑identification (LID) model that tackles one of the most stubborn problems in multilingual data pipelines: reliably distinguishing closely related languages and filtering out non‑linguistic noise. By enriching the training data, merging ambiguous language clusters, and adding a dedicated “noise” label, the authors achieve higher precision—especially for low‑resource languages—while keeping the system easy to plug into existing workflows.

Key Contributions

  • Extended training corpus: Integrated additional web‑crawled texts, boosting coverage for under‑represented languages.
  • Cluster‑aware labeling: Merged problematic language‑variant groups (e.g., Bosnian/Croatian/Serbian) into single, more robust classes, reducing confusion.
  • Explicit noise detection: Introduced a special label that flags non‑natural‑language content (code snippets, boilerplate, etc.).
  • New benchmark datasets: Curated evaluation sets for three language families where prior benchmarks were insufficient.
  • Empirical comparison: Showed that OpenLID‑v3 outperforms the widely used GlotLID system on precision while maintaining comparable recall.
  • Open‑source release: Model and data are publicly available on Hugging Face, ready for immediate integration.

Methodology

  1. Data Augmentation – Harvested extra monolingual corpora from Common Crawl and other public sources, focusing on low‑resource languages and texts previously mis‑classified.
  2. Variant Clustering – Grouped highly similar variants (e.g., the three South‑Slavic languages) into a single label during training, then applied a lightweight post‑processing step to re‑assign finer‑grained tags when context allowed.
  3. Noise Labeling – Added a “noise” class; training examples were generated by mixing HTML snippets, code fragments, and random Unicode strings, teaching the model to reject them outright.
  4. Model Architecture – Built on the original OpenLID transformer backbone (a multilingual BERT‑style encoder) with a modestly larger classification head to accommodate the new labels.
  5. Evaluation – Conducted on three newly created test suites covering (a) Bosnian‑Croatian‑Serbian, (b) Northern‑Italian/Southern‑French Romance varieties, and (c) Scandinavian languages. Metrics include precision, recall, and coverage (the proportion of inputs the model assigns a language rather than “noise”).

All steps are described in a way that developers can reproduce them with standard Python tooling (🤗 Transformers, Datasets, and Hugging Face Hub).
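
For instance, step 3's synthetic noise examples might be generated along these lines. The snippet pools and the `"noise"` label name are assumptions for illustration:

```python
import random

# Illustrative generators for synthetic "noise" training examples,
# mirroring step 3: HTML boilerplate, code fragments, and random
# Unicode strings, each tagged with the dedicated noise class.
HTML_SNIPPETS = ['<div class="nav">', "&nbsp;&nbsp;</td></tr>"]
CODE_SNIPPETS = ["for (i = 0; i < n; i++) {", "import os, sys"]

def random_unicode(rng: random.Random, length: int = 12) -> str:
    """Random codepoints from a broad range (illustrative only)."""
    return "".join(chr(rng.randint(0x21, 0x2FFF)) for _ in range(length))

def make_noise_examples(n: int, seed: int = 0) -> list[tuple[str, str]]:
    """Produce n (text, label) pairs, all labelled as noise."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        kind = rng.choice(["html", "code", "unicode"])
        if kind == "html":
            text = rng.choice(HTML_SNIPPETS)
        elif kind == "code":
            text = rng.choice(CODE_SNIPPETS)
        else:
            text = random_unicode(rng)
        out.append((text, "noise"))
    return out

examples = make_noise_examples(4)
```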

Results & Findings

| System | Precision (overall) | Recall (overall) | Coverage (low‑resource) |
|---|---|---|---|
| GlotLID (baseline) | 84.2 % | 78.5 % | 71 % |
| OpenLID‑v3 (single model) | 90.7 % | 77.9 % | 78 % |
| OpenLID‑v3 (ensemble) | 92.3 % | 73.4 % | 65 % |

  • Precision gains are most pronounced on the three targeted language families, with error rates dropping from ~15 % to <5 % for Bosnian/Croatian/Serbian.
  • Noise detection reduces false positives on web‑scraped data by ~40 %, meaning fewer “garbage” sentences leak into downstream corpora.
  • Ensembling (combining three independently trained checkpoints) pushes precision even higher, but at the cost of lower coverage for scarce languages—an important trade‑off for pipelines that must balance quality vs. quantity.
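
The three metrics in the table can be computed from per-sentence predictions roughly as follows; the exact averaging scheme the authors use is not stated here, so this sketch assumes simple per-language counts:

```python
def lid_metrics(preds, golds, target_lang):
    """Per-language precision/recall plus coverage (share of inputs
    assigned any language rather than 'noise'). Illustrative sketch."""
    tp = sum(1 for p, g in zip(preds, golds) if p == g == target_lang)
    pred_pos = sum(1 for p in preds if p == target_lang)
    gold_pos = sum(1 for g in golds if g == target_lang)
    covered = sum(1 for p in preds if p != "noise")
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    coverage = covered / len(preds) if preds else 0.0
    return precision, recall, coverage

# Toy run: 4 sentences, one routed to the noise class.
preds = ["hrv", "hrv", "noise", "srp"]
golds = ["hrv", "srp", "hrv", "srp"]
p, r, c = lid_metrics(preds, golds, "hrv")
print(p, r, c)  # 0.5 0.5 0.75
```

Note how the noise label interacts with the metrics: sending a genuine sentence to "noise" lowers coverage and recall, while keeping a garbage sentence out of a language class protects precision.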

Practical Implications

  • Cleaner multilingual corpora – Data engineers can plug OpenLID‑v3 into web‑crawling pipelines to automatically filter out mislabeled or noisy rows before they reach downstream models (e.g., translation, sentiment analysis).
  • Better low‑resource language support – Researchers building NLP tools for languages like Bosnian or Sardinian will obtain higher‑quality training data, accelerating model development and reducing the need for manual cleaning.
  • Simplified deployment – The model is hosted on Hugging Face with a ready‑to‑use inference API; developers can call it via a single HTTP request or integrate it into existing PyTorch/TF pipelines.
  • Noise‑aware preprocessing – The explicit “noise” label enables conditional logic: route noisy inputs to a separate cleaning module, log them for quality monitoring, or discard them outright.
  • Scalable ensemble option – For high‑stakes applications (e.g., legal document processing) where precision outweighs coverage, teams can adopt the ensemble variant; for broader web‑scale ingestion, the single‑model version offers a sweet spot.
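
The noise-aware routing described above can be sketched as a thin wrapper around any LID call; `predict_language`, `fake_lid`, the confidence threshold, and the bucket names are all hypothetical stand-ins:

```python
from typing import Callable

def route(text: str,
          predict_language: Callable[[str], tuple[str, float]],
          min_confidence: float = 0.5) -> tuple[str, str]:
    """Route an input on the LID result: discard noise, send
    low-confidence predictions to review, pass the rest through.
    Thresholds and bucket names are illustrative."""
    label, score = predict_language(text)
    if label == "noise":
        return ("discard", label)
    if score < min_confidence:
        return ("review", label)
    return ("keep", label)

# Toy stand-in for a real LID model call.
def fake_lid(text: str) -> tuple[str, float]:
    if "<html>" in text:
        return ("noise", 0.99)
    return ("hrv_Latn", 0.9)

print(route("<html>", fake_lid))     # ('discard', 'noise')
print(route("dobar dan", fake_lid))  # ('keep', 'hrv_Latn')
```

Returning a routing decision rather than dropping rows in place keeps the cleaning step auditable: discarded and reviewed inputs can be logged for the quality monitoring the bullet above mentions.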

Limitations & Future Work

  • Coverage trade‑off – The ensemble’s higher precision comes with a noticeable drop in coverage for very low‑resource languages; balancing this remains an open engineering challenge.
  • Variant granularity – Merging language variants simplifies classification but may be insufficient for use‑cases that need fine‑grained dialect identification (e.g., regional speech analytics).
  • Domain bias – Training data are still largely web‑derived; performance on specialized domains (medical, legal) has not been evaluated.
  • Future directions suggested by the authors include: expanding the noise class with more adversarial examples, exploring multilingual adapters to reduce model size, and extending evaluation to additional language families (e.g., South‑Asian scripts).

Authors

  • Mariia Fedorova
  • Nikolay Arefyev
  • Maja Buljan
  • Jindřich Helcl
  • Stephan Oepen
  • Egil Rønningstad
  • Yves Scherrer

Paper Information

  • arXiv ID: 2602.13139v1
  • Categories: cs.CL
  • Published: February 13, 2026
