[Paper] What Language is This? Ask Your Tokenizer
Source: arXiv - 2602.17655v1
Overview
Language identification (LID) is the first step in many multilingual NLP pipelines, but current tools stumble when faced with low‑resource languages or closely related dialects. The paper “What Language is This? Ask Your Tokenizer” proposes UniLID, a lightweight LID system that re‑uses the tokenizer already employed by large language models. By treating token segmentation as language‑specific while sharing a common vocabulary, UniLID delivers strong accuracy with minimal data and compute, making it a practical drop‑in for developers building multilingual applications.
Key Contributions
- Token‑centric LID: Introduces a novel LID approach that learns language‑conditional unigram probabilities over a shared tokenizer vocabulary.
- Data‑efficient training: Achieves >70 % accuracy with as few as five labeled examples per language, dramatically reducing annotation costs.
- Incremental language addition: New languages can be added without retraining the entire model, thanks to the modular unigram‑distribution design.
- Competitive benchmark performance: Matches or exceeds established baselines (fastText, GlotLID, CLD3) on standard LID datasets.
- Fine‑grained dialect detection: Shows large gains in distinguishing closely related dialects, a known weakness of existing systems.
Methodology
UniLID builds on the UnigramLM tokenization algorithm, which models text as a sequence of independently drawn tokens from a vocabulary. The authors extend this idea in two ways:
- Language‑conditional unigram distributions – For each language, a separate probability distribution over the shared token set is learned.
- Language‑specific segmentation – During inference, the tokenizer is allowed to segment the same raw string differently depending on the language hypothesis, reflecting real‑world orthographic variations (e.g., different word‑boundary conventions).
Training proceeds by maximizing the likelihood of a few labeled sentences per language, which is computationally cheap because only unigram counts need to be updated. At inference time, the model computes the likelihood of the observed tokenization under each language’s distribution and picks the highest‑scoring language. Because the vocabulary is shared, the system can be plugged directly into any existing LLM tokenization pipeline without extra preprocessing.
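The two extensions above can be sketched in a few dozen lines. This is a minimal illustration, not the authors' implementation: each language hypothesis holds its own unigram log-probabilities over a shared vocabulary, and a small Viterbi search finds the best segmentation of the raw string under each hypothesis (`UniLIDSketch` and the toy probability values are hypothetical stand-ins for the paper's actual tokenizer and learned distributions).

```python
import math

class UniLIDSketch:
    """Sketch of tokenizer-based LID: per-language unigram distributions
    over a shared vocabulary, with language-specific segmentation."""

    def __init__(self, vocab_logprobs):
        # vocab_logprobs: {lang: {token: log_prob}} over a shared vocab.
        self.vocab_logprobs = vocab_logprobs
        self.unk = -20.0  # floor log-prob for single chars not in the vocab

    def best_segmentation(self, text, lang):
        """Viterbi search for the max-likelihood segmentation of `text`
        under `lang`'s unigram distribution (UnigramLM-style)."""
        lp = self.vocab_logprobs[lang]
        n = len(text)
        best = [0.0] + [-math.inf] * n   # best[i] = score of text[:i]
        back = [0] * (n + 1)
        for i in range(1, n + 1):
            for j in range(max(0, i - 10), i):  # cap token length at 10
                piece = text[j:i]
                # In-vocab pieces get their log-prob; unknown single
                # characters get the floor; longer unknowns are barred.
                score = lp.get(piece, self.unk if i - j == 1 else -math.inf)
                if best[j] + score > best[i]:
                    best[i], back[i] = best[j] + score, j
        # Recover the token sequence by walking the back-pointers.
        tokens, i = [], n
        while i > 0:
            tokens.append(text[back[i]:i])
            i = back[i]
        return best[n], tokens[::-1]

    def identify(self, text):
        # Score the best segmentation under every language hypothesis
        # and return the highest-scoring language.
        scores = {lang: self.best_segmentation(text, lang)[0]
                  for lang in self.vocab_logprobs}
        return max(scores, key=scores.get)

# Toy shared-vocabulary distributions (hypothetical values).
lid = UniLIDSketch({
    "en": {"the": math.log(0.4), "dog": math.log(0.3)},
    "de": {"der": math.log(0.4), "hund": math.log(0.3)},
})
print(lid.identify("thedog"))  # picks "en": segments as the + dog
```

Note how the same character string is segmented differently depending on the hypothesis: under "en", `"thedog"` splits into two in-vocabulary tokens, while under "de" it falls back to low-probability single characters, which is exactly why the likelihood comparison is discriminative.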
Results & Findings
| Setting | fastText Baseline | UniLID | Gain (points) |
|---|---|---|---|
| Standard LID benchmark (high‑resource) | 98.3 % | 97.9 % | −0.4 |
| Low‑resource (5 labeled samples/language) | 58 % | 71 % | +13 |
| Dialect identification (e.g., Arabic dialects) | 62 % | 78 % | +16 |
- Sample efficiency: With only five labeled sentences per language, UniLID already surpasses 70 % accuracy, whereas fastText hovers around 58 %.
- Scalability: Adding a new language required updating only its unigram distribution; overall model size stayed constant.
- Speed: Inference adds negligible overhead to the tokenization step (≈ 1–2 ms per sentence on a CPU).
These results indicate that UniLID is not merely of academic interest: it delivers tangible accuracy improvements precisely where data is scarce or languages are closely related.
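The scalability finding above (adding a language touches only that language's table) can be sketched as follows. This is an illustrative sketch, not the paper's code: `train_language`, the toy whitespace `tokenize`, and the smoothing constants are all hypothetical stand-ins.

```python
import math
from collections import Counter

def train_language(sentences, tokenize, alpha=0.5, vocab_size=32000):
    """Estimate a smoothed unigram distribution for ONE language from a
    handful of labeled sentences. No other language's table is touched,
    so adding a language never requires retraining the whole model."""
    counts = Counter()
    for s in sentences:
        counts.update(tokenize(s))
    total = sum(counts.values())
    denom = total + alpha * vocab_size
    # Log-prob for seen tokens, plus one shared floor for unseen ones.
    logprobs = {t: math.log((c + alpha) / denom) for t, c in counts.items()}
    floor = math.log(alpha / denom)
    return logprobs, floor

def score(tokens, logprobs, floor):
    # Unigram log-likelihood of the token sequence under one language.
    return sum(logprobs.get(t, floor) for t in tokens)

tokenize = str.split  # toy stand-in for the shared LLM tokenizer

# Adding each language is a single, independent training call.
model = {"en": train_language(["the cat sat on the mat"], tokenize)}
model["de"] = train_language(["die katze sass auf der matte"], tokenize)

tokens = tokenize("der hund lief heim")
best = max(model, key=lambda lang: score(tokens, *model[lang]))
print(best)  # -> de
```

Because training reduces to counting, the per-language cost is a single pass over a few sentences, which is consistent with the five-example low-resource setting reported in the table.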
Practical Implications
- Plug‑and‑play multilingual pipelines: Developers can swap their existing LID component with UniLID and immediately benefit from better low‑resource handling without redesigning tokenizers.
- Cost‑effective data collection: Teams can bootstrap language support with a handful of annotated examples, accelerating product roll‑outs to new markets.
- Improved content moderation & routing: Accurate dialect detection helps route user‑generated content to the right language‑specific moderation models or translation services.
- Incremental language expansion: SaaS platforms can roll out support for emerging languages or regional variants on the fly, keeping the core model unchanged.
Limitations & Future Work
- Reliance on a shared tokenizer: UniLID’s performance hinges on the quality of the underlying tokenizer; poorly designed vocabularies may limit discrimination power.
- Unigram assumption: Modeling tokens independently ignores contextual cues that could further boost accuracy, especially for highly ambiguous scripts.
- Evaluation scope: The paper focuses on a curated set of languages and dialects; broader real‑world testing (e.g., noisy social‑media text) remains to be explored.
Future research directions include extending the framework to sub‑word or character‑level n‑gram models, integrating lightweight contextual signals, and benchmarking UniLID in production‑scale multilingual systems.
Authors
- Clara Meister
- Ahmetcan Yavuz
- Pietro Lesci
- Tiago Pimentel
Paper Information
- arXiv ID: 2602.17655v1
- Categories: cs.CL
- Published: February 19, 2026