[Paper] Optimal Turkish Subword Strategies at Scale: Systematic Evaluation of Data, Vocabulary, Morphology Interplay
Source: arXiv - 2602.06942v1
Overview
Tokenization is the first step that turns raw text into something a neural model can understand, and its design becomes especially critical for morphologically rich languages like Turkish. This paper delivers the first large‑scale, systematic study of Turkish subword tokenizers, jointly varying vocabulary size and training‑corpus size, and evaluating a suite of downstream tasks ranging from sentiment analysis to dependency parsing. The authors also introduce a rich, morphology‑aware diagnostic toolkit that explains why certain tokenization choices succeed or fail.
Key Contributions
- Comprehensive “subwords manifest”: Simultaneously varies vocabulary size and tokenizer training data, enabling a controlled exploration of the data‑vocab‑performance triangle.
- Broad tokenizer comparison: Benchmarks WordPiece, a morphology‑level tokenizer (trained on morpheme boundaries), and a pure character baseline under identical parameter budgets.
- Morphology‑aware diagnostics: New intrinsic metrics (boundary‑level micro/macro F1, lemma‑atomicity vs. surface hits, over/under‑segmentation indices, CER/WER, continuation rates, affix‑type coverage) that link tokenization quality to downstream results.
- Extensive downstream evaluation: Tests on semantic tasks (NLI, STS, sentiment, NER), syntactic tasks (POS tagging, dependency parsing), and dedicated morphology probes.
- Open‑source release: Code, tokenizer pipelines, and pretrained models are publicly available, establishing a reproducible baseline for future work on Turkish and other MRLs.
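To make the boundary‑level F1 diagnostic concrete: the paper's exact metric definitions are not reproduced in this summary, so the snippet below is a minimal sketch under one plausible formulation, in which each segmentation of a word is treated as a set of character‑offset cut points and predicted cuts are scored against the gold morphological analysis. The example word and its analysis are illustrative.

```python
def boundary_set(pieces):
    """Character offsets where a segmentation places internal cuts."""
    cuts, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)
    return cuts

def boundary_f1(pred_pieces, gold_pieces):
    """Micro F1 between predicted and gold boundary positions for one word."""
    pred, gold = boundary_set(pred_pieces), boundary_set(gold_pieces)
    if not pred and not gold:
        return 1.0  # both leave the word whole
    tp = len(pred & gold)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# "evlerimizde" = ev+ler+imiz+de ("in our houses"); the tokenizer merges ev+ler
print(boundary_f1(["evler", "imiz", "de"], ["ev", "ler", "imiz", "de"]))
```

Averaging this score over words gives a micro‑level figure; averaging per affix type would approximate the macro variant mentioned above.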
Methodology
- Data‑Vocabulary Coupling – The authors create multiple training corpora (ranging from 10 M to 100 M Turkish sentences) and, for each corpus, train tokenizers with vocabularies of 8 k, 16 k, 32 k, and 64 k tokens. This ensures that any performance change can be attributed to the interplay of data size and vocabulary size rather than to uncontrolled variables.
- Tokenizer Families
  - WordPiece: The standard subword algorithm used in BERT‑style models.
  - Morphology‑Level: Tokens are forced to align with morpheme boundaries obtained from a high‑quality Turkish morphological analyzer.
  - Character Baseline: Each character is a token, serving as a lower bound for segmentation granularity.
- Training Regime – All tokenizer families are trained on the same raw Turkish corpus, and the downstream models that use them are pretrained with identical hyper‑parameters (e.g., learning rate, number of training steps), keeping the parameter budget constant across families.
- Evaluation Suite
  - Intrinsic: The morphology‑aware toolkit measures how well token boundaries match true morpheme boundaries, quantifies over‑/under‑segmentation, and reports edit‑distance‑based scores.
  - Extrinsic: Fine‑tuned transformer models (based on the same architecture) are evaluated on 7 downstream tasks, providing a real‑world performance picture.
- Analysis Pipeline – Correlation analyses link intrinsic diagnostics to downstream scores, revealing which tokenization properties matter most for each task type.
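The over‑/under‑segmentation indices in the intrinsic suite can be illustrated with a simple formulation based on token fertility (tokens produced per gold morpheme); the paper's actual definitions may differ, and the example word and its analysis are supplied purely for illustration.

```python
def segmentation_indices(pred_pieces, gold_morphemes):
    """Over-/under-segmentation of one word relative to a gold morpheme
    analysis, via fertility (predicted tokens per gold morpheme).
    Fertility > 1 means the tokenizer cuts more than the morphology does."""
    fertility = len(pred_pieces) / len(gold_morphemes)
    over = max(0.0, fertility - 1.0)   # surplus cuts per gold morpheme
    under = max(0.0, 1.0 - fertility)  # missing cuts per gold morpheme
    return fertility, over, under

# "kitaplarımdan" = kitap+lar+ım+dan ("from my books"): 4 gold morphemes,
# but the tokenizer produces only 2 pieces
print(segmentation_indices(["kitap", "larımdan"],
                           ["kitap", "lar", "ım", "dan"]))
# → (0.5, 0.0, 0.5): under-segmented
```

Corpus‑level indices would average these per‑word values, which is one way the toolkit could separate tokenizers that cut too eagerly from those that leave affixes fused.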
Results & Findings
- Vocabulary size matters, but only up to a point – For semantic tasks, moving from 8 k to 32 k tokens yields noticeable gains, while 64 k offers diminishing returns.
- Morphology‑level tokenizers excel on syntax‑heavy tasks – POS tagging and dependency parsing see up to 3.2 % absolute F1 improvement over WordPiece when the tokenizer respects morpheme boundaries.
- Character baseline lags on all tasks – Although character tokens can represent any morpheme boundary, the lack of higher‑level units hurts model efficiency and downstream accuracy.
- Data size amplifies benefits – Larger training corpora (≥ 50 M sentences) make the advantages of morphology‑aware tokenization more pronounced, especially for low‑resource downstream tasks like Turkish NER.
- Diagnostic toolkit predicts performance – Boundary‑level micro F1 and affix‑type coverage correlate strongly (ρ ≈ 0.78) with downstream F1 on syntactic tasks, confirming that fine‑grained token‑boundary quality drives model success.
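The reported ρ ≈ 0.78 is a Spearman rank correlation between an intrinsic diagnostic and a downstream score, computed across tokenizer configurations. The sketch below is self‑contained (no ties assumed) and uses made‑up illustrative scores, not the paper's data.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via the rank-difference formula
    (assumes no tied values, for simplicity)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# hypothetical per-tokenizer scores: boundary micro F1 vs. parsing F1
boundary_micro_f1 = [0.62, 0.71, 0.80, 0.85, 0.90]
parsing_f1 = [70.1, 72.3, 71.8, 74.0, 75.2]
print(spearman_rho(boundary_micro_f1, parsing_f1))  # strong positive correlation
```

In practice a library routine that handles ties (e.g., SciPy's `spearmanr`) would be used; the point is only that the correlation is over ranks, so it is robust to the differing scales of intrinsic and downstream metrics.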
Practical Implications
- Model builders can now choose a tokenizer strategy based on task requirements: use a morphology‑aware tokenizer for parsing, POS tagging, or any task where syntactic fidelity is crucial; stick with WordPiece for general‑purpose semantic tasks where a moderate vocab size suffices.
- Resource‑constrained teams can save compute by opting for a 32 k WordPiece vocab trained on a modest (≈ 20 M sentence) corpus without sacrificing much performance on sentiment or NLI.
- Pipeline integration – The released tokenizer pipelines can be dropped into existing Hugging Face workflows, allowing developers to swap tokenizers with a single line of code.
- Cross‑lingual transfer – The methodology and diagnostics are language‑agnostic, offering a blueprint for building effective tokenizers for other agglutinative languages (e.g., Finnish, Hungarian, Korean).
Limitations & Future Work
- Morphological analyzer dependency – The morphology‑level tokenizer relies on a high‑quality analyzer; languages lacking such tools may not reap the same gains.
- Scope of downstream tasks – While the suite is broad, it omits generation‑focused tasks (e.g., machine translation, summarization) where tokenization effects could differ.
- Compute budget – Training large vocabularies on the biggest corpora still demands substantial GPU resources, which may be prohibitive for smaller teams.
- Future directions suggested by the authors include extending the evaluation to generative models, exploring unsupervised morpheme discovery to reduce reliance on external analyzers, and applying the diagnostic toolkit to multilingual tokenizers to study cross‑lingual transfer dynamics.
Authors
- Duygu Altinok
Paper Information
- arXiv ID: 2602.06942v1
- Categories: cs.CL, cs.AI
- Published: February 6, 2026