[Paper] Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs

Published: December 26, 2025 at 04:16 AM EST
4 min read

Source: arXiv - 2512.21933v1

Overview

The paper “Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs” investigates a surprisingly simple yet under‑explored factor that can hurt large language models: the way their tokenizers split ordinary words into multiple sub‑tokens. By quantifying how “broken” a tokenization is, the authors show that higher fragmentation correlates with lower accuracy across a variety of downstream NLP tasks.

Key Contributions

  • Tokenization‑Penalty Metrics – Introduces a family of lightweight penalty functions that score any piece of text according to how badly it is broken up by a given model’s tokenizer.
  • Empirical Correlation Study – Demonstrates statistically significant links between tokenization penalty and performance drops on tasks such as sentiment analysis, NER, QA, and summarisation.
  • Cross‑Model Analysis – Evaluates the hypothesis on several open‑source LLMs (e.g., Mistral, Llama‑2, Falcon), showing that the effect is consistent regardless of architecture or size.
  • Practical Diagnostic Tool – Provides open‑source code for computing penalties, enabling developers to spot “high‑risk” inputs before feeding them to a model.
  • Guidelines for Mitigation – Offers concrete recommendations (e.g., vocabulary augmentation, prompt preprocessing) to reduce tokenization‑induced errors.

Methodology

  1. Define Penalty Functions – The authors design three simple metrics:
    • Fragmentation Ratio: number of sub‑tokens per natural word (a minimal sketch follows this list).
    • Rare‑Subtoken Weight: higher weight for sub‑tokens that appear infrequently in the model’s training corpus.
    • Boundary Disruption Score: penalizes splits that cut across morpheme boundaries (detected via a lightweight morphological analyzer).
  2. Dataset Preparation – Standard benchmark datasets for each task (e.g., SST‑2 for sentiment, CoNLL‑2003 for NER) are tokenized with each model’s native tokenizer.
  3. Correlation Analysis – For every example, the penalty score is computed and then correlated with the model’s prediction correctness (binary success/failure). Statistical significance is assessed using Pearson’s r and permutation tests (a sketch follows the pipeline note below).
  4. Ablation Experiments – The authors artificially “repair” high‑penalty inputs by merging split sub‑tokens (where possible) and observe performance recovery, confirming causality rather than mere correlation.
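
To make the first step concrete, here is a minimal sketch of the fragmentation‑ratio penalty. It assumes a Hugging Face tokenizer and whitespace word splitting; the model name, threshold, and function name are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of the fragmentation-ratio penalty (step 1).
# Assumptions (not from the paper): a Hugging Face tokenizer,
# whitespace word splitting, and an arbitrary flagging threshold.
from transformers import AutoTokenizer

# Any tokenizer works; Falcon is one of the model families studied.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

def fragmentation_ratio(text: str) -> float:
    """Average number of sub-tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    subtokens_per_word = [len(tokenizer.tokenize(w)) for w in words]
    return sum(subtokens_per_word) / len(words)

# Example: flag prompts whose penalty exceeds a chosen threshold.
prompt = "The defendant claimed the anticoagulant dosage was miscalculated."
score = fragmentation_ratio(prompt)
if score > 1.5:  # threshold is illustrative, not from the paper
    print(f"High fragmentation ({score:.2f}); consider rephrasing.")
```

Splitting on whitespace is the crudest possible notion of a “natural word”; the rare‑subtoken and boundary‑disruption scores would additionally need corpus frequency tables and the morphological analyzer mentioned above.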

The entire pipeline is implemented in Python with only the model’s tokenizer and a small morphological lookup table, making it reproducible on commodity hardware.
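
For the correlation analysis in step 3, the computation can be sketched with NumPy and SciPy. The arrays below are synthetic stand‑ins for per‑example penalty scores and binary correctness flags; the number of permutations is arbitrary, and the paper’s exact procedure may differ.

```python
# Sketch of the penalty-vs-correctness correlation analysis (step 3).
# `penalties` and `correct` are synthetic stand-ins: one penalty score
# and one binary success flag per benchmark example.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
penalties = rng.gamma(shape=2.0, scale=0.5, size=1_000)
correct = (rng.random(1_000) > penalties / penalties.max()).astype(float)

r_obs, _ = pearsonr(penalties, correct)

# Permutation test: shuffle the correctness labels and recompute r
# to obtain a null distribution for the observed correlation.
n_perm = 10_000
perm_rs = np.array([pearsonr(penalties, rng.permutation(correct))[0]
                    for _ in range(n_perm)])
p_value = (np.abs(perm_rs) >= abs(r_obs)).mean()

print(f"Pearson r = {r_obs:.3f}, permutation p ≈ {p_value:.4f}")
```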

Results & Findings

Model          Avg. Fragmentation Ratio   Accuracy Drop (high‑penalty vs. low‑penalty)
Mistral‑7B     1.42                       –4.7 % (sentiment)
Llama‑2‑13B    1.31                       –3.2 % (NER)
Falcon‑40B     1.27                       –2.9 % (QA)

  • Statistical Significance – All correlations are significant at p < 0.001 after Bonferroni correction.
  • Repair Gains – Merging split tokens (e.g., re‑joining a word such as “martial” that the tokenizer breaks into multiple sub‑tokens) recovers 2–3 % absolute accuracy on the hardest examples.
  • Task Sensitivity – Tokenization penalties hurt tasks that rely heavily on lexical cues (NER, sentiment) more than generative tasks (summarisation).

Overall, the study confirms that the more a natural word is fragmented, the higher the chance the model will misinterpret it.

Practical Implications

  • Prompt Engineering – Before sending a prompt to an LLM, run the penalty calculator. If the score exceeds a threshold, consider re‑phrasing or using synonyms that stay intact in the tokenizer’s vocabulary.
  • Custom Tokenizer Extensions – For domain‑specific vocabularies (e.g., medical or legal jargon), adding high‑frequency words to the tokenizer can dramatically reduce fragmentation and improve downstream accuracy (a minimal sketch follows this list).
  • Model Selection – When choosing an LLM for a word‑sensitive application, compare average fragmentation ratios on a representative corpus; a lower ratio often translates to better out‑of‑the‑box performance.
  • Debugging Tool – The open‑source penalty library can be integrated into CI pipelines to flag data samples that are likely to cause failures, enabling early data cleaning.
  • Fine‑Tuning Strategies – During fine‑tuning, augment the loss with a tokenization‑penalty regularizer, encouraging the model to rely less on split sub‑tokens for critical predictions.
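
As an illustration of the vocabulary‑augmentation idea, the sketch below adds domain terms to a Hugging Face tokenizer and resizes the model’s embedding matrix. The model name and term list are placeholders, and the newly added embedding rows start untrained, so a round of domain fine‑tuning is still required before the change pays off.

```python
# Sketch: extend a tokenizer's vocabulary with domain terms that the
# stock tokenizer would otherwise fragment. The model name and term
# list are placeholders, not recommendations from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

domain_terms = ["pneumothorax", "anticoagulant", "estoppel"]
# Only add terms the tokenizer currently breaks into several pieces.
to_add = [t for t in domain_terms if len(tokenizer.tokenize(t)) > 1]

num_added = tokenizer.add_tokens(to_add)
if num_added:
    # New embedding rows are randomly initialised; domain fine-tuning
    # is still needed before the added tokens improve accuracy.
    model.resize_token_embeddings(len(tokenizer))
```

This is the cheapest form of the mitigation; as the limitations section notes, a larger vocabulary also means a larger embedding matrix and potentially slower inference.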

Limitations & Future Work

  • Morphological Approximation – The boundary disruption score uses a simple rule‑based analyzer, which may misidentify splits in languages with complex morphology.
  • Scope of Models – Experiments focus on a handful of open‑source LLMs; proprietary models (e.g., GPT‑4) may exhibit different sensitivities.
  • Mitigation Techniques – While the paper proposes vocabulary augmentation, it does not explore the trade‑offs of larger vocabularies (e.g., increased memory, slower inference).
  • Dynamic Tokenizers – Future work could investigate adaptive tokenizers that learn to merge high‑penalty sub‑tokens on the fly, or tokenization‑aware training objectives that directly penalize fragmentation.

By shining a light on the hidden cost of “broken words,” this research opens a practical avenue for developers to squeeze more reliability out of existing LLMs without heavyweight model changes.

Authors

  • Sachin Pawar
  • Manoj Apte
  • Kshitij Jadhav
  • Girish Keshav Palshikar
  • Nitin Ramrakhiyani

Paper Information

  • arXiv ID: 2512.21933v1
  • Categories: cs.CL
  • Published: December 26, 2025
  • PDF: Download PDF