[Paper] Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Source: arXiv - 2512.03989v1
Overview
The paper tackles a surprisingly practical problem: how to adapt the tokenizer of a large pre‑trained language model when you move it to a new domain or language. Rather than retraining a whole model, the authors show that modest, targeted changes to the tokenizer—extending it efficiently and pruning unused pieces—can boost performance and reduce waste, all while keeping the original model intact.
Key Contributions
- Continued BPE training: a method that “continues” the byte‑pair‑encoding merge process on domain‑specific data, avoiding the proliferation of dead tokens that plague naïve vocabulary extension.
- Leaf‑based vocabulary pruning: an algorithm that safely removes redundant sub‑tokens (leaf nodes) from the BPE tree, shrinking the vocabulary without hurting downstream accuracy.
- Comprehensive evaluation across several languages (English, Russian, Finnish, etc.) and model families (BERT, RoBERTa, XLM‑R), demonstrating consistent gains in tokenization efficiency and downstream task scores.
- Open‑source toolkit: a ready‑to‑use Python package that lets practitioners extend or prune tokenizers with a few lines of code.
Methodology
- Baseline tokenizer extension – The usual recipe: train a fresh BPE tokenizer on the new corpus, then append any new tokens to the existing vocabulary. This often creates many tokens that never appear in practice because the original tokenizer already covers most sub‑words.
- Continued BPE training – Instead of starting from scratch, the authors resume the original BPE merge operations on the new data (see the first sketch after this list). Concretely:
  - Load the original BPE merge table and vocabulary.
  - Feed the new domain corpus through the existing tokenizer to collect statistics on which merges would be most beneficial.
  - Perform additional merge steps (e.g., 5k–20k merges) to create new tokens that capture novel morphemes or domain jargon.
- Leaf‑based pruning – The BPE merge tree can be visualized as a hierarchy where leaf nodes are the smallest sub‑tokens. The pruning algorithm (see the second sketch after this list):
  - Counts token usage on a validation set.
  - Removes leaf tokens whose removal does not increase the total number of merges needed to reconstruct the original text (i.e., they are fully covered by higher‑level tokens).
  - Re‑indexes the vocabulary, keeping the model's embedding matrix size unchanged or optionally shrinking it.
- Evaluation pipeline – The adapted tokenizers are plugged into pre‑trained models without any weight fine‑tuning, then the models are evaluated on standard benchmarks (e.g., GLUE, XNLI, domain‑specific classification tasks).
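Conceptually, continued BPE training is ordinary BPE learning that starts from the original merge table rather than from raw characters. The sketch below illustrates that idea in plain Python; the function names and the in-memory representation are assumptions of ours (not the paper's toolkit), and details such as byte-level pre-tokenization and end-of-word markers are ignored.

```python
from collections import Counter

def _apply(symbols, left, right):
    """Greedily merge every adjacent (left, right) pair in a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def continue_bpe(word_counts, merges, num_new_merges=8000):
    """Resume BPE merge learning on new-domain data (illustrative sketch).

    word_counts: {word: frequency} from the new corpus.
    merges: ordered list of (left, right) pairs from the original tokenizer.
    """
    # Segment each word with the *existing* merge table first, so the new merges
    # extend the original tokenizer instead of competing with it.
    segmented = {}
    for word in word_counts:
        symbols = list(word)
        for left, right in merges:
            symbols = _apply(symbols, left, right)
        segmented[word] = symbols

    new_merges = []
    for _ in range(num_new_merges):
        # Count adjacent symbol pairs, weighted by word frequency in the new corpus.
        pairs = Counter()
        for word, symbols in segmented.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += word_counts[word]
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent pair on the new data
        new_merges.append(best)
        for word in segmented:                  # apply it before choosing the next merge
            segmented[word] = _apply(segmented[word], *best)

    return merges + new_merges                  # original table, extended in order
```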
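The second sketch approximates the pruning pass under the same simplified representation. How the paper defines "leaf" and "fully covered" may differ in detail; here a leaf is a merged token that is never used as a component of another merge, and we prune leaves that never appear in the final segmentation of a held-out corpus.

```python
def prune_leaf_tokens(vocab, merges, usage_counts, keep_threshold=0):
    """Drop BPE tokens that are leaves of the merge hierarchy and unused (sketch).

    vocab: {token: id}; merges: ordered list of (left, right) pairs;
    usage_counts: {token: occurrences in the final segmentation of a held-out corpus}.
    """
    # A token is treated as a leaf if it is never a component of another merge,
    # so removing it cannot break the derivation of any remaining token.
    used_as_component = {left for left, _ in merges} | {right for _, right in merges}

    pruned_vocab, pruned_merges = dict(vocab), []
    for left, right in merges:
        token = left + right
        is_leaf = token not in used_as_component
        if is_leaf and usage_counts.get(token, 0) <= keep_threshold:
            pruned_vocab.pop(token, None)   # its text is already covered by other tokens
        else:
            pruned_merges.append((left, right))

    # Re-index contiguously; the embedding matrix can optionally be shrunk to match.
    pruned_vocab = {tok: i for i, tok in enumerate(pruned_vocab)}
    return pruned_vocab, pruned_merges
```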
Results & Findings
| Setting | Vocabulary size change | % of new tokens actually used | Downstream accuracy change |
|---|---|---|---|
| Naïve extension (10 k new tokens) | +10 k | ~12 % | –0.3 % (GLUE avg.) |
| Continued BPE (10 k new merges) | +10 k | ~68 % | +0.6 % (GLUE avg.) |
| Continued BPE + leaf pruning (net –2 k) | –2 k (vs. original) | N/A | +0.5 % (GLUE avg.) |
| Multilingual XLM‑R (Russian domain) | +5 k → –1 k after pruning | 73 % | +1.2 % (XNLI RU) |
- Higher utilization: Continued BPE makes the added vocabulary much more useful (up to 5‑6× higher token usage).
- No degradation: Pruning removes up to ~20 % of the original vocab without measurable loss, sometimes even yielding slight gains due to reduced token fragmentation.
- Speed & memory: Smaller, cleaner vocabularies lead to ~3 % faster tokenization and a modest reduction in GPU memory (fewer embedding look‑ups).
Practical Implications
- Domain adaptation made cheap – You can retrofit an existing BERT‑style model to a specialized corpus (legal docs, medical notes, code snippets) by running a quick continued‑BPE pass instead of full model re‑training.
- Multilingual roll‑outs – For low‑resource languages, extending a multilingual tokenizer with a few thousand merges can capture language‑specific morphemes without inflating the shared vocab.
- Deployments with tight memory budgets – Leaf pruning can shave off unused embeddings, which is valuable for edge devices or serverless inference where every megabyte counts.
- Tooling integration – The authors' open‑source package plugs into Hugging Face's `tokenizers` library, meaning you can add a single line like `adapt_tokenizer(model, new_corpus, merges=8000)` to your data pipeline (a wiring sketch follows below).
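For orientation, here is a hedged sketch of how new tokens (for example, those produced by a continued‑BPE pass) can be registered with a Hugging Face tokenizer/model pair using only generic `transformers` calls. The authors' package wraps steps like these; its actual interface (such as the `adapt_tokenizer` call quoted above) may differ, and the model name and tokens below are placeholders.

```python
# Generic wiring sketch using standard transformers calls; the paper's toolkit
# automates this end to end. Model name and new tokens are placeholders.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

new_tokens = ["pharmacokinetics", "thrombolysis"]  # e.g., output of a continued-BPE pass
num_added = tokenizer.add_tokens(new_tokens)

if num_added > 0:
    # Grow the embedding matrix so the new token ids have rows; existing rows and
    # all other weights stay untouched, matching the frozen-model evaluation above.
    model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("adapted-tokenizer")
model.save_pretrained("adapted-model")
```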
Limitations & Future Work
- Dependency on original BPE quality – If the base tokenizer was poorly trained (e.g., a vocabulary that is too small), continued BPE can only do so much; the authors note diminishing returns for extremely low‑capacity vocabularies.
- Static embeddings – The study keeps the model weights frozen; coupling tokenizer adaptation with lightweight embedding fine‑tuning could unlock further gains, which the authors leave for future exploration.
- Evaluation scope – Experiments focus on classification benchmarks; generation tasks (e.g., summarization, translation) may react differently to token changes and merit separate study.
- Automation – Deciding the optimal number of additional merges or pruning thresholds currently requires manual tuning; an adaptive stopping criterion is a promising next step.
Authors
- Taido Purason
- Pavel Chizhov
- Ivan P. Yamshchikov
- Mark Fishel
Paper Information
- arXiv ID: 2512.03989v1
- Categories: cs.CL
- Published: December 3, 2025