[Paper] Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models
Source: arXiv - 2512.03989v1
Overview
The paper tackles a surprisingly practical problem: how to adapt the tokenizer of a large pre‑trained language model when you move it to a new domain or language. Rather than retraining a whole model, the authors show that modest, targeted changes to the tokenizer—extending it efficiently and pruning unused pieces—can boost performance and reduce waste, all while keeping the original model intact.
Key Contributions
- Continued BPE training: a method that “continues” the byte‑pair‑encoding merge process on domain‑specific data, avoiding the proliferation of dead tokens that plague naïve vocabulary extension.
- Leaf‑based vocabulary pruning: an algorithm that safely removes redundant sub‑tokens (leaf nodes) from the BPE tree, shrinking the vocabulary without hurting downstream accuracy.
- Comprehensive evaluation across several languages (English, Russian, Finnish, etc.) and model families (BERT, RoBERTa, XLM‑R), demonstrating consistent gains in tokenization efficiency and downstream task scores.
- Open‑source toolkit: a ready‑to‑use Python package that lets practitioners extend or prune tokenizers with a few lines of code.
Methodology
- Baseline tokenizer extension – The usual recipe: train a fresh BPE tokenizer on the new corpus, then append any new tokens to the existing vocabulary. This often creates many tokens that never appear in practice because the original tokenizer already covers most sub‑words.
- Continued BPE training – Instead of starting from scratch, the authors resume the original BPE merge operations on the new data (see the first sketch after this list). Concretely:
  - Load the original BPE merge table and vocabulary.
  - Feed the new domain corpus through the existing tokenizer to collect statistics on which merges would be most beneficial.
  - Perform additional merge steps (e.g., 5k–20k merges) to create new tokens that capture novel morphemes or domain jargon.
- Leaf‑based pruning – The BPE merge tree can be visualized as a hierarchy where leaf nodes are the smallest sub‑tokens. The pruning algorithm (see the second sketch after this list):
  - Counts token usage on a validation set.
  - Removes leaf tokens whose removal does not increase the total number of merges needed to reconstruct the original text (i.e., they are fully covered by higher‑level tokens).
  - Re‑indexes the vocabulary, keeping the model's embedding matrix size unchanged or optionally shrinking it.
- Evaluation pipeline – The adapted tokenizers are plugged into pre‑trained models without any weight fine‑tuning, then the models are evaluated on standard benchmarks (e.g., GLUE, XNLI, domain‑specific classification tasks).
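Conceptually, continued BPE training is ordinary BPE learning that starts from the original merge table rather than from raw characters. The sketch below illustrates that idea in plain Python; the function names and the in-memory representation are assumptions of ours (not the paper's toolkit), and details such as byte-level pre-tokenization and end-of-word markers are ignored.

```python
from collections import Counter

def _apply(symbols, left, right):
    """Greedily merge every adjacent (left, right) pair in a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def continue_bpe(word_counts, merges, num_new_merges=8000):
    """Resume BPE merge learning on new-domain data (illustrative sketch).

    word_counts: {word: frequency} from the new corpus.
    merges: ordered list of (left, right) pairs from the original tokenizer.
    """
    # Segment each word with the *existing* merge table first, so the new merges
    # extend the original tokenizer instead of competing with it.
    segmented = {}
    for word in word_counts:
        symbols = list(word)
        for left, right in merges:
            symbols = _apply(symbols, left, right)
        segmented[word] = symbols

    new_merges = []
    for _ in range(num_new_merges):
        # Count adjacent symbol pairs, weighted by word frequency in the new corpus.
        pairs = Counter()
        for word, symbols in segmented.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += word_counts[word]
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent pair on the new data
        new_merges.append(best)
        for word in segmented:                  # apply it before choosing the next merge
            segmented[word] = _apply(segmented[word], *best)

    return merges + new_merges                  # original table, extended in order
```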
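The second sketch approximates the pruning pass under the same simplified representation. How the paper defines "leaf" and "fully covered" may differ in detail; here a leaf is a merged token that is never used as a component of another merge, and we prune leaves that never appear in the final segmentation of a held-out corpus.

```python
def prune_leaf_tokens(vocab, merges, usage_counts, keep_threshold=0):
    """Drop BPE tokens that are leaves of the merge hierarchy and unused (sketch).

    vocab: {token: id}; merges: ordered list of (left, right) pairs;
    usage_counts: {token: occurrences in the final segmentation of a held-out corpus}.
    """
    # A token is treated as a leaf if it is never a component of another merge,
    # so removing it cannot break the derivation of any remaining token.
    used_as_component = {left for left, _ in merges} | {right for _, right in merges}

    pruned_vocab, pruned_merges = dict(vocab), []
    for left, right in merges:
        token = left + right
        is_leaf = token not in used_as_component
        if is_leaf and usage_counts.get(token, 0) <= keep_threshold:
            pruned_vocab.pop(token, None)   # its text is already covered by other tokens
        else:
            pruned_merges.append((left, right))

    # Re-index contiguously; the embedding matrix can optionally be shrunk to match.
    pruned_vocab = {tok: i for i, tok in enumerate(pruned_vocab)}
    return pruned_vocab, pruned_merges
```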
Results & Findings
| Setting | Vocabulary size change | % of new tokens actually used | Downstream accuracy change |
|---|---|---|---|
| Naïve extension (10 k new tokens) | +10 k | ~12 % | –0.3 % (GLUE avg.) |
| Continued BPE (10 k new merges) | +10 k | ~68 % | +0.6 % (GLUE avg.) |
| Continued BPE + leaf pruning (net –2 k) | –2 k (vs. original) | N/A | +0.5 % (GLUE avg.) |
| Multilingual XLM‑R (Russian domain) | +5 k → –1 k after pruning | 73 % | +1.2 % (XNLI RU) |
- Higher utilization: Continued BPE makes the added vocabulary much more useful (up to 5‑6× higher token usage).
- No degradation: Pruning removes up to ~20 % of the original vocab without measurable loss, sometimes even yielding slight gains due to reduced token fragmentation.
- Speed & memory: Smaller, cleaner vocabularies lead to ~3 % faster tokenization and a modest reduction in GPU memory (fewer embedding look‑ups).
Practical Implications
- Domain adaptation made cheap – You can retrofit an existing BERT‑style model to a specialized corpus (legal docs, medical notes, code snippets) by running a quick continued‑BPE pass instead of full model re‑training.
- Multilingual roll‑outs – For low‑resource languages, extending a multilingual tokenizer with a few thousand merges can capture language‑specific morphemes without inflating the shared vocab.
- Deployments with tight memory budgets – Leaf pruning can shave off unused embeddings, which is valuable for edge devices or serverless inference where every megabyte counts.
- Tooling integration – The authors' open‑source package plugs into Hugging Face's `tokenizers` library, meaning you can add a single line like `adapt_tokenizer(model, new_corpus, merges=8000)` to your data pipeline (a wiring sketch follows below).
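For orientation, here is a hedged sketch of how new tokens (for example, those produced by a continued‑BPE pass) can be registered with a Hugging Face tokenizer/model pair using only generic `transformers` calls. The authors' package wraps steps like these; its actual interface (such as the `adapt_tokenizer` call quoted above) may differ, and the model name and tokens below are placeholders.

```python
# Generic wiring sketch using standard transformers calls; the paper's toolkit
# automates this end to end. Model name and new tokens are placeholders.
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

new_tokens = ["pharmacokinetics", "thrombolysis"]  # e.g., output of a continued-BPE pass
num_added = tokenizer.add_tokens(new_tokens)

if num_added > 0:
    # Grow the embedding matrix so the new token ids have rows; existing rows and
    # all other weights stay untouched, matching the frozen-model evaluation above.
    model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("adapted-tokenizer")
model.save_pretrained("adapted-model")
```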
Limitations & Future Work
- Dependency on original BPE quality – If the base tokenizer was poorly trained (e.g., a vocabulary that is too small), continued BPE can only do so much; the authors note diminishing returns for extremely low‑capacity vocabularies.
- Static embeddings – The study keeps the model weights frozen; coupling tokenizer adaptation with lightweight embedding fine‑tuning could unlock further gains, which the authors leave for future exploration.
- Evaluation scope – Experiments focus on classification benchmarks; generation tasks (e.g., summarization, translation) may react differently to token changes and merit separate study.
- Automation – Deciding the optimal number of additional merges or pruning thresholds currently requires manual tuning; an adaptive stopping criterion is a promising next step.
Authors
- Taido Purason
- Pavel Chizhov
- Ivan P. Yamshchikov
- Mark Fishel
Paper Information
- arXiv ID: 2512.03989v1
- Categories: cs.CL
- Published: December 3, 2025