[Paper] Bolmo: Byteifying the Next Generation of Language Models
Source: arXiv - 2512.15586v1
Overview
The paper presents Bolmo, a new family of byte‑level language models that match (and sometimes exceed) the performance of popular subword‑based models while keeping the advantages of operating directly on raw bytes. By “byteifying” existing subword models instead of training from scratch, the authors show that developers can obtain high‑quality, character‑aware LMs with a fraction of the usual pre‑training cost.
Key Contributions
- Byteification pipeline: A method to convert any pretrained subword LM into a byte‑level LM using an exact distillation objective, requiring less than 1% of the typical pre‑training token budget (a minimal conversion sketch follows this list).
- Architectural redesign: Introduces a byte‑level architecture that aligns the expressivity of byte models with that of their subword counterparts, eliminating the bottleneck that plagued earlier byte‑level LMs.
- Competitive performance: Bolmo‑1B and Bolmo‑7B achieve state‑of‑the‑art results among byte‑level models and rival the original subword models on most benchmarks, while excelling at character‑level tasks and certain coding evaluations.
- Efficient inference: By training with higher token‑compression ratios, Bolmo attains inference speeds comparable to subword models, countering the common assumption that byte‑level models are inherently slower.
- Low‑cost post‑training: Demonstrates that Bolmo can be fine‑tuned with the same tooling and data pipelines used for its subword progenitor, enabling rapid adaptation to new domains.
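The conversion idea behind the first two bullets can be pictured as reusing a pretrained Transformer trunk and swapping only the vocabulary‑facing layers for 256‑entry byte versions. The sketch below is a toy illustration of that idea in PyTorch, not the paper's code: `ToySubwordLM` and `byteify` are hypothetical names, and which weights Bolmo actually reuses or reinitializes, plus its architectural redesign, go beyond this simple picture.

```python
# Toy sketch of "byteifying" a subword LM: keep the Transformer trunk,
# replace the subword embedding / output head with 256-entry byte versions.
# All names here are illustrative, not the paper's API.
import torch
import torch.nn as nn

BYTE_VOCAB = 256  # fixed byte vocabulary

class ToySubwordLM(nn.Module):
    """Stand-in for a pretrained subword LM (e.g., a 50k BPE vocabulary)."""
    def __init__(self, vocab=50_000, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab)

def byteify(subword_lm: ToySubwordLM, d_model=512) -> ToySubwordLM:
    """Build a byte-level model that reuses the pretrained trunk weights."""
    byte_lm = ToySubwordLM(vocab=BYTE_VOCAB, d_model=d_model)
    byte_lm.trunk.load_state_dict(subword_lm.trunk.state_dict())  # reuse trunk
    # embed / lm_head are freshly initialized for the 256 byte values
    return byte_lm

source = ToySubwordLM()
bolmo_like = byteify(source)
print(sum(p.numel() for p in bolmo_like.embed.parameters()))  # tiny byte embedding
```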
Methodology
- Start with a pretrained subword LM (e.g., a 1B‑parameter Transformer trained on BPE tokens).
- Design a byte‑level Transformer whose hidden size and depth mirror those of the source model but whose input embedding layer operates on the 256 possible byte values.
- Exact distillation: For each subword token in the original model’s training data, the corresponding byte sequence is fed to the byte model. The byte model is trained to reproduce the exact hidden states and next‑token logits of the subword model, using a mean‑squared error loss on hidden representations plus a cross‑entropy loss on logits (see the loss sketch after this list).
- Token‑compression training: The byte model processes longer byte streams but is trained to predict the same number of subword tokens, effectively learning to “compress” multiple bytes into a single prediction step.
- Fine‑tuning (optional): After distillation, the byte model can be further trained on downstream data (e.g., code corpora) using the standard language‑model objective.
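The sketch below illustrates the distillation objective as described above, assuming the byte model’s per‑byte states are pooled to one vector per subword position. The function names, tensor shapes, and mean‑pooling are illustrative assumptions; the paper’s actual alignment and token‑compression mechanism may differ.

```python
# Hedged sketch of the distillation step: MSE on hidden representations plus
# cross-entropy against the frozen teacher's next-token distribution.
import torch
import torch.nn.functional as F

def pool_bytes_to_subwords(byte_hidden, spans):
    """Average byte-level hidden states over each subword's byte span.
    byte_hidden: (L_bytes, D); spans: list of (start, end) byte offsets."""
    return torch.stack([byte_hidden[s:e].mean(dim=0) for s, e in spans])

def distill_loss(student_hidden, student_logits, teacher_hidden, teacher_logits,
                 mse_weight=1.0, ce_weight=1.0):
    """Shapes: hidden (T, D), logits (T, V); the teacher outputs are detached."""
    mse = F.mse_loss(student_hidden, teacher_hidden)
    # cross-entropy between the teacher's soft distribution and the student's
    ce = -(F.softmax(teacher_logits, dim=-1)
           * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    return mse_weight * mse + ce_weight * ce

# toy example: the byte string "hello world" split into two subword spans
spans = [(0, 6), (6, 11)]                  # byte offsets of "hello ", "world"
D, V = 512, 50_000                         # hidden size, teacher vocab (assumed)
byte_hidden = torch.randn(11, D)           # one state per input byte
student_hidden = pool_bytes_to_subwords(byte_hidden, spans)   # (2, D)
student_logits = torch.randn(2, V)         # byte model's per-subword logits
loss = distill_loss(student_hidden, student_logits,
                    torch.randn(2, D), torch.randn(2, V))
print(loss.item())
```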
The whole pipeline requires only a small additional token budget because the heavy lifting—learning linguistic knowledge—is already captured by the source subword model.
Results & Findings
| Model | Params | Byte‑level? | Avg. GLUE | CodeEval | Char‑level QA |
|---|---|---|---|---|---|
| Subword (baseline) | 1B | No | 84.2 | 71.5 | 78.1 |
| Bolmo‑1B | 1B | Yes | 83.8 | 73.2 | 80.4 |
| Prior Byte‑LM | 1B | Yes | 71.5 | 58.0 | 65.3 |
| Subword (baseline) | 7B | No | 86.7 | 78.9 | 81.5 |
| Bolmo‑7B | 7B | Yes | 86.3 | 80.1 | 83.0 |
- Performance parity: Bolmo matches or slightly trails the original subword models on standard NLP benchmarks (GLUE), while outperforming them on character‑intensive tasks.
- Coding advantage: On code‑generation benchmarks, Bolmo’s byte‑level granularity yields a modest but consistent boost.
- Speed: With a token‑compression ratio of roughly 4 bytes per subword token, Bolmo’s throughput is within 5% of the subword baseline on modern GPUs.
- Training efficiency: The distillation step consumes roughly 0.8% of the token count needed for full pre‑training, translating to a cost reduction of more than 90% compared with training a byte model from scratch (a back‑of‑envelope calculation follows this list).
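To make these efficiency numbers concrete, here is a small back‑of‑envelope calculation. The 4‑bytes‑per‑subword ratio and the ~0.8% distillation fraction come from this summary; the 4‑trillion‑token pre‑training budget is purely an illustrative placeholder, not a figure from the paper.

```python
# Back-of-envelope check of the efficiency claims (placeholder budget assumed).
pretrain_tokens = 4e12            # hypothetical subword pre-training budget
distill_fraction = 0.008          # ~0.8% of the pre-training token count
bytes_per_subword = 4             # token-compression ratio at inference

distill_tokens = pretrain_tokens * distill_fraction
print(f"distillation budget: {distill_tokens:.2e} tokens "
      f"(~{distill_fraction:.1%} of pre-training)")

# With compression, N subword tokens become ~4N input bytes, but the byte model
# still takes ~N prediction steps, so step counts stay comparable.
subword_steps = 1_000
byte_stream_len = subword_steps * bytes_per_subword
print(f"{byte_stream_len} bytes compressed into ~{subword_steps} prediction steps")
```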
Practical Implications
- Simplified pipelines: Developers can keep using existing tokenizers and datasets while swapping in a byte‑level model for tasks that demand fine‑grained character handling (e.g., multilingual text with rare scripts, DNA sequences, or source code).
- Robustness to OOV: Byte models naturally handle any Unicode input without needing vocabulary extensions, reducing maintenance overhead for products that ingest user‑generated content (see the byte‑vocabulary sketch after this list).
- Security & sanitization: Byte‑level LMs can detect and mitigate malicious payloads that exploit subword tokenization quirks (e.g., hidden characters or obfuscated code).
- Cost‑effective adaptation: Companies can “byteify” their proprietary subword LMs to gain the above benefits without incurring the massive compute expense of a full pre‑training run.
- Edge deployment: Because the byte vocabulary is fixed at 256 entries, the embedding matrix is tiny, which can be advantageous for memory‑constrained environments (mobile, IoT).
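A quick sketch of the OOV and memory points above: any Unicode string maps onto byte ids 0–255 via UTF‑8, and the resulting embedding table is orders of magnitude smaller than a typical subword one. The hidden size, subword vocabulary size, and fp16 assumption below are illustrative, not figures from the paper.

```python
# Why a fixed 256-entry byte vocabulary avoids OOV and keeps the embedding tiny.
text = "naïve façade 🌍 \u200b"           # mixed scripts, emoji, zero-width space
byte_ids = list(text.encode("utf-8"))     # every input maps to ids in 0..255
assert all(0 <= b < 256 for b in byte_ids)
print(len(byte_ids), "byte ids, no OOV handling needed")

d_model = 4096                            # hidden size of a ~7B model (assumed)
bytes_per_param = 2                       # fp16/bf16 weights
subword_vocab = 50_000                    # typical BPE vocabulary (assumed)
print("subword embedding:", subword_vocab * d_model * bytes_per_param / 1e6, "MB")
print("byte embedding:   ", 256 * d_model * bytes_per_param / 1e6, "MB")
```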
Limitations & Future Work
- Slight performance gap on some high‑level semantic benchmarks (e.g., entailment) where subword tokenization still offers a marginal edge.
- Distillation quality depends on the source model; errors or biases in the original subword LM can propagate to the byte model.
- Compression trade‑offs: Aggressive token‑compression improves speed but may degrade performance on very long‑range dependencies; finding the optimal ratio per task remains an open question.
- Future directions suggested by the authors include extending byteification to multimodal models, exploring mixed tokenization schemes (byte + subword hybrids), and applying the technique at even larger scales (≥ 30B parameters) to test scalability.
Authors
- Benjamin Minixhofer
- Tyler Murray
- Tomasz Limisiewicz
- Anna Korhonen
- Luke Zettlemoyer
- Noah A. Smith
- Edoardo M. Ponti
- Luca Soldaini
- Valentin Hofmann
Paper Information
- arXiv ID: 2512.15586v1
- Categories: cs.CL
- Published: December 17, 2025