Tokenizers: The Building Blocks of Generative AI
Originally published on Dev.to in 2023; republished here.
What are tokenizers?
Tokenizers are algorithms that split a given input into smaller units called tokens, which can be processed by a generative AI model. Tokens may be words, characters, subwords, or even pixels, depending on the data type and granularity.
The output of a tokenizer is a sequence of tokens, each represented by a unique numerical identifier (token ID). These IDs are fed into the model as input or used to decode the model’s output. For example, a text tokenizer might map the word “hello” to token ID 1234 and “world” to token ID 5678. Given the input sequence [1234, 5678], the model can then generate an output sequence such as [7890, 4321], which the same tokenizer decodes back into words.
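To make the round trip concrete, here is a minimal Python sketch. The vocabulary and the token IDs in it are invented for illustration, not drawn from any real model:

```python
# A toy vocabulary mapping words to token IDs (made up for illustration).
vocab = {"hello": 1234, "world": 5678, "good": 7890, "morning": 4321}
inverse_vocab = {token_id: word for word, token_id in vocab.items()}

def encode(text: str) -> list[int]:
    """Map each whitespace-separated word to its token ID."""
    return [vocab[word] for word in text.split()]

def decode(token_ids: list[int]) -> str:
    """Map token IDs back to words and rejoin them."""
    return " ".join(inverse_vocab[token_id] for token_id in token_ids)

print(encode("hello world"))   # [1234, 5678]
print(decode([7890, 4321]))    # "good morning"
```

A real tokenizer builds its vocabulary from a training corpus and must handle words it has never seen, which is exactly where the designs below differ.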
How do tokenizers work?
Character‑level tokenizers
These split the input into individual characters (letters, digits, punctuation, symbols). They are simple and flexible, but they produce long token sequences, and their small vocabulary means each token carries little meaning on its own.
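A character-level tokenizer fits in a few lines of Python; the tiny corpus here is a made-up example:

```python
# Build a character vocabulary from a (made-up) sample corpus.
corpus = "hello world"
vocab = {ch: i for i, ch in enumerate(sorted(set(corpus)))}
inverse_vocab = {i: ch for ch, i in vocab.items()}

def encode(text: str) -> list[int]:
    return [vocab[ch] for ch in text]

def decode(ids: list[int]) -> str:
    return "".join(inverse_vocab[i] for i in ids)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # "hello"
```

Note that even the short word “hello” already costs five tokens, which is why character-level sequences grow long.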
Word‑level tokenizers
These split the input into words based on whitespace and punctuation. They are intuitive and easy to understand, yet they struggle with out‑of‑vocabulary (OOV) words and misspellings, since any word missing from the vocabulary cannot be represented directly.
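A toy word-level tokenizer might look like the following sketch; the regular expression, vocabulary, and `<unk>` fallback are illustrative assumptions:

```python
import re

# Build a word vocabulary from a (made-up) corpus; ID 0 is reserved
# for the unknown-word token.
corpus = "hello , world !"
vocab = {"<unk>": 0}
for word in corpus.split():
    vocab.setdefault(word, len(vocab))

def tokenize(text: str) -> list[str]:
    # Keep words and punctuation marks as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def encode(text: str) -> list[int]:
    # Unknown (OOV) words fall back to the <unk> token.
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

print(tokenize("hello, world!"))   # ['hello', ',', 'world', '!']
print(encode("hello, universe!"))  # [1, 2, 0, 4]: 'universe' maps to <unk>
```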
Subword‑level tokenizers
These break the input into subwords, smaller units that capture common prefixes, suffixes, and stems. Subword tokenizers are more efficient and robust: they handle OOV and rare words by composing them from known pieces, though they can sometimes produce unnatural splits and ambiguity.
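One common strategy is greedy longest-match segmentation, in the same spirit as WordPiece. The sketch below uses an invented vocabulary and is not the exact algorithm of any particular library:

```python
def tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split a word into subwords by greedy longest-match against vocab."""
    tokens = []
    start = 0
    while start < len(word):
        # Find the longest vocabulary entry matching at this position.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No match: emit an unknown marker and skip one character.
            tokens.append("<unk>")
            start += 1
    return tokens

# Invented vocabulary; real ones are learned from a corpus (e.g. via BPE).
vocab = {"token", "izer", "s", "un", "believ", "able"}
print(tokenize("tokenizers", vocab))    # ['token', 'izer', 's']
print(tokenize("unbelievable", vocab))  # ['un', 'believ', 'able']
```

Notice how both words, even if never seen whole during training, are still representable as sequences of known pieces.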
Pixel‑level tokenizers
These split images into pixels, the smallest units of visual data. Pixel tokenizers are straightforward and universal but can result in high‑dimensional, noisy input representations.
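As a sketch, pixel-level tokenization of a grayscale image can treat each intensity value (0–255) as a token ID, giving a vocabulary of 256 entries; the 2x3 image below is made up for illustration:

```python
# A made-up 2x3 grayscale image; each value is a pixel intensity (0-255)
# used directly as a token ID.
image = [
    [0, 128, 255],
    [64, 192, 32],
]

def encode(image: list[list[int]]) -> list[int]:
    # Flatten the 2D grid row by row into a 1D token sequence.
    return [pixel for row in image for pixel in row]

def decode(tokens: list[int], width: int) -> list[list[int]]:
    # Reshape the token sequence back into rows of the given width.
    return [tokens[i:i + width] for i in range(0, len(tokens), width)]

tokens = encode(image)
print(tokens)                   # [0, 128, 255, 64, 192, 32]
print(decode(tokens, width=3))  # recovers the original 2x3 image
```

Even this tiny image yields six tokens, which hints at how quickly sequence length explodes for realistic image sizes.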
Why are tokenizers important for generative AI?
Tokenizers enable models to learn from and generate diverse, complex data. Their impact includes:
- Data representation – Determines how input and output are encoded, influencing the information and structure the model can capture.
- Data processing – Affects the speed and efficiency of computation and generation by shaping how data is processed and decoded (see the sketch after this list).
- Data quality – Influences the accuracy and diversity of the model’s output through the way data is split and mapped.
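As a quick illustration of the data-processing point, the same short sentence costs very different sequence lengths at different granularities (a deliberately naive comparison):

```python
text = "tokenizers build generative models"

char_tokens = list(text)      # character-level split
word_tokens = text.split()    # naive word-level split

print(len(char_tokens))  # 34 tokens
print(len(word_tokens))  # 4 tokens
```

Since model compute typically grows with sequence length, the choice of granularity directly affects cost.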
Thank you for reading this article—have fun with generative AI! 🤖