1. Vocabulary, Tokenization & Byte Pair Encoding
Is “New York” one word or two?
Is “running” the same as “run”?
Is “AI-generated” one token, two, or three?
You’re making a philosophical decision about what counts as meaning.
It all starts with vocabulary: a collection of the unique tokens a model knows.
That’s where tokenization comes in.
Tokenization: Breaking Text into Meaning
Tokenization is the process of splitting text into smaller chunks called tokens that the model can work with.
Early NLP systems used simple approaches like:
Whitespace tokenization: split on spaces (“I love NLP” → [I, love, NLP])
Word-level vocabularies: one token per word
But both approaches break as soon as the model sees something new: words like “loving” or “NLP” that aren’t in the vocabulary.
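A tiny sketch makes the problem concrete. With a closed word-level vocabulary (the one below is hypothetical, just for illustration), any unseen word collapses into a single unknown token and its meaning is lost:

```python
# Word-level tokenization with a closed vocabulary: anything outside
# the vocabulary maps to <unk>, no matter how meaningful the word is.
vocab = {"i", "love", "run", "running"}

def word_tokenize(text):
    """Whitespace tokenization backed by a fixed word-level vocabulary."""
    return [w if w in vocab else "<unk>" for w in text.lower().split()]

print(word_tokenize("I love running"))  # ['i', 'love', 'running']
print(word_tokenize("I love NLP"))      # ['i', 'love', '<unk>']
```

“NLP” was never seen during vocabulary building, so the model has no way to represent it.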
To fix that, NLP evolved toward subword tokenization.
Subword Tokenization
Subword tokenization finds a balance between word-level and character-level representations.
Instead of treating “playing,” “player,” and “playful” as totally different words, it breaks them into meaningful subunits like play + ing, or play + er.
This helps the model handle rare words while keeping the vocabulary manageable.
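One simple way to see the idea is greedy longest-match segmentation over a hand-picked subword vocabulary. The vocabulary below is illustrative; real tokenizers learn these units from data:

```python
# Greedy longest-match: at each position, take the longest known
# subword, falling back to single characters when nothing matches.
subwords = {"play", "ing", "er", "ful", "ish", "ness"}

def segment(word):
    """Split a word into the longest known subwords, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: keep it as-is
            i += 1
    return pieces

print(segment("playing"))         # ['play', 'ing']
print(segment("playfulishness"))  # ['play', 'ful', 'ish', 'ness']
```

Even the made-up word “playfulishness” decomposes cleanly into familiar pieces, which is exactly what keeps the vocabulary manageable.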
There are three major algorithms used in modern NLP:
Byte Pair Encoding (BPE)
WordPiece
Unigram
All three share a common goal: to make the model robust to unseen words, but they approach it differently.
Byte Pair Encoding (BPE)
BPE starts from individual characters, then repeatedly finds the most frequent adjacent pair of symbols and merges it into a new token, building subword units that balance vocabulary size and coverage.
So instead of memorizing every word, the model learns reusable parts: “play,” “playing,” and “player” all become combinations of known subwords like “play,” “ing,” and “er.”
That’s why even if you type something rare like “playfulishness,” your model won’t panic.
It just pieces together familiar tokens.
Over time, it learns common patterns like “ing,” “tion,” or “play.”
In simple terms, once a merge is chosen, it’s applied consistently throughout the corpus.
During preprocessing, each word is encoded as a sequence of these subwords
(for example: unbelievable → un + believ + able).
When decoding, the tokenizer simply reverses the process, joining subwords back into readable text.
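The merge loop itself fits in a few lines of Python. This is a minimal sketch of BPE training; the toy corpus (words as symbol tuples with frequencies) and the three-merge budget are illustrative, not a real training setup:

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols with a frequency.
corpus = Counter({
    ("p", "l", "a", "y"): 10,
    ("p", "l", "a", "y", "i", "n", "g"): 6,
    ("p", "l", "a", "y", "e", "r"): 5,
})

def most_frequent_pair(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(corpus, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):  # learn three merges: p+l, pl+a, pla+y
    pair = most_frequent_pair(corpus)
    corpus = merge(corpus, pair)
    print("merged:", pair)

print(list(corpus))  # all three words now start with the unit 'play'
```

After three merges, the frequent prefix “play” has become a single reusable token, and the rarer suffixes “ing” and “er” are still spelled out character by character, ready to be merged later if the vocabulary budget allows.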
Variants: WordPiece & Unigram
WordPiece (used in BERT) is similar to BPE but chooses merges based on maximizing likelihood under a language model rather than pure frequency.
Unigram (used in SentencePiece) starts with a large vocabulary and gradually removes tokens that contribute least to the model’s likelihood.
Each method balances vocabulary size, efficiency, and language coverage differently.
A vocabulary of tokens is only the starting point. The next step is language modeling, which asks:
- how likely a token is
- how tokens depend on context
- how to generate sequences that make sense
Thanks for reading and learning along with me. Bye!