1. Vocabulary, Tokenization & Byte Pair Encoding

This is the beginning of my NLP 101 series, where I break down core concepts in a simple, intuitive way, both as a way to practice and understand the concepts better myself and to share them with the community.

Let's begin by understanding one of the most fundamental parts of language processing: how machines read text.

Humans understand “words” intuitively, but when we try to teach a machine, we have to define them.
Is “New York” one word or two?
Is “running” the same as “run”?
Is “AI-generated” one token, two, or three?
When you answer these questions, you’re making a philosophical decision about what counts as a unit of meaning.

But let's start from the basics.
It all starts with vocabulary: a collection of the unique tokens a model knows.
But here’s the catch: language is infinite.
New words appear, typos happen, names vary.
So the question becomes - how do we make a finite model understand an infinite language?

That’s where tokenization comes in.


Tokenization: Breaking Text into Meaning

Tokenization is the process of splitting text into smaller chunks called tokens that the model can work with.

Early NLP systems used simple approaches like:

  • Whitespace tokenization: split on spaces (“I love NLP” → [I, love, NLP])

  • Word-level vocabularies: one token per word

But both approaches break down as soon as they encounter something that isn’t in the vocabulary, like “loving” or a new acronym.
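Here's a tiny sketch of whitespace tokenization and the out-of-vocabulary (OOV) problem it runs into (the three-word vocabulary below is just a toy example):

```python
# Toy vocabulary: the only words this "model" knows.
vocab = {"i", "love", "nlp"}

def whitespace_tokenize(text):
    """Split on spaces after lowercasing — the simplest possible tokenizer."""
    return text.lower().split()

tokens = whitespace_tokenize("I love NLP")
print(tokens)  # ['i', 'love', 'nlp']

# A new word form immediately breaks the vocabulary lookup:
for tok in whitespace_tokenize("I am loving NLP"):
    print(tok, "known" if tok in vocab else "OOV -> <unk>")
```

Every unknown word collapses into a single `<unk>` token, losing its meaning entirely.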

To fix that, NLP evolved toward subword tokenization.


Subword Tokenization

Subword tokenization finds a balance between word-level and character-level representations.

Instead of treating “playing,” “player,” and “playful” as totally different words, it breaks them into meaningful subunits like play + ing, or play + er.

This helps the model handle rare words while keeping the vocabulary manageable.

There are three major algorithms used in modern NLP:

  • Byte Pair Encoding (BPE)

  • WordPiece

  • Unigram

All three share a common goal: to make the model robust to unseen words, but they approach it differently.


Byte Pair Encoding (BPE)

BPE repeatedly finds the most frequent pair of adjacent symbols in a corpus and merges it, iteratively building subword units that balance vocabulary size and coverage.

For example:

l o w e r
→ lo w e r
→ low e r
→ low er
→ lower

So instead of memorizing every word, the model learns reusable parts, and words like “play,” “playing,” and “player” all become combinations of known subwords.

That’s why even if you type something rare like “playfulishness,” your model won’t panic.
It just pieces together familiar tokens.

Over time, it learns common patterns like “ing,” “tion,” or “play.”

This results in a vocabulary that captures both frequent words and useful subword units.

It’s a greedy, deterministic algorithm.
In simple terms, once a merge is chosen, it’s applied consistently throughout the corpus.
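The greedy merge loop can be sketched in a few lines of Python. This is a minimal, illustrative version (the toy corpus of `low`, `lower`, `newest`, `widest` is the classic example from the original BPE paper; nothing here is production code):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol.
    (Naive string replace is fine for this toy corpus.)"""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Words stored as space-separated characters, with their corpus frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(10):  # the number of merges is the knob that sets vocab size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # greedy: always take the most frequent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges[:4])  # e.g. [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

Notice how “est” emerges early because it is shared by “newest” and “widest”: frequency, not linguistics, drives the merges.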

During preprocessing, each word is encoded as a sequence of these subwords
(for example: unbelievable → un + believ + able).

When decoding, the tokenizer simply reverses the process, joining subwords back into readable text.
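A hedged sketch of what encoding and decoding look like at inference time, assuming a learned merge list (the merges below are toy values, not from a real tokenizer):

```python
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

def encode(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge in place; recheck same index
            else:
                i += 1
    return symbols

def decode(symbols):
    """Decoding just concatenates the subwords back together."""
    return "".join(symbols)

print(encode("lowest", merges))  # ['low', 'est']
print(decode(encode("lowest", merges)))  # 'lowest'
```

“lowest” was never seen as a whole word, yet it encodes cleanly into `low` + `est`, and decoding is lossless by construction.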


Variants: WordPiece & Unigram

  • WordPiece (used in BERT) is similar to BPE but chooses merges based on maximizing likelihood under a language model rather than pure frequency.

  • Unigram (used in SentencePiece) starts with a large vocabulary and gradually removes tokens that contribute least to the model’s likelihood.

Each method balances vocabulary size, efficiency, and language coverage differently.



After Tokenization: Learning Language

Once tokenization defines how text is represented, language models learn:
  • how likely a token is
  • how tokens depend on context
  • how to generate sequences that make sense
Earlier models, like n-grams and feedforward networks, relied on strict independence assumptions. They only saw a limited window of previous words. This caused data sparsity problems (not enough examples of rare sequences).
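The sparsity problem is easy to see with a toy bigram model (the nine-word corpus below is made up purely for illustration):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))    # 2/3: "the cat" follows 2 of the 3 "the"s
print(bigram_prob("cat", "slept"))  # 0.0: never seen, so the model assigns zero
```

A perfectly plausible sentence gets probability zero just because one bigram never appeared in training, which is exactly the gap that smoothing (next post) is designed to fill.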

Later, RNNs and Transformers solved this by capturing long-range dependencies, meaning they generate words based on context, not just isolated statistics.

But more on that in the upcoming weeks!
Next up: Data Sparsity, Smoothing, Discounting, Interpolation, Evaluating LMs

Thanks for reading and learning along with me. Bye! 
