2. Why Early Language Models Failed: Data Sparsity and the Classical Fixes
In this blog, we’ll discuss data sparsity, smoothing, discounting, backoff, interpolation, and autoregressive generation: the core ideas that allowed early language models to function long before neural networks existed.

Last week, we talked about how text becomes tokens. Now that we know how a model sees text, it’s time to talk about one of the oldest, deepest problems in NLP: what happens when a model tries to predict something it has never seen before? This issue is called data sparsity, and understanding it is essential to understanding why modern neural models replaced older statistical ones.

Before deep learning, NLP relied heavily on n-gram models. These were simple, count-based statistical models that estimated the probability of the next word by looking at how frequently words appeared together in the training data. But even these simple models ran into a surprisingly difficult problem very quickly.

Data Sparsity

Imagine usin...