2. Why Early Language Models Failed: Data Sparsity and the Classical Fixes

In this post, we'll discuss data sparsity, smoothing, discounting, backoff, interpolation, and autoregressive generation: the core ideas that allowed early language models to function long before neural networks existed.

Last week, we talked about how text becomes tokens. Now that we know how a model sees text, it’s time to talk about one of the oldest, deepest problems in NLP: 

What happens when a model tries to predict something it has never seen before?

This issue is called data sparsity, and understanding it is essential to understanding why modern neural models replaced older statistical ones. 

Before deep learning, NLP relied heavily on n-gram models. 

These were simple, count-based statistical models that estimated the probability of the next word by looking at how frequently words appeared together in the training data. 

But even these simple models ran into a surprisingly difficult problem very quickly.


Data Sparsity

Imagine using n-grams to model language. 

A unigram model predicts words without any context.
A bigram model predicts the next word based on the previous one. 
A trigram model predicts the next word based on the previous two. 

In theory, increasing n should give you more context and better predictions.
In practice, expanding context makes the data sparsity problem exponentially worse. 

Language is vast. Even with massive corpora, most possible word combinations never appear in training. 

For example, you might see the sentence “I want Thai food tonight” many times, but never see “I want Ethiopian food tonight,” even though the second is perfectly natural English. 

An n-gram model that has never seen “Ethiopian food” simply assigns it a probability of 0. That means the entire sentence containing that phrase also gets a probability of 0. 

Any model that assigns zeros becomes unusable for generation, scoring, or prediction. If one piece in the chain has zero probability, the whole sequence collapses.
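To see the collapse concretely, here is a minimal sketch of a count-based bigram model on a made-up three-sentence corpus. The corpus, the `<s>` start marker, and all counts are illustrative only:

```python
from collections import Counter

# Toy corpus; "<s>" marks sentence start. Counts are illustrative only.
corpus = [
    "<s> i want thai food tonight".split(),
    "<s> i want thai food tonight".split(),
    "<s> i want chinese food tonight".split(),
]

bigrams = Counter()
contexts = Counter()  # counts of each word used as a context (left side of a bigram)
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[(w1, w2)] += 1
        contexts[w1] += 1

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    if contexts[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / contexts[w1]

def sentence_prob(sent):
    p = 1.0
    for w1, w2 in zip(sent, sent[1:]):
        p *= bigram_prob(w1, w2)  # one unseen bigram zeroes the whole product
    return p

seen = "<s> i want thai food tonight".split()
unseen = "<s> i want ethiopian food tonight".split()
print(sentence_prob(seen))    # positive
print(sentence_prob(unseen))  # 0.0: ("want", "ethiopian") never occurred
```

One zero factor is enough: the unseen sentence scores exactly 0 no matter how natural the rest of it is.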

This simple fact, that most linguistically valid sequences never appear in training, has shaped decades of research.



1. Markov’s Independence Assumption

The first major idea to simplify language modeling was the Markov assumption. It states that the future depends only on a limited window of the past.

For language models, this means we assume the next word depends only on the previous one (bigram) or previous two (trigram), rather than the entire sentence history. 

This drastically reduces the number of parameters the model must estimate.

For a trigram model, instead of estimating the probability of a word given the entire preceding sentence, we only care about P(w3 | w1, w2).

In general, we only care about the probability of the n-th word given the previous n − 1 words.

This makes the problem computationally manageable, but it does nothing to solve the sparsity problem. Even with this simplified assumption, the number of possible trigram combinations is enormous, and most of them will not appear in training data.
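The scale of the problem is easy to check with a little arithmetic. Assuming a hypothetical vocabulary of 20,000 word types (a made-up but realistic figure), the number of possible n-grams grows exponentially with n:

```python
# Number of distinct n-grams a model must, in principle, estimate,
# for a hypothetical vocabulary of 20,000 word types.
V = 20_000
for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    print(f"{name}: {V**n:,} possible n-grams")
```

Even a huge training corpus contains only a tiny fraction of 8 trillion possible trigrams, so most counts are necessarily zero.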

The Berkeley Restaurant Project, a classic NLP dataset of restaurant-related queries, illustrates this clearly. 

Even in a tiny domain with only a few hundred sentences and a fairly limited vocabulary, most possible combinations of words still do not appear. 

This is when researchers realized that having a mathematical definition of context is not enough. We needed a way for the model to handle things it had never seen.

This leads naturally to the question: how do we prevent zero probabilities?



2. Smoothing: The First Fix

The earliest and simplest attempt to fix the sparsity problem was smoothing. 

The core idea behind smoothing is straightforward: if an n-gram never appears in training, assign it a small, non-zero probability. 

The first and most well-known method is Add-1 smoothing, also called Laplace smoothing. The idea is exactly what it sounds like: add 1 to every count, including counts that were originally zero.

If the bigram (“Ethiopian”, “food”) never appears, Add-1 smoothing pretends it occurred once. This eliminates zero probabilities and allows the model to produce nonzero outputs for unseen combinations.

However, Add-1 smoothing has major problems:

  • It gives too much probability to events that never actually occur. 

  • Because you are adding 1 to everything, you must reduce the probabilities of frequently seen events to make the distribution sum to 1. This distorts the distribution. 

  • With a large vocabulary, you end up adding 1 to millions or billions of possible n-gram combinations, which makes all the numbers unrealistic. 

  • It ignores the fact that some unseen combinations are more plausible than others.

Because of these limitations, Add-1 smoothing is mostly used for teaching purposes today. It is conceptually simple but performs poorly in practice.
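A minimal sketch makes both the fix and the distortion visible. The counts and the vocabulary size below are invented for illustration:

```python
from collections import Counter

def laplace_bigram_prob(w1, w2, bigrams, unigrams, vocab_size):
    """Add-1 (Laplace) estimate: (count(w1 w2) + 1) / (count(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

# Toy counts: "want thai" seen 8 times, "want chinese" twice.
bigrams = Counter({("want", "thai"): 8, ("want", "chinese"): 2})
unigrams = Counter({"want": 10})
V = 1000  # hypothetical vocabulary size

mle = bigrams[("want", "thai")] / unigrams["want"]                       # 0.8
smoothed = laplace_bigram_prob("want", "thai", bigrams, unigrams, V)     # 9/1010
unseen = laplace_bigram_prob("want", "ethiopian", bigrams, unigrams, V)  # 1/1010
print(mle, smoothed, unseen)
```

The unseen bigram now gets a small nonzero probability, but notice the side effect: the probability of the very common "want thai" collapses from 0.8 to roughly 0.009, because the vocabulary-sized denominator steals mass from everything that was actually observed.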

Smoothing helped remove zeros, but it introduced its own problems, which led researchers to better ideas.



3. Discounting: A Smarter Fix

Smoothing adds probability to unseen events by artificially increasing their counts. Discounting takes the opposite approach. 

Instead of adding fake counts, discounting slightly reduces the counts of observed events and uses the leftover probability mass to assign non-zero probability to unseen events. 

In other words, you “discount” the probability of events you have seen so that you can redistribute it to events you have not.

This approach is more grounded in real language behavior. Common n-grams remain common, but unseen n-grams still get some probability. 

There are several discounting methods, but two of the most influential are Absolute Discounting and Good-Turing Discounting.

  • Absolute Discounting subtracts a small constant from each count. If a trigram appears 12 times, you reduce it to something like 11.25. The saved probability is then redistributed to unseen trigrams. 

  • Good-Turing Discounting is even more elegant. It looks at the frequency of singleton events (events that occur exactly once) to estimate how much probability should be allocated to unseen events. 

The reasoning is intuitive: if your dataset contains many events that occur once, it is likely that many more legitimate events are occurring zero times simply due to limited data. 

Good-Turing Discounting was used extensively in classical speech recognition systems.

This method addresses the sparsity issue much more coherently than Add-1 smoothing.
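Here is a simplified sketch of absolute discounting on toy counts. Real systems redistribute the saved mass via a lower-order distribution (as in Kneser-Ney); for illustration this version spreads it uniformly over unseen continuations, and all counts and the vocabulary are made up:

```python
from collections import Counter

def abs_discount(w1, w2, bigrams, unigrams, vocab, d=0.75):
    """Absolute discounting, simplified: subtract a constant d from each
    seen bigram count and share the saved mass uniformly among unseen
    continuations of w1. (Kneser-Ney would back off to a continuation
    distribution instead of a uniform one.)"""
    c = bigrams[(w1, w2)]
    n1 = unigrams[w1]
    seen_types = {b for (a, b) in bigrams if a == w1}
    if c > 0:
        return (c - d) / n1
    reserved = d * len(seen_types) / n1          # mass saved by discounting
    unseen_types = len(vocab) - len(seen_types)  # continuations never observed
    return reserved / unseen_types

bigrams = Counter({("want", "thai"): 8, ("want", "chinese"): 2})
unigrams = Counter({"want": 10})
vocab = {"thai", "chinese", "ethiopian", "food", "want"}

print(abs_discount("want", "thai", bigrams, unigrams, vocab))      # (8 - 0.75) / 10
print(abs_discount("want", "ethiopian", bigrams, unigrams, vocab)) # share of saved mass
```

Unlike Add-1, the common bigram barely moves (0.8 becomes 0.725), yet the distribution over all continuations still sums to 1.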



4. Backoff

Discounting improves probability estimates but still doesn’t fully solve the problem when higher-order n-grams are missing completely. 

Backoff provides an intuitive fix. The idea is simple: if you do not have enough evidence for a high-order n-gram, use a lower-order one.

If a trigram is unseen, fall back to the bigram.
If the bigram is unseen, fall back to the unigram.
If the unigram is unseen, use a global fallback probability.

By moving gradually to less specific models, you avoid zero probabilities and rely on a broader context when a specific context is missing.

The most famous version of backoff is Stupid Backoff, introduced by researchers at Google for the enormous n-gram models behind large-scale machine translation.

Stupid Backoff is called “stupid” because it ignores probability theory entirely. 

It simply checks for an n-gram’s presence and, if missing, backs off to a lower n-gram multiplied by a constant factor. 

Despite its simplicity, Stupid Backoff works remarkably well on massive datasets because raw frequency information becomes extremely reliable at scale.
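The whole scheme fits in a few lines. This sketch uses the 0.4 back-off factor from the original paper and a made-up eight-word corpus:

```python
from collections import Counter

ALPHA = 0.4  # back-off factor used in the original Stupid Backoff paper

def stupid_backoff(words, counts, total, alpha=ALPHA):
    """Relative-frequency score (not a true probability) of the last
    word in `words` given the preceding ones, backing off as needed."""
    if len(words) == 1:
        return counts[words] / total          # unigram relative frequency
    if counts[words] > 0:
        return counts[words] / counts[words[:-1]]
    return alpha * stupid_backoff(words[1:], counts, total, alpha)

# Toy corpus: count every n-gram up to order 3.
corpus = "i want thai food i want chinese food".split()
counts = Counter()
for n in (1, 2, 3):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i:i + n])] += 1
total = len(corpus)

print(stupid_backoff(("want", "thai"), counts, total))  # seen bigram: 1/2
print(stupid_backoff(("want", "food"), counts, total))  # unseen: 0.4 * count(food)/total
```

Note that the scores are not normalized probabilities; at web scale that turns out not to matter much.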



5. Interpolation

While backoff chooses only one n-gram level (trigram, or bigram, or unigram), interpolation combines them. This is one of the most elegant classical solutions. 

Instead of backing off completely from trigram to bigram when data is missing, interpolation mixes all the available evidence by using a weighted sum of probabilities from different n-gram levels.

A trigram model would be computed as:

P̂(w3 | w1, w2) = λ3 P(w3 | w1, w2) + λ2 P(w3 | w2) + λ1 P(w3)

The weights λ1, λ2, and λ3 sum to 1, and each represents how much confidence the model places in each level. 

When counts for a trigram are high, λ3 is dominant. 
When trigram data is unreliable or sparse, lower-order n-grams contribute more.

Unlike backoff, interpolation never throws away information.

This reduces variance in the predictions and produces smoother, more robust probability distributions. 
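The weighted sum is trivial to sketch. The lambda values below are illustrative; in practice they are tuned on held-out data (for example with EM) and must sum to 1:

```python
def interpolated_trigram(p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    """Mix trigram, bigram, and unigram estimates with fixed weights.
    The weights here are illustrative; real systems tune them on
    held-out data, and they must sum to 1."""
    l3, l2, l1 = lambdas
    return l3 * p3 + l2 * p2 + l1 * p1

# The trigram is unseen (p3 = 0), but bigram and unigram evidence
# keep the interpolated estimate nonzero.
print(interpolated_trigram(0.0, 0.2, 0.01))  # 0.6*0 + 0.3*0.2 + 0.1*0.01 = 0.061
```

Even with zero trigram evidence, nothing collapses: the lower orders always contribute.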

Interpolation became a core technique in the large-scale n-gram models later released by Microsoft and Google, which blended information from many different n-gram orders.



6. Autoregressive Generation

Once a model can assign non-zero probabilities to sequences, it can generate language autoregressively. The idea is simple. To generate text:

  • Use the current context to predict the next token.

  • Append that token to the sequence.

  • Use the updated sequence as context for the next prediction.

  • Repeat the process.
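The loop above can be sketched with a tiny count-based bigram model standing in for the predictor; the two-sentence corpus and the `<s>`/`</s>` boundary markers are made up for illustration:

```python
import random
from collections import Counter, defaultdict

random.seed(0)  # make the sampled output repeatable

# A bigram "model" learned by counting; any next-token model works here.
corpus = "<s> i want thai food </s> <s> i want chinese food </s>".split()
nexts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    nexts[w1][w2] += 1

def generate(start="<s>", max_len=10):
    out = [start]
    while out[-1] != "</s>" and len(out) < max_len:
        candidates = nexts[out[-1]]
        words = list(candidates)
        weights = list(candidates.values())
        # Predict the next token from the current context, append it,
        # and repeat with the updated sequence as the new context.
        out.append(random.choices(words, weights=weights)[0])
    return out

print(" ".join(generate()))
```

Swap the counting model for a neural network that outputs the same kind of next-token distribution and this loop is, in essence, how a transformer generates text.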

N-gram language models generated text this way, and modern transformers still do. 

The difference is that transformers learn their probability distributions through neural networks rather than raw counts. But the fundamental process of generation is still autoregressive.

This step-by-step prediction is why models like GPT can continue writing indefinitely based on the outputs they generate.



Summary of What Was Covered

In this part, we explored the foundational techniques that allowed early language models to function despite data sparsity.

We examined why n-gram models struggle with unseen word combinations, how the Markov assumption simplifies modeling but does not eliminate sparsity, and why early smoothing techniques like Add-1 smoothing were inadequate. 

We then looked at more effective solutions, such as discounting, which redistributes probability mass more realistically, and backoff, which switches to simpler models when data is lacking. 

Interpolation, which blends information from multiple n-gram orders, remains one of the most elegant and powerful classical techniques. 

Finally, we discussed autoregressive generation, which forms the backbone of both classical and modern language models.

These ideas provided the foundation for language modeling before neural networks, and they explain why the field eventually moved toward distributed representations and deep learning.


Next up: Evaluation metrics, text classification, and the basics of logistic regression.

Thanks for reading and learning along with me. See you next week! 
