3. How Do We Know a Language Model Is Any Good?

So far in this series, we’ve talked about how text becomes tokens and why early language models struggled with data sparsity. Now comes the obvious next question:

How do we actually know whether a language model is good or bad?

This turns out to be much harder than it sounds.

Unlike image classification, where a model either labels an image correctly or it doesn’t, language models don’t usually have one “right” answer. For most inputs, there are many perfectly reasonable continuations.

So evaluation in NLP is less about absolute correctness and more about how surprised the model is by real language and how useful it is in practice.


Intrinsic vs Extrinsic Evaluation

Broadly, there are two ways to evaluate language models.

Intrinsic evaluation measures the quality of the language model directly.
You ask: How well does the model predict real text?

Extrinsic evaluation measures performance on a downstream task.
You ask: Does this model help me do something useful, like classification, translation, or search?

Intrinsic evaluation is faster and cheaper.
Extrinsic evaluation is slower and more expensive, but closer to real-world usefulness.

Both matter, and they often tell different stories.


The Core Intuition: Surprise

At the heart of language model evaluation is a simple idea:

A good language model should not be surprised by real language.

If a model consistently assigns low probability to text that humans find completely normal, that’s a bad sign. If it confidently predicts what actually comes next, that’s a good sign.

So evaluation often boils down to measuring how confused or surprised the model is when reading real text.
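This notion of surprise has a standard information-theoretic measure: surprisal, the negative log of the probability the model assigned to what actually happened. A minimal sketch (the probabilities here are made-up illustrative values, not output from any real model):

```python
import math

def surprisal(prob):
    """Surprisal in bits: how 'surprised' the model is by an outcome
    it assigned probability `prob`. Rare events carry more bits."""
    return -math.log2(prob)

# A confident, correct prediction carries little surprise...
print(surprisal(0.9))    # ≈ 0.15 bits
# ...while an event the model thought near-impossible carries a lot.
print(surprisal(0.001))  # ≈ 9.97 bits
```

Averaging this quantity over a test set is exactly what the intrinsic metrics below are built on.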


Perplexity

Historically, the most common intrinsic evaluation metric for language models has been perplexity.

Think of perplexity as answering this question:

On average, how many reasonable choices does the model think it has for the next word?

If a model is very uncertain, it’s like saying:
“I have no idea what comes next; it could be one of many things.”

If a model is confident, it’s like saying:
“I pretty much expected this.”

Lower perplexity means:

  • Fewer guesses

  • Less surprise

  • Better language modeling

Higher perplexity means:

  • More guessing

  • More confusion

  • Worse modeling of language

A model that has learned useful patterns will have lower perplexity on real text.
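Concretely, perplexity is the exponentiated average negative log-probability the model assigns to each true next token. A minimal sketch, assuming we already have the model’s probability for each correct token in a test sequence (the list below is a hypothetical example, not real model output):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability.
    Intuitively: the model's effective number of equally likely
    guesses per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that always gives the true next token probability 0.25
# behaves as if it were choosing uniformly among 4 options:
print(round(perplexity([0.25, 0.25, 0.25]), 2))  # 4.0
```

This is why perplexity reads as “average number of reasonable choices”: assigning every true token probability 1/k yields a perplexity of exactly k.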


Perplexity: Pros and Cons

Perplexity is attractive because it’s:

  • Fast to compute

  • Easy to compare during training

  • Directly tied to next-token prediction

But it comes with serious caveats.

You cannot fairly compare perplexity across models if:

  • They use different vocabularies

  • They use different tokenization methods

  • They were trained on different data

  • They handle unknown words differently

A model using character tokens and a model using subwords are not even speaking the same representational language. Comparing their perplexities directly is meaningless.

This is one reason many papers that claim “better perplexity” are fundamentally flawed. The numbers look scientific, but the comparison itself is invalid.

Perplexity is useful within the same experimental setup, not across arbitrary models.
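One common workaround when tokenizations differ is to normalise by something tokenizer-independent, such as the number of characters in the test text, giving bits-per-character instead of per-token perplexity. A rough sketch of that idea (assuming both models scored the exact same test string, with hypothetical probability lists):

```python
import math

def bits_per_char(token_probs, num_chars):
    """Total surprisal in bits divided by character count.
    Because characters, unlike tokens, are the same for every model,
    this puts differently tokenized models on a roughly common footing."""
    total_bits = -sum(math.log2(p) for p in token_probs)
    return total_bits / num_chars

# Same 4-character string, two hypothetical tokenizations:
print(bits_per_char([0.5, 0.5], 4))        # two subword tokens
print(bits_per_char([0.8, 0.8, 0.8, 0.8], 4))  # four character tokens
```

Even this only makes the comparison roughly fair; differences in training data and unknown-word handling still apply.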


When Lower Perplexity Still Doesn’t Help

Even more importantly, lower perplexity does not guarantee better performance on real tasks.

A model can be very good at predicting the next word and still be terrible at:

  • Sentiment analysis

  • Spam detection

  • Topic classification

  • Question answering

This is where extrinsic evaluation becomes essential.


Extrinsic Evaluation: Does It Actually Work?

Extrinsic evaluation asks a different question:

If I use this language model as part of a system, does the system perform better?

For example:

  • Does sentiment classification accuracy improve?

  • Does translation quality improve?

  • Does retrieval ranking improve?

Here, we care about metrics like:

  • Accuracy

  • Precision and recall

  • F1 score

  • Task-specific benchmarks

These evaluations are slower and more expensive, but they reflect real-world impact. A model that slightly worsens perplexity but dramatically improves task performance may still be the better model.
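The classification metrics above are straightforward to compute from predicted and true labels. A minimal sketch for binary precision, recall, and F1 (the label lists are toy examples for illustration):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class.
    Precision: of the items we flagged, how many were right?
    Recall: of the items we should have flagged, how many did we catch?"""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# One false positive and one false negative out of four examples:
print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))  # (0.5, 0.5, 0.5)
```

In practice you would use a library implementation, but the arithmetic really is this simple; what makes extrinsic evaluation expensive is building and running the downstream system, not scoring it.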


Why Evaluation Is Hard in Practice

Evaluating language models is messy because language itself is messy.

Small choices matter:

  • Vocabulary size

  • Tokenization strategy

  • Training data source

  • Dataset splits

  • Domain mismatch between training and testing

Even two models trained on “the same data” can behave very differently if preprocessing differs.

This is why responsible evaluation requires transparency and careful experimental control, and why results should always be interpreted with skepticism.


So… What Does “Good” Really Mean?

There is no single number that tells you whether a language model is good.

A good language model is one that:

  • Is not overly surprised by real text

  • Generalizes beyond what it has seen

  • Helps downstream tasks perform better

  • Is evaluated fairly and honestly

Perplexity gives us a window into model behavior.
Downstream tasks tell us whether that behavior is useful.

Both are necessary. Neither is sufficient alone.


Now that we know how to evaluate language models, we’re ready to move beyond generation and into decision-making.

Next up, we’ll shift from predicting words to predicting labels:

  • Text classification

  • Binary decisions

  • Probabilistic outputs

  • Why logistic regression became the workhorse of early NLP

Thanks for reading and learning along with me. See you next week!
