6. What Is the Model Actually Looking At?
In the last blog, we talked about how probabilities turn into decisions.
Thresholds. Trade-offs. Metrics.
But there’s something deeper hiding underneath all of that.
We need to ask: What is the model actually seeing?
Because models don’t really see text. They see numbers. And the way we turn language into numbers determines what kinds of mistakes are even possible.
Binary classification is easy.
But multiclass changes the geometry.
Instead of:
- True positive
- False positive
- True negative
- False negative
We now have a k × k confusion matrix.
In a multiclass confusion matrix, no class is literally labeled positive or negative; that framing is an interpretation we apply one class at a time in order to compute class-wise metrics.
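To make that concrete, here's a minimal sketch of building a k × k confusion matrix in plain Python (the labels and predictions are invented for illustration):

```python
from collections import Counter

# Toy labels for a 3-class problem; class names are illustrative.
classes = ["positive", "negative", "meh"]
y_true = ["positive", "negative", "meh", "positive", "meh", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "negative", "negative"]

# Count each (true, predicted) pair, then lay them out as a k x k grid:
# rows = true class, columns = predicted class.
pair_counts = Counter(zip(y_true, y_pred))
matrix = [[pair_counts[(t, p)] for p in classes] for t in classes]

for name, row in zip(classes, matrix):
    print(f"{name:>8}: {row}")
```

Notice how both "meh" examples land in the "negative" column: the off-diagonal cells are exactly where the interesting questions live.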
Now, beyond asking “How often were we right?”, we should also ask:

- Which classes are being confused?
- Are rare classes being ignored?
- Are two labels systematically collapsing into one?
- Is the model learning boundaries or just frequency?
Accuracy becomes less diagnostic. Because accuracy hides where we are wrong. And in multiclass settings, where we are wrong matters more than how often.
Consider this:
- 270 examples overall
- 135 positive
- 100 negative
- 35 “meh”
If the model performs well on positive and negative but struggles on “meh,” overall accuracy may still look strong. But if “meh” is the label we care about, we are failing.
This is why macro vs micro averaging exists.
Neither is objectively correct; each encodes a value judgment.
Evaluation reflects priorities, and it's not neutral.
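To make that value judgment concrete, here's a small sketch using the 270/135/100/35 scenario above, under the assumption that every “meh” example gets misclassified as negative:

```python
# The 270-example scenario: right on "positive" and "negative",
# wrong on every "meh" (the split and failure mode are illustrative).
y_true = ["positive"] * 135 + ["negative"] * 100 + ["meh"] * 35
y_pred = ["positive"] * 135 + ["negative"] * 100 + ["negative"] * 35

# Micro-style view: pool all decisions (for single-label data this
# is just accuracy), so big classes dominate.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(cls):
    hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    return hits / y_true.count(cls)

# Macro view: average per-class recalls, so each class counts equally
# and the completely-failed "meh" class drags the score down.
macro_recall = sum(recall(c) for c in ("positive", "negative", "meh")) / 3

print(f"accuracy:     {accuracy:.3f}")      # 235/270 ≈ 0.870
print(f"macro recall: {macro_recall:.3f}")  # (1 + 1 + 0) / 3 ≈ 0.667
```

Same predictions, two very different numbers. Which one you report is a statement about which classes you think matter.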
Even deeper than metrics is representation.
Before a model can misclassify a word, before it can confuse classes, before it can overfit, it has to represent text numerically. And how we do that changes everything.
Early NLP was brutally simple.
Take a vocabulary. Count how often each word appears.
Each document becomes a vector of counts. This gives us a term-document matrix.
High-dimensional. Sparse. Mostly zeros.
This representation captures frequency. But not meaning.
The word “good” might appear everywhere. But it doesn’t tell us why it appears.
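The counting step described above can be sketched in a few lines of Python (the corpus and vocabulary are a made-up toy):

```python
from collections import Counter

# A tiny illustrative corpus: three short "documents".
docs = [
    "good movie good plot",
    "bad movie bad acting",
    "good acting",
]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix: one vector of word counts per document.
# Counter returns 0 for words the document doesn't contain,
# which is why the result is mostly zeros (sparse).
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
for row in matrix:
    print(row)
```

Even on three tiny documents, most entries are zero; with a realistic vocabulary of tens of thousands of words, the sparsity is extreme.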
Frequency alone is crude. So we improved it.
TF-IDF asks: Is a word that appears in every document actually informative?
The answer is usually no.
If a word appears in every document:
IDF ≈ log(1) = 0.
It contributes nothing. TF-IDF doesn’t understand meaning.
But it begins to capture informativeness.
It says: Not all words are equally useful.
That’s a step toward structure.
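Here's a quick sketch of that IDF behavior, on a toy corpus constructed so that “movie” appears in every document:

```python
import math

# Toy corpus; "movie" is in every document by construction.
docs = [
    "good movie",
    "bad movie",
    "boring movie plot",
]
n_docs = len(docs)

def idf(word):
    # Document frequency: in how many documents does the word appear?
    df = sum(word in d.split() for d in docs)
    return math.log(n_docs / df)

print(idf("movie"))  # log(3/3) = 0.0 -> contributes nothing
print(idf("plot"))   # log(3/1) ≈ 1.099 -> distinctive
```

The ubiquitous word is weighted to zero; the rare, distinctive word gets the weight. That's the whole trick.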
Pointwise Mutual Information (PMI) goes further.
It asks: Do two words occur together more often than chance?
Formally: PMI(w, d) = log( P(w, d) / (P(w) P(d)) )
The idea is straightforward.
The denominator, P(w) P(d), represents what we would expect if the word and the document were independent.
In other words, if nothing special is happening, how often should they co-occur?
The numerator, P(w, d), is what we actually observe.
So PMI compares: What happened vs What randomness would predict
If the ratio is greater than 1, the word appears more often than expected. PMI becomes positive. If the ratio equals 1, the word behaves independently. PMI is zero. If the ratio is less than 1, the word appears less often than expected. PMI becomes negative.
The log smooths the scale. It makes interpretation cleaner. It turns multiplicative relationships into additive ones. So PMI is essentially measuring deviation from independence.
Then comes the key idea. A word is characterized by the company it keeps.
If two words appear in similar contexts, they probably mean similar things.
This is the distributional hypothesis.
Instead of defining meaning explicitly, we infer it from usage patterns.
Words become vectors. Location in space becomes meaning. Similarity becomes geometric.
Now:
- “king” - “man” + “woman” ≈ “queen”
- “dog” is close to “puppy”
- “doctor” is closer to “hospital” than to “mountain”
The geometry reflects relationships. Meaning becomes distance.
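A sketch of that geometry using cosine similarity (the 3-dimensional vectors here are hand-invented stand-ins, not learned embeddings, which typically have hundreds of dimensions):

```python
import math

# Hand-picked toy vectors, purely for illustration.
vecs = {
    "dog":      [0.90, 0.80, 0.10],
    "puppy":    [0.85, 0.75, 0.20],
    "mountain": [0.10, 0.20, 0.90],
}

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Similarity becomes geometric: "dog" sits near "puppy", far from "mountain".
print(cosine(vecs["dog"], vecs["puppy"]))     # high
print(cosine(vecs["dog"], vecs["mountain"]))  # low
```

In a real embedding space the vectors are learned, but the measurement is exactly this: meaning as angle and distance.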
There are two broad types of embeddings.

Sparse representations:

- Very high-dimensional
- Mostly zeros
- Interpretable
- Based on surface statistics

Dense embeddings:

- Low-dimensional (100-1000)
- Continuous values
- Capture semantic similarity
- Harder to interpret directly
Sparse representations tell us: “What words occurred?”
Dense representations tell us: “What relationships exist?”
This shift changed NLP. But it didn’t change the evaluation pipeline.
We still:
- Produce scores
- Convert to probabilities
- Choose labels
- Compute metrics
The logic remains.
The representation changed.
Here’s the part that matters most.
If two words never co-occur, a sparse model will treat them as unrelated.
Dense embeddings can generalize beyond direct co-occurrence.
That means: Different representations make different mistakes.
Some errors are impossible in one space, and inevitable in another.
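A sketch of that difference, where both the sparse counts and the dense vectors are hand-constructed to illustrate the point:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Sparse count vectors for two synonyms that never share a context:
# no overlapping nonzero dimensions, so similarity is exactly zero.
sparse_film  = [3, 0, 1, 0, 0, 0]
sparse_movie = [0, 2, 0, 0, 4, 0]
print(cosine(sparse_film, sparse_movie))  # 0.0 -- "unrelated" by construction

# Dense (hand-picked, illustrative) vectors for the same words can still
# land close together, because learned embeddings generalize beyond
# direct co-occurrence.
dense_film  = [0.70, 0.60, 0.10]
dense_movie = [0.65, 0.62, 0.15]
print(cosine(dense_film, dense_movie))    # close to 1.0
```

The sparse space cannot represent this similarity at all; the dense space cannot help but represent some similarity. Different geometries, different reachable mistakes.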
So when we evaluate a model, we’re not just evaluating the classifier.
We’re evaluating:
- The representation
- The geometry
- The assumptions about meaning
In multiclass settings, a confusion matrix is a diagnostic lens.
It tells us:
- Which representations collapse distinctions
- Which classes overlap in feature space
- Whether the model is confusing meaning or frequency
If two classes are constantly confused, that might be a representation problem.
Large language models don’t use bag-of-words.
They use contextual embeddings.
But the same principle holds.
When we evaluate LLM behavior, we are evaluating the geometry of those spaces.
And when we talk about confidence, we are implicitly trusting that geometry.
That trust deserves scrutiny.
We’ve now connected three layers:
- Decisions
- Evaluation
- Representation
We’ve seen that decisions rest on thresholds, evaluation encodes priorities, and representation determines which mistakes are even possible.
But there’s one major shift we’ve only hinted at. Dense embeddings were learned.
And one of the models that changed everything was Word2Vec. It started with something simpler. Predict context. Or predict a word from its context. That’s it.
From that objective alone, geometry emerged.
And under the hood, something even more elegant was happening:
Skip-gram with Negative Sampling.
Instead of learning meaning directly, the model learned to distinguish:
- Real word-context pairs
- Randomly sampled fake pairs
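Here's a sketch of that contrast written as a loss function. Everything is a toy stand-in: the vocabulary, the randomly initialized vectors, and the sampled negatives are all invented for illustration.

```python
import math
import random

# Tiny illustrative setup: each word gets a random "center" vector and a
# random "context" vector (in Word2Vec these are learned, not fixed).
dim = 4
random.seed(0)
vocab = ["king", "crown", "throne", "banana", "puddle"]
center = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
context = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_loss(word, real_ctx, negatives):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # The real (word, context) pair should score high (sigmoid near 1)...
    loss = -math.log(sigmoid(dot(center[word], context[real_ctx])))
    # ...and each randomly sampled fake pair should score low (near 0).
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center[word], context[neg])))
    return loss

print(sgns_loss("king", "crown", ["banana", "puddle"]))
```

Training pushes this loss down, which pulls real pairs together and fake pairs apart; the semantic geometry falls out of that tug-of-war.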
Meaning was inferred from contrast. Which raises deeper questions:
- Why does predicting context create semantic structure?
- Why does negative sampling work?
- What is the model actually optimizing?
- And how does this connect back to PMI and mutual information?
In the next post, we’ll unpack:
- Word2Vec intuition
- Skip-gram architecture
- Negative sampling as implicit matrix factorization
- The hidden connection to PMI
- And why dense embeddings changed NLP permanently
Because once we understand how embeddings are learned, we start seeing the math underneath. Thanks for reading and learning along with me. See you in the next one!