6. What Is the Model Actually Looking At?
In the last blog, we talked about how probabilities turn into decisions.
Thresholds. Trade-offs. Metrics.
But there’s something deeper hiding underneath all of that.
We need to ask: What is the model actually seeing?
Because models don’t really see text. They see numbers. And the way we turn language into numbers determines what kinds of mistakes are even possible.
Binary classification is easy.
But multiclass changes the geometry.
Instead of:
- True positive
- False positive
- True negative
- False negative
We now have a k × k confusion matrix.
In a multiclass confusion matrix, no class is literally labeled positive or negative; that framing is an interpretation we apply one class at a time in order to compute class-wise metrics.
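To make that concrete, here's a minimal sketch of building a k × k confusion matrix in plain Python (the labels and predictions are invented for illustration):

```python
from collections import Counter

# Toy labels for a 3-class problem; class names are illustrative.
classes = ["positive", "negative", "meh"]
y_true = ["positive", "negative", "meh", "positive", "meh", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "negative", "negative"]

# Count each (true, predicted) pair, then lay them out as a k x k grid:
# rows = true class, columns = predicted class.
pair_counts = Counter(zip(y_true, y_pred))
matrix = [[pair_counts[(t, p)] for p in classes] for t in classes]

for name, row in zip(classes, matrix):
    print(f"{name:>8}: {row}")
```

Notice how both "meh" examples land in the "negative" column: the off-diagonal cells are exactly where the interesting questions live.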
Now, beyond asking “How often were we right?”, we should also ask:

- Which classes are being confused?
- Are rare classes being ignored?
- Are two labels systematically collapsing into one?
- Is the model learning boundaries or just frequency?
Accuracy becomes less diagnostic. Because accuracy hides where we are wrong. And in multiclass settings, where we are wrong matters more than how often.
Consider this:
- 270 examples overall
- 135 positive
- 100 negative
- 35 “meh”
If the model performs well on positive and negative but struggles on “meh,” overall accuracy may still look strong. But if “meh” is the label we care about, we are failing.
This is why macro vs micro averaging exists.
Neither is objectively correct; each encodes a value judgment.
Evaluation reflects priorities, and it's not neutral.
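To make that value judgment concrete, here's a small sketch using the 270/135/100/35 scenario above, under the assumption that every “meh” example gets misclassified as negative:

```python
# The 270-example scenario: right on "positive" and "negative",
# wrong on every "meh" (the split and failure mode are illustrative).
y_true = ["positive"] * 135 + ["negative"] * 100 + ["meh"] * 35
y_pred = ["positive"] * 135 + ["negative"] * 100 + ["negative"] * 35

# Micro-style view: pool all decisions (for single-label data this
# is just accuracy), so big classes dominate.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(cls):
    hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    return hits / y_true.count(cls)

# Macro view: average per-class recalls, so each class counts equally
# and the completely-failed "meh" class drags the score down.
macro_recall = sum(recall(c) for c in ("positive", "negative", "meh")) / 3

print(f"accuracy:     {accuracy:.3f}")      # 235/270 ≈ 0.870
print(f"macro recall: {macro_recall:.3f}")  # (1 + 1 + 0) / 3 ≈ 0.667
```

Same predictions, two very different numbers. Which one you report is a statement about which classes you think matter.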
Even deeper than metrics is representation.
Before a model can misclassify a word, before it can confuse classes, before it can overfit, it has to represent text numerically. And how we do that changes everything.
Early NLP was brutally simple.
Take a vocabulary. Count how often each word appears.
Each document becomes a vector of counts. This gives us a term-document matrix.
High-dimensional. Sparse. Mostly zeros.
This representation captures frequency. But not meaning.
The word “good” might appear everywhere. But it doesn’t tell us why it appears.
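The counting step described above can be sketched in a few lines of Python (the corpus and vocabulary are a made-up toy):

```python
from collections import Counter

# A tiny illustrative corpus: three short "documents".
docs = [
    "good movie good plot",
    "bad movie bad acting",
    "good acting",
]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix: one vector of word counts per document.
# Counter returns 0 for words the document doesn't contain,
# which is why the result is mostly zeros (sparse).
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
for row in matrix:
    print(row)
```

Even on three tiny documents, most entries are zero; with a realistic vocabulary of tens of thousands of words, the sparsity is extreme.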
Frequency alone is crude. So we improved it.
TF-IDF asks: Is a word that appears in every document actually informative?
The answer is usually no.
If a word appears in every document:
IDF ≈ log(1) = 0.
It contributes nothing. TF-IDF doesn’t understand meaning.
But it begins to capture informativeness.
It says: Not all words are equally useful.
That’s a step toward structure.
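Here's a quick sketch of that IDF behavior, on a toy corpus constructed so that “movie” appears in every document:

```python
import math

# Toy corpus; "movie" is in every document by construction.
docs = [
    "good movie",
    "bad movie",
    "boring movie plot",
]
n_docs = len(docs)

def idf(word):
    # Document frequency: in how many documents does the word appear?
    df = sum(word in d.split() for d in docs)
    return math.log(n_docs / df)

print(idf("movie"))  # log(3/3) = 0.0 -> contributes nothing
print(idf("plot"))   # log(3/1) ≈ 1.099 -> distinctive
```

The ubiquitous word is weighted to zero; the rare, distinctive word gets the weight. That's the whole trick.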
Pointwise Mutual Information (PMI) goes further.
It asks: Do two words occur together more often than chance?
Formally: PMI(w, d) = log( P(w, d) / (P(w) P(d)) )
The idea is straightforward.
The denominator, P(w) P(d), represents what we would expect if the word and the document were independent.
In other words, if nothing special is happening, how often should they co-occur?
The numerator, P(w, d), is what we actually observe.
So PMI compares: What happened vs What randomness would predict
If the ratio is greater than 1, the word appears more often than expected. PMI becomes positive. If the ratio equals 1, the word behaves independently. PMI is zero. If the ratio is less than 1, the word appears less often than expected. PMI becomes negative.
The log smooths the scale. It makes interpretation cleaner. It turns multiplicative relationships into additive ones. So PMI is essentially measuring deviation from independence.
Then comes the key idea. A word is characterized by the company it keeps.
If two words appear in similar contexts, they probably mean similar things.
This is the distributional hypothesis.
Instead of defining meaning explicitly, we infer it from usage patterns.
Words become vectors. Location in space becomes meaning. Similarity becomes geometric.
Now:
- “king” - “man” + “woman” ≈ “queen”
- “dog” is close to “puppy”
- “doctor” is closer to “hospital” than to “mountain”
The geometry reflects relationships. Meaning becomes distance.
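A sketch of that geometry using cosine similarity (the 3-dimensional vectors here are hand-invented stand-ins, not learned embeddings, which typically have hundreds of dimensions):

```python
import math

# Hand-picked toy vectors, purely for illustration.
vecs = {
    "dog":      [0.90, 0.80, 0.10],
    "puppy":    [0.85, 0.75, 0.20],
    "mountain": [0.10, 0.20, 0.90],
}

def cosine(a, b):
    # Cosine of the angle between two vectors: 1 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Similarity becomes geometric: "dog" sits near "puppy", far from "mountain".
print(cosine(vecs["dog"], vecs["puppy"]))     # high
print(cosine(vecs["dog"], vecs["mountain"]))  # low
```

In a real embedding space the vectors are learned, but the measurement is exactly this: meaning as angle and distance.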
There are two broad types of embeddings.

Sparse representations:

- Very high-dimensional
- Mostly zeros
- Interpretable
- Based on surface statistics

Dense embeddings:

- Low-dimensional (100-1000)
- Continuous values
- Capture semantic similarity
- Harder to interpret directly
Sparse representations tell us: “What words occurred?”
Dense representations tell us: “What relationships exist?”
This shift changed NLP. But it didn’t change the evaluation pipeline.
We still:
- Produce scores
- Convert to probabilities
- Choose labels
- Compute metrics
The logic remains.
The representation changed.
Here’s the part that matters most.
If two words never co-occur, a sparse model will treat them as unrelated.
Dense embeddings can generalize beyond direct co-occurrence.
That means: Different representations make different mistakes.
Some errors are impossible in one space, and inevitable in another.
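A sketch of that difference, where both the sparse counts and the dense vectors are hand-constructed to illustrate the point:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Sparse count vectors for two synonyms that never share a context:
# no overlapping nonzero dimensions, so similarity is exactly zero.
sparse_film  = [3, 0, 1, 0, 0, 0]
sparse_movie = [0, 2, 0, 0, 4, 0]
print(cosine(sparse_film, sparse_movie))  # 0.0 -- "unrelated" by construction

# Dense (hand-picked, illustrative) vectors for the same words can still
# land close together, because learned embeddings generalize beyond
# direct co-occurrence.
dense_film  = [0.70, 0.60, 0.10]
dense_movie = [0.65, 0.62, 0.15]
print(cosine(dense_film, dense_movie))    # close to 1.0
```

The sparse space cannot represent this similarity at all; the dense space cannot help but represent some similarity. Different geometries, different reachable mistakes.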
So when we evaluate a model, we’re not just evaluating the classifier.
We’re evaluating:
- The representation
- The geometry
- The assumptions about meaning
In multiclass settings, a confusion matrix is a diagnostic lens.
It tells us:
- Which representations collapse distinctions
- Which classes overlap in feature space
- Whether the model is confusing meaning or frequency
If two classes are constantly confused, that might be a representation problem.
Large language models don’t use bag-of-words.
They use contextual embeddings.
But the same principle holds.
When we evaluate LLM behavior, we are evaluating the geometry of those spaces.
And when we talk about confidence, we are implicitly trusting that geometry.
That trust deserves scrutiny.
We’ve now connected three layers:
- Decisions
- Evaluation
- Representation
We’ve seen that decisions rest on thresholds, evaluation encodes priorities, and representation determines which mistakes are even possible.
But there’s one major shift we’ve only hinted at. Dense embeddings were learned.
And one of the models that changed everything was Word2Vec. It started with something simpler. Predict context. Or predict a word from its context. That’s it.
From that objective alone, geometry emerged.
And under the hood, something even more elegant was happening:
Skip-gram with Negative Sampling.
Instead of learning meaning directly, the model learned to distinguish:
- Real word-context pairs
- Randomly sampled fake pairs
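Here's a sketch of that contrast written as a loss function. Everything is a toy stand-in: the vocabulary, the randomly initialized vectors, and the sampled negatives are all invented for illustration.

```python
import math
import random

# Tiny illustrative setup: each word gets a random "center" vector and a
# random "context" vector (in Word2Vec these are learned, not fixed).
dim = 4
random.seed(0)
vocab = ["king", "crown", "throne", "banana", "puddle"]
center = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
context = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgns_loss(word, real_ctx, negatives):
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # The real (word, context) pair should score high (sigmoid near 1)...
    loss = -math.log(sigmoid(dot(center[word], context[real_ctx])))
    # ...and each randomly sampled fake pair should score low (near 0).
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(center[word], context[neg])))
    return loss

print(sgns_loss("king", "crown", ["banana", "puddle"]))
```

Training pushes this loss down, which pulls real pairs together and fake pairs apart; the semantic geometry falls out of that tug-of-war.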
Meaning was inferred from contrast. Which raises deeper questions:
- Why does predicting context create semantic structure?
- Why does negative sampling work?
- What is the model actually optimizing?
- And how does this connect back to PMI and mutual information?
In the next post, we’ll unpack:
- Word2Vec intuition
- Skip-gram architecture
- Negative sampling as implicit matrix factorization
- The hidden connection to PMI
- And why dense embeddings changed NLP permanently
Because once we understand how embeddings are learned, we start seeing the math underneath. Thanks for reading and learning along with me. See you in the next one!