5. Turning Probabilities into Decisions
In the last post, we talked about how NLP systems moved from generating language to making decisions.
But there’s a step in this pipeline that often gets overlooked.
At some point, a probability becomes a decision.
And that step is where most failures happen.
Probabilities Are Not Decisions
A classifier rarely outputs a label directly.
It outputs something like: 0.73
That number doesn’t mean “spam.”
It means: “Given the model, the data, and the assumptions baked into training, this input looks spam-like with probability 0.73.”
To turn that into a decision, we introduce a threshold.
The most common threshold is 0.5.
Because it’s convenient.
And that convenience hides trade-offs.
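As a minimal sketch of that step (the 0.73 score and the spam labels are just illustrative):

```python
def decide(p_spam: float, threshold: float = 0.5) -> str:
    """Turn a model's probability into a hard label."""
    return "spam" if p_spam >= threshold else "not spam"

# The same score yields different decisions under different thresholds.
print(decide(0.73))                 # default 0.5
print(decide(0.73, threshold=0.8))  # stricter
```

The model's output never changed; only our decision rule did.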
Thresholds Encode Values
Changing the threshold changes behavior.
Lower the threshold:
- More positives
- Higher recall
- More false positives
Raise the threshold:
- Fewer positives
- Higher precision
- More false negatives
There is no universally “correct” threshold.
The same classifier, with the same probabilities, can be:
- Too strict
- Too lenient
- Or just right
Depending entirely on where we draw the line.
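To make the trade-off concrete, here is a sketch that sweeps the threshold over a handful of made-up (probability, true label) pairs:

```python
# Hypothetical (probability, true_label) pairs from a spam classifier.
scored = [(0.95, 1), (0.80, 1), (0.60, 0), (0.55, 1), (0.30, 0), (0.10, 0)]

def counts(threshold):
    """True positives, false positives, false negatives at a given threshold."""
    tp = sum(1 for p, y in scored if p >= threshold and y == 1)
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    fn = sum(1 for p, y in scored if p < threshold and y == 1)
    return tp, fp, fn

for t in (0.25, 0.5, 0.75):
    tp, fp, fn = counts(t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={t}: precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold trades recall for precision; the probabilities themselves never move.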
Binary vs Multiclass
Binary classification looks clean: spam or not spam.
Multiclass feels harder:
Topic A, B, C, D…
But under the hood, it is not as different as it looks.
In multiclass models:
- We still output probabilities
- We still pick the largest one
- We still act as if confidence implies correctness
A model that outputs:
- Class A: 0.34
- Class B: 0.33
- Class C: 0.33
will confidently choose Class A.
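The argmax step hides how close that call was. A tiny sketch, using the distribution above:

```python
probs = {"A": 0.34, "B": 0.33, "C": 0.33}

# argmax picks the winner but ignores how close the runners-up are.
prediction = max(probs, key=probs.get)
runner_up = sorted(probs.values())[-2]
margin = probs[prediction] - runner_up

print(prediction, f"margin={margin:.2f}")
```

A 0.01 margin and a 0.60 margin produce exactly the same label.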
This is where evaluation becomes harder and more important.
From a 2×2 Table to Many Ways of Being Wrong
Binary classification gives us a simple 2×2 confusion matrix:
- True positives
- False positives
- True negatives
- False negatives
Multiclass classification generalizes this idea.
Instead of one positive class, every class becomes “positive” in turn, and the confusion matrix grows.
Now we have questions like:
- Which classes are getting confused with each other?
- Are rare classes being ignored?
- Is the model good overall, but bad for specific labels?
Evaluation stops being a single number and becomes a structure.
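A multiclass confusion matrix is easy to build by hand. A sketch with made-up predictions, where class C keeps getting mistaken for class A:

```python
from collections import Counter

labels = ["A", "B", "C"]
true = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"]
pred = ["A", "A", "B", "B", "B", "C", "C", "A", "A", "A"]

# confusion[(t, p)] counts how often true class t was predicted as p.
confusion = Counter(zip(true, pred))

for t in labels:
    row = [confusion[(t, p)] for p in labels]
    print(t, row)
```

The off-diagonal cell (C, A) is where this model's real problem lives, and no single summary number would show it.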
Why Accuracy Can Mislead
Accuracy still answers one question: “How often did we get it right?”
But in multiclass settings, that question becomes even less useful.
A model can:
- Perform well on dominant classes
- Completely fail on rare but important ones
- Still report high accuracy
Accuracy collapses all mistakes into one number.
But good models care about which mistakes happen and where.
Precision, Recall, and Averaging Choices
Precision and recall still matter.
But now we have choices:
- Compute them per class
- Average them across classes
This is where micro vs macro averaging appears.
Macro averaging:
- Treats all classes equally
- Highlights poor performance on rare classes
Micro averaging:
- Weighs classes by frequency
- Reflects overall volume
Neither is “correct.”
Each one encodes a different priority.
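Here is a sketch of the difference, using per-class recall on a made-up imbalanced dataset (for single-label problems, micro-averaged recall works out to plain accuracy):

```python
true = ["big"] * 90 + ["small"] * 10
pred = ["big"] * 90 + ["big"] * 8 + ["small"] * 2  # mostly misses "small"

def recall(cls):
    tp = sum(t == p == cls for t, p in zip(true, pred))
    return tp / true.count(cls)

# Macro: unweighted mean over classes — the rare class counts fully.
macro = (recall("big") + recall("small")) / 2
# Micro: pooled over all examples — dominated by the frequent class.
micro = sum(t == p for t, p in zip(true, pred)) / len(true)

print(f"macro={macro:.2f} micro={micro:.2f}")
```

Same predictions, two very different stories: micro says the model is fine, macro says half the job is failing.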
Representation Comes Back Into the Picture
Up to now, we’ve talked as if the model is making decisions over “text.”
But models don’t see words. They see vectors.
Before classification, text must be turned into numbers.
Early NLP did this with:
- Bag-of-words counts
- TF-IDF
- PMI-based word associations
TF-IDF answers: how important is this word to this document, relative to the corpus?
PMI asks: do two words co-occur more often than chance would suggest?
Both are attempts to quantify relationships in language using statistics.
They shaped what classifiers could learn.
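A minimal TF-IDF sketch on a toy corpus (this is one common variant of the weighting; real implementations differ in smoothing and normalization):

```python
import math

docs = ["the cat sat", "the dog sat", "the cat ran"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf_idf(term, doc):
    """Term frequency in this doc, discounted by how many docs contain the term."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in tokenized)   # document frequency
    idf = math.log(n_docs / df)
    return tf * idf

# "the" appears in every document, so its idf (and hence tf-idf) is zero;
# "dog" is distinctive to its document, so it scores high.
print(tf_idf("the", tokenized[1]))
print(tf_idf("dog", tokenized[1]))
```

Words that appear everywhere carry no signal; words concentrated in one document do. That is the whole intuition in two lines of arithmetic.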
Embeddings Generalized These Ideas
Modern embeddings feel very different.
But conceptually, they’re doing the same thing:
- Representing words and documents numerically
- Encoding relationships
- Making similarity computable
The difference is scale and smoothness.
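"Making similarity computable" usually means cosine similarity between vectors. A sketch with toy 3-dimensional "embeddings" (real ones have hundreds of dimensions, and these numbers are invented):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

cat = [0.9, 0.1, 0.0]
dog = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine(cat, dog))  # close to 1: similar
print(cosine(cat, car))  # close to 0: dissimilar
```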
Before models can decide well, they have to see well.
Logistic Regression’s Legacy
Logistic regression shaped early NLP because it worked.
But more importantly, it shaped how we think.
It taught us to:
- Predict probabilities and derive labels from them
- Separate scoring from decision-making
- Evaluate behavior and not rely just on accuracy
- Reason explicitly about uncertainty
Even modern neural classifiers follow the same pipeline:
Linear scores → nonlinearity → probabilities → thresholds → decisions
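That whole pipeline fits in a few lines. A sketch with hypothetical learned weights (the weights, bias, and features are made up):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical learned weights over two features (e.g. word counts).
weights, bias = [1.2, -0.7], -0.3

def classify(features, threshold=0.5):
    score = sum(w * x for w, x in zip(weights, features)) + bias  # linear score
    prob = sigmoid(score)                                        # nonlinearity -> probability
    return prob, int(prob >= threshold)                          # threshold -> decision

prob, label = classify([2.0, 1.0])
print(prob, label)
```

Swap the linear score for a deep network's logits and nothing else changes: the nonlinearity, the probability, and the threshold are still doing the deciding.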
Why This Still Matters Today
Large language models feel different.
But the moment an LLM:
- Flags content
- Extracts fields
- Routes a request
- Triggers an action
It becomes a classifier again.
The same old problems, just at a larger scale.
Where This Takes Us Next
Once we understand how models decide, another layer becomes unavoidable: what the model actually sees.
In the next post, we’ll go deeper into representation and reliability:
- Multiclass evaluation in practice
- Confusion matrices as diagnostic tools
- TF-IDF and PMI as early meaning models
- Embeddings and what they really changed
Before we can trust decisions, we need to understand what the model is actually looking at.
Thanks for reading and learning along with me. See you in the next one!