5. Turning Probabilities into Decisions

In the last post, we talked about how NLP systems moved from generating language to making decisions.

We saw how classifiers produce probabilities.
How loss functions tell us how wrong we are.
How optimization nudges models toward better behavior.

But there’s a step in this pipeline that often gets overlooked.

At some point, a probability becomes a decision.

And that step is where most failures happen.


Probabilities Are Not Decisions

A classifier rarely outputs a label directly.

It outputs something like: 0.73

That number doesn’t mean “spam.”

It means: “Given the model, the data, and the assumptions baked into training, this input looks spam-like with probability 0.73.”

To turn that into a decision, we introduce a threshold.

If probability ≥ threshold -> positive class
Otherwise -> negative class

The most common threshold is 0.5.

Because it’s convenient.

And that convenience hides trade-offs.
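The rule above fits in one line of code. Here's a minimal sketch in Python — the spam labels and the stricter 0.8 threshold are just illustrative:

```python
def decide(prob, threshold=0.5):
    """Turn a probability into a binary decision."""
    return "spam" if prob >= threshold else "not spam"

# The same score, two different decisions:
print(decide(0.73))                 # default threshold -> "spam"
print(decide(0.73, threshold=0.8))  # stricter threshold -> "not spam"
```

Nothing about the model changed between those two calls. Only the line we drew.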


Thresholds Encode Values

Changing the threshold changes behavior.

Lower the threshold:

  • More positives

  • Higher recall

  • More false positives

Raise the threshold:

  • Fewer positives

  • Higher precision

  • More false negatives

There is no universally “correct” threshold.

A spam filter that misses spam is annoying.
A medical test that misses cancer is catastrophic.
A fraud model that blocks legitimate users loses trust.

The same classifier, with the same probabilities, can be:

  • Too strict

  • Too lenient

  • Or just right

Depending entirely on where we draw the line.

This is why classification is more than just modeling.
It’s decision design.
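You can watch the trade-off happen with a few lines of Python. This is a toy sketch — the probabilities and labels are made up — but it shows how sweeping the threshold moves precision and recall in opposite directions:

```python
def precision_recall(probs, labels, threshold):
    """labels: 1 = positive class, 0 = negative class."""
    preds = [p >= threshold for p in probs]
    tp = sum(1 for pred, y in zip(preds, labels) if pred and y == 1)
    fp = sum(1 for pred, y in zip(preds, labels) if pred and y == 0)
    fn = sum(1 for pred, y in zip(preds, labels) if not pred and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

probs  = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]  # model scores (hypothetical)
labels = [1,   1,   0,   1,   0,   0]    # ground truth

for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(probs, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold here pushes precision up and recall down. The model never changes; only the decision rule does.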


Binary vs Multiclass

Binary classification looks clean:

Spam / not spam
Positive / negative

Multiclass feels harder:

Topic A, B, C, D…

But under the hood, it runs on the same machinery.

In multiclass models:

  • We still output probabilities

  • We still pick the largest one

  • We still act as if confidence implies correctness

A model that outputs:

  • Class A: 0.34

  • Class B: 0.33

  • Class C: 0.33

will confidently choose Class A.

The decision looks sharp.
The uncertainty is hidden.
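A quick sketch makes this concrete. Using the distribution above, argmax happily picks a winner, even though the runner-up is a hair behind:

```python
probs = {"A": 0.34, "B": 0.33, "C": 0.33}

# argmax: pick the largest probability
label = max(probs, key=probs.get)

# margin between winner and runner-up
margin = probs[label] - sorted(probs.values())[-2]

print(label)            # "A"
print(round(margin, 2)) # 0.01 -- the decision looks sharp, the evidence is not
```

Any system consuming only `label` never learns how close the race was. Keeping the margin (or the full distribution) around is one cheap way to surface hidden uncertainty.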

This is where evaluation becomes harder and more important.


From a 2×2 Table to Many Ways of Being Wrong

Binary classification gives us a simple 2×2 confusion matrix:

  • True positives

  • False positives

  • True negatives

  • False negatives

Multiclass classification generalizes this idea.

Instead of one positive class, every class becomes “positive” in turn, and the confusion matrix grows.

Now we have questions like:

  • Which classes are getting confused with each other?

  • Are rare classes being ignored?

  • Is the model good overall, but bad for specific labels?

Evaluation stops being a single number and becomes a structure.
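That structure is easy to build by hand. Here's a small sketch of a multiclass confusion matrix — the labels and predictions are invented for illustration:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """cm[t][p] = number of items with true class t predicted as p."""
    counts = Counter(zip(y_true, y_pred))
    return {t: {p: counts[(t, p)] for p in classes} for t in classes}

y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "A", "C"]

cm = confusion_matrix(y_true, y_pred, ["A", "B", "C"])

# Off-diagonal cells answer "which classes get confused with each other?"
print(cm["A"]["B"])  # true A predicted as B
print(cm["B"]["B"])  # true B predicted correctly
```

The diagonal is what accuracy sees. The off-diagonal cells are where the interesting questions live.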


Why Accuracy Can Mislead

Accuracy still answers one question: “How often did we get it right?”

But in multiclass settings, that question becomes even less useful.

A model can:

  • Perform well on dominant classes

  • Completely fail on rare but important ones

  • Still report high accuracy

Accuracy collapses all mistakes into one number.

But good models care about which mistakes happen and where.
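A deliberately extreme sketch shows how this plays out. Imagine 95 common examples, 5 rare ones, and a model that only ever predicts the common class:

```python
# 95 "common" examples, 5 "rare" ones; the model always predicts "common"
y_true = ["common"] * 95 + ["rare"] * 5
y_pred = ["common"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
rare_recall = sum(
    p == "rare" for t, p in zip(y_true, y_pred) if t == "rare"
) / 5

print(accuracy)     # 0.95 -- looks great
print(rare_recall)  # 0.0  -- the rare class is never found
```

95% accuracy, and the class we might care most about is invisible. This is exactly the failure per-class metrics exist to catch.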


Precision, Recall, and Averaging Choices

Precision and recall still matter.

But now we have choices:

  • Compute them per class

  • Average them across classes

This is where micro vs macro averaging appears.

Macro averaging:

  • Treats all classes equally

  • Highlights poor performance on rare classes

Micro averaging:

  • Weighs classes by frequency

  • Reflects overall volume

Neither is “correct.”

Each one encodes a different priority.

Metrics are important because they reveal what we care about.
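Here's a toy sketch of the difference, using recall and a hand-built confusion matrix (all counts are made up). Macro averages the per-class scores; micro pools the raw counts:

```python
def per_class_recall(cm, classes):
    """cm[t][p] = count of true class t predicted as p."""
    return {c: cm[c][c] / max(sum(cm[c].values()), 1) for c in classes}

classes = ["common", "rare"]
cm = {
    "common": {"common": 95, "rare": 0},  # 95 examples, all correct
    "rare":   {"common": 4,  "rare": 1},  # 5 examples, 1 correct
}

recalls = per_class_recall(cm, classes)

macro = sum(recalls.values()) / len(classes)  # treats classes equally
micro = (95 + 1) / 100                        # pools correct / total

print(round(macro, 2))  # 0.6  -- the rare class drags the score down
print(round(micro, 2))  # 0.96 -- dominated by the frequent class
```

Same model, same predictions, two very different numbers. Which one you report is a statement about what matters.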


Representation Comes Back Into the Picture

Up to now, we’ve talked as if the model is making decisions over “text.”

But models don’t see words. They see vectors.

Before classification, text must be turned into numbers.

Early NLP did this with:

  • Bag-of-words counts

  • TF-IDF

  • PMI-based word associations

These methods didn’t understand meaning. They captured patterns of usage.

TF-IDF answers:

  • How important is this word to this document, relative to the corpus?

PMI asks:

  • Do two words co-occur more often than chance would suggest?

Both are attempts to quantify relationships in language using statistics.

They shaped what classifiers could learn.
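The core TF-IDF computation is short enough to sketch directly. This is a bare-bones version over a tiny invented corpus (real implementations add smoothing and normalization):

```python
import math

def tf_idf(term, doc, corpus):
    """Term frequency in doc, discounted by how common the term is overall."""
    tf = doc.count(term) / len(doc)
    df = sum(term in d for d in corpus)          # documents containing term
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "slept"],
]

# "the" appears in every document -> idf = log(3/3) = 0 -> score 0
print(tf_idf("the", corpus[0], corpus))

# "dog" appears in only one document -> a distinctive, nonzero score there
print(tf_idf("dog", corpus[1], corpus))
```

Words that appear everywhere carry no signal; words concentrated in a few documents carry a lot. That's the whole intuition, in statistics rather than semantics.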


Embeddings Generalized These Ideas

Modern embeddings feel very different.

Dense vectors.
Continuous spaces.
Semantic similarity.

But conceptually, they’re doing the same thing:

  • Representing words and documents numerically

  • Encoding relationships

  • Making similarity computable

The difference is scale and smoothness.
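"Making similarity computable" usually means cosine similarity. A sketch with toy 3-dimensional "embeddings" (real ones have hundreds of dimensions, and these numbers are invented):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.4]
car = [0.1, 0.9, 0.2]

# "cat" sits closer to "dog" than to "car" in this space
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```

Once words are points in a space, "related" becomes "nearby", and a geometric question replaces a linguistic one.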

Before models can decide well, they have to see well.


Logistic Regression’s Legacy

Logistic regression shaped early NLP because it worked.

But more importantly, it shaped how we think.

It taught us to:

  • Predict probabilities and derive labels from them

  • Separate scoring from decision-making

  • Evaluate behavior, not just accuracy

  • Reason explicitly about uncertainty

Even modern neural classifiers follow the same pipeline:

Linear scores → nonlinearity → probabilities → thresholds → decisions
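That entire pipeline fits in a few lines. A sketch with made-up weights and features for a single input:

```python
import math

def sigmoid(z):
    """Squash a linear score into a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical learned weights and one input's features
weights, bias = [1.2, -0.7, 0.4], -0.3
features = [1.0, 0.5, 2.0]

score = sum(w * x for w, x in zip(weights, features)) + bias  # linear score
prob = sigmoid(score)                                         # nonlinearity -> probability
decision = prob >= 0.5                                        # threshold -> decision
```

Swap the dot product for a deep network and the sigmoid for a softmax, and this is still the shape of a modern neural classifier.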

The tools changed.
The logic stayed the same.


Why This Still Matters Today

Large language models feel different.

They generate text.
They reason.
They converse.

But the moment an LLM:

  • Flags content

  • Extracts fields

  • Routes a request

  • Triggers an action

It becomes a classifier again.

Probabilities.
Thresholds.
Trade-offs.
Metrics.

The same old problems, just at a larger scale.


Where This Takes Us Next

Once we understand how models decide, another layer becomes unavoidable:

How do we evaluate models when there are many classes?
How do representations shape what errors are even possible?
How much trust should we place in confidence scores?

In the next post, we’ll go deeper into representation and reliability:

  • Multiclass evaluation in practice

  • Confusion matrices as diagnostic tools

  • TF-IDF and PMI as early meaning models

  • Embeddings and what they really changed

Before we can trust decisions, we need to understand what the model is actually looking at.

Thanks for reading and learning along with me. See you in the next one!
