4. From Predicting Words to Making Decisions
So far, we’ve focused on language models as generators.
They read text.
They predict what comes next.
They assign probabilities to words.
But for a long time, most NLP systems didn’t generate language at all.
They made decisions.
Is this email spam or not?
Is this review positive or negative?
Does this document belong to topic A or topic B?
This shift from predicting words to predicting labels is where early NLP systems spent most of their time. And it’s where ideas like logistic regression, cross-entropy loss, and gradient descent became foundational.
From Sequences to Labels
In language modeling, the output is a distribution over possible next tokens.
In classification, the output is simpler:
- A class label
- A probability over a small number of classes
Instead of asking:
“What word comes next?”
We ask:
“Which category does this input belong to?”
At first glance, this sounds easier. Fewer outputs. Clear answers.
In practice, it introduces a different set of challenges.
Binary Decisions and Probabilistic Thinking
Consider the simplest case: binary classification.
- Spam or not spam
- Positive or negative
- Relevant or irrelevant
We don’t just want a hard yes or no.
We want a probability.
A system that says:
“This email is spam with 51% confidence”
behaves very differently from one that says:
“This email is spam with 99% confidence”
Even though both output the same label.
This is why early NLP leaned so heavily on probabilistic models. They let us:
- Rank results
- Set thresholds
- Trade off precision and recall
- Make decisions under uncertainty
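As a tiny illustration of why probabilities matter, here's how probabilistic outputs support ranking and thresholding. The email names and scores below are made up for the sketch:

```python
# Hypothetical spam probabilities from some trained classifier.
scores = {"email_a": 0.51, "email_b": 0.99, "email_c": 0.10}

# Rank results: most-likely spam first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['email_b', 'email_a', 'email_c']

# Set a threshold: a stricter cutoff flags fewer emails,
# trading recall for precision.
threshold = 0.9
flagged = [email for email, p in scores.items() if p >= threshold]
print(flagged)  # ['email_b']
```

With a hard yes/no classifier, `email_a` (51%) and `email_b` (99%) would be indistinguishable; with probabilities, we can treat them very differently.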
Why Logistic Regression?
Logistic regression became the workhorse of early NLP because it was exactly what the problem needed.
It:
- Turns a weighted sum of features into a probability
- Produces outputs between 0 and 1
- Is interpretable
- Is fast and stable to train
- Plays nicely with sparse text features
Under the hood, it’s doing something simple but very powerful:
1. Score the input
2. Pass that score through a sigmoid
3. Interpret the result as a probability
That single design choice, modeling probabilities instead of decisions, made it incredibly useful.
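A minimal sketch of those three steps. The feature counts and weights here are invented for illustration, not learned from data:

```python
import math

def sigmoid(z):
    # Squash any real-valued score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Toy sparse text features (word counts) and made-up learned weights.
features = {"free": 2, "winner": 1, "meeting": 0}
weights = {"free": 1.2, "winner": 2.0, "meeting": -1.5}
bias = -1.0

# 1. Score the input: bias plus a weighted sum of features.
score = bias + sum(weights[w] * count for w, count in features.items())

# 2. Pass that score through the sigmoid.
p_spam = sigmoid(score)

# 3. Interpret the result as a probability.
print(f"P(spam) = {p_spam:.3f}")  # P(spam) = 0.968
```

Words that never appear contribute nothing to the sum, which is exactly why this pairs so well with sparse bag-of-words features.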
Learning Means Minimizing Loss
Once you frame classification probabilistically, a natural question follows:
How wrong is the model?
This is where loss functions come in.
Rather than counting mistakes, we measure how much probability the model assigns to the correct answer.
If the model is confident and right -> low loss
If the model is confident and wrong -> high loss
Cross-entropy loss captures this intuition cleanly.
It punishes confident mistakes far more than uncertain ones.
That turns learning into an optimization problem:
Adjust the weights to minimize average loss on the training data.
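For a binary label, cross-entropy is just the negative log of the probability the model placed on the correct answer. A quick sketch of how it punishes confident mistakes:

```python
import math

def cross_entropy(p_correct):
    # Negative log-probability of the true label.
    return -math.log(p_correct)

# Confident and right -> low loss
print(cross_entropy(0.99))  # ~0.01
# Uncertain -> moderate loss
print(cross_entropy(0.51))  # ~0.67
# Confident and wrong (only 1% on the true label) -> high loss
print(cross_entropy(0.01))  # ~4.61
```

The loss grows without bound as the probability on the true label approaches zero, which is what makes confident mistakes so expensive.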
Gradient Descent: How Models Actually Learn
To minimize loss, we need to know:
- Which direction to move the weights
- How big a step to take
This is what gradients give us.
Gradient descent and its stochastic variant, SGD, let the model:
1. Look at an example
2. Measure how wrong it is
3. Nudge the weights in the direction that would reduce that error
Over many examples and many small updates, useful patterns start to emerge.
It's just repeated, guided correction.
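Here's a toy version of that loop: SGD training a logistic regression classifier on an invented four-example dataset. One convenient fact (not derived here) is that for logistic regression with cross-entropy loss, the gradient of the loss with respect to the score is simply `p - y`:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented dataset: (feature vector, label). The label happens to
# track the first feature, so the pattern is learnable.
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]

w = [0.0, 0.0]  # weights
b = 0.0         # bias
lr = 0.5        # learning rate: how big a step to take

for epoch in range(100):
    for x, y in data:
        # Look at an example and measure how wrong we are.
        p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
        error = p - y  # gradient of cross-entropy w.r.t. the score
        # Nudge the weights in the direction that reduces that error.
        w = [wi - lr * error * xi for wi, xi in zip(w, x)]
        b -= lr * error

print(w, b)  # first weight ends up strongly positive
```

No single update does much; the pattern emerges from many small, guided corrections.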
Evaluation Comes Back Into the Picture
Once we’re making decisions, evaluation looks different again.
Accuracy alone isn't enough, especially when classes are imbalanced.
This is why metrics like:
- Precision
- Recall
- F1 score
- Confusion matrices
- Micro vs. macro averaging
become central in classification tasks.
They force us to ask:
What kinds of mistakes do we care about?
And the answer depends entirely on the problem.
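A small worked example of why accuracy can mislead on imbalanced data. The confusion-matrix counts below are invented for a hypothetical spam filter:

```python
# 100 emails: 90 true negatives, 5 false positives,
# 3 false negatives, 2 true positives.
tp, fp, fn, tn = 2, 5, 3, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.92 -- looks great
precision = tp / (tp + fp)                          # of flagged emails, how many were spam?
recall = tp / (tp + fn)                             # of spam emails, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.92 precision=0.29 recall=0.40 f1=0.33
```

The model is right 92% of the time, yet it catches less than half the spam and most of what it flags is legitimate. Accuracy hides both failures; precision, recall, and F1 expose them.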
To recap:

- Language models predict probabilities
- Classifiers turn probabilities into decisions
- Loss functions measure how wrong those decisions are
- Optimization tells us how to improve
Next up, we’ll dig deeper into this decision-making side of NLP:
- Text classification in practice
- Binary vs. multiclass models
- Probabilistic outputs and thresholding
- Precision, recall, and why accuracy can mislead
- How logistic regression shaped early NLP systems, and what replaced it
Once we understand how models decide, we can start asking better questions about fairness, calibration, and real-world reliability.
Thanks for reading and learning along with me. See you in the next one!