4. From Predicting Words to Making Decisions
So far, we’ve focused on language models as generators.
They read text.
They predict what comes next.
They assign probabilities to words.
But for a long time, most NLP systems didn’t generate language at all.
They made decisions.
Is this email spam or not?
Is this review positive or negative?
Does this document belong to topic A or topic B?
This shift from predicting words to predicting labels is where early NLP systems spent most of their time. And it’s where ideas like logistic regression, cross-entropy loss, and gradient descent became foundational.
From Sequences to Labels
In language modeling, the output is a distribution over possible next tokens.
In classification, the output is simpler:
- A class label
- A probability over a small number of classes
Instead of asking:
“What word comes next?”
We ask:
“Which category does this input belong to?”
At first glance, this sounds easier. Fewer outputs. Clear answers.
In practice, it introduces a different set of challenges.
Binary Decisions and Probabilistic Thinking
Consider the simplest case: binary classification.
- Spam or not spam
- Positive or negative
- Relevant or irrelevant
We don’t just want a hard yes or no.
We want a probability.
A system that says:
“This email is spam with 51% confidence”
behaves very differently from one that says:
“This email is spam with 99% confidence”
Even though both output the same label.
This is why early NLP leaned so heavily on probabilistic models. They let us:
- Rank results
- Set thresholds
- Trade off precision and recall
- Make decisions under uncertainty
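As a tiny illustration of why probabilities matter, here's how probabilistic outputs support ranking and thresholding. The email names and scores below are made up for the sketch:

```python
# Hypothetical spam probabilities from some trained classifier.
scores = {"email_a": 0.51, "email_b": 0.99, "email_c": 0.10}

# Rank results: most-likely spam first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['email_b', 'email_a', 'email_c']

# Set a threshold: a stricter cutoff flags fewer emails,
# trading recall for precision.
threshold = 0.9
flagged = [email for email, p in scores.items() if p >= threshold]
print(flagged)  # ['email_b']
```

With a hard yes/no classifier, `email_a` (51%) and `email_b` (99%) would be indistinguishable; with probabilities, we can treat them very differently.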
Why Logistic Regression?
Logistic regression became the workhorse of early NLP because it was exactly what the problem needed.
It:
- Turns a weighted sum of features into a probability
- Produces outputs between 0 and 1
- Is interpretable
- Is fast and stable to train
- Plays nicely with sparse text features
Under the hood, it’s doing something simple but very powerful:
1. Score the input
2. Pass that score through a sigmoid
3. Interpret the result as a probability
That single design choice, modeling probabilities instead of decisions, made it incredibly useful.
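A minimal sketch of those three steps. The feature counts and weights here are invented for illustration, not learned from data:

```python
import math

def sigmoid(z):
    # Squash any real-valued score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Toy sparse text features (word counts) and made-up learned weights.
features = {"free": 2, "winner": 1, "meeting": 0}
weights = {"free": 1.2, "winner": 2.0, "meeting": -1.5}
bias = -1.0

# 1. Score the input: bias plus a weighted sum of features.
score = bias + sum(weights[w] * count for w, count in features.items())

# 2. Pass that score through the sigmoid.
p_spam = sigmoid(score)

# 3. Interpret the result as a probability.
print(f"P(spam) = {p_spam:.3f}")  # P(spam) = 0.968
```

Words that never appear contribute nothing to the sum, which is exactly why this pairs so well with sparse bag-of-words features.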
Learning Means Minimizing Loss
Once you frame classification probabilistically, a natural question follows:
How wrong is the model?
This is where loss functions come in.
Rather than counting mistakes, we measure how much probability the model assigns to the correct answer.
If the model is confident and right -> low loss
If the model is confident and wrong -> high loss
Cross-entropy loss captures this intuition cleanly.
It punishes confident mistakes far more than uncertain ones.
That turns learning into an optimization problem:
Adjust the weights to minimize average loss on the training data.
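For a binary label, cross-entropy is just the negative log of the probability the model placed on the correct answer. A quick sketch of how it punishes confident mistakes:

```python
import math

def cross_entropy(p_correct):
    # Negative log-probability of the true label.
    return -math.log(p_correct)

# Confident and right -> low loss
print(cross_entropy(0.99))  # ~0.01
# Uncertain -> moderate loss
print(cross_entropy(0.51))  # ~0.67
# Confident and wrong (only 1% on the true label) -> high loss
print(cross_entropy(0.01))  # ~4.61
```

The loss grows without bound as the probability on the true label approaches zero, which is what makes confident mistakes so expensive.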
Gradient Descent: How Models Actually Learn
To minimize loss, we need to know:
- Which direction to move the weights
- How big a step to take
This is what gradients give us.
Gradient descent and its stochastic variant, SGD, let the model:
1. Look at an example
2. Measure how wrong it is
3. Nudge the weights in the direction that would reduce that error
Over many examples and many small updates, useful patterns start to emerge.
It's just repeated, guided correction.
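Here's a toy version of that loop: SGD training a logistic regression classifier on an invented four-example dataset. One convenient fact (not derived here) is that for logistic regression with cross-entropy loss, the gradient of the loss with respect to the score is simply `p - y`:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented dataset: (feature vector, label). The label happens to
# track the first feature, so the pattern is learnable.
data = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]

w = [0.0, 0.0]  # weights
b = 0.0         # bias
lr = 0.5        # learning rate: how big a step to take

for epoch in range(100):
    for x, y in data:
        # Look at an example and measure how wrong we are.
        p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
        error = p - y  # gradient of cross-entropy w.r.t. the score
        # Nudge the weights in the direction that reduces that error.
        w = [wi - lr * error * xi for wi, xi in zip(w, x)]
        b -= lr * error

print(w, b)  # first weight ends up strongly positive
```

No single update does much; the pattern emerges from many small, guided corrections.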
Evaluation Comes Back Into the Picture
Once we’re making decisions, evaluation looks different again.
Accuracy alone isn't enough, especially when classes are imbalanced.
This is why metrics like:
- Precision
- Recall
- F1 score
- Confusion matrices
- Micro vs. macro averaging
become central in classification tasks.
They force us to ask:
What kinds of mistakes do we care about?
And the answer depends entirely on the problem.
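A small worked example of why accuracy can mislead on imbalanced data. The confusion-matrix counts below are invented for a hypothetical spam filter:

```python
# 100 emails: 90 true negatives, 5 false positives,
# 3 false negatives, 2 true positives.
tp, fp, fn, tn = 2, 5, 3, 90

accuracy = (tp + tn) / (tp + fp + fn + tn)          # 0.92 -- looks great
precision = tp / (tp + fp)                          # of flagged emails, how many were spam?
recall = tp / (tp + fn)                             # of spam emails, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.92 precision=0.29 recall=0.40 f1=0.33
```

The model is right 92% of the time, yet it catches less than half the spam and most of what it flags is legitimate. Accuracy hides both failures; precision, recall, and F1 expose them.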
To recap:

- Language models predict probabilities
- Classifiers turn probabilities into decisions
- Loss functions measure how wrong those decisions are
- Optimization tells us how to improve
Next up, we’ll dig deeper into this decision-making side of NLP:
- Text classification in practice
- Binary vs. multiclass models
- Probabilistic outputs and thresholding
- Precision, recall, and why accuracy can mislead
- How logistic regression shaped early NLP systems, and what replaced it
Once we understand how models decide, we can start asking better questions about fairness, calibration, and real-world reliability.
Thanks for reading and learning along with me. See you in the next one!