Posts

6. What Is the Model Actually Looking At?

In the last post, we talked about how probabilities turn into decisions. Thresholds. Trade-offs. Metrics. But there’s something deeper hiding underneath all of that. Before we evaluate decisions, before we argue about precision vs recall, before we tune thresholds, we need to ask: what is the model actually seeing? Because models don’t really see text. They see numbers. And the way we turn language into numbers determines what kinds of mistakes are even possible.

Multiclass Evaluation Gets Messy Fast

Binary classification is easy. One positive class. One negative class. A clean 2×2 confusion matrix. But multiclass changes the geometry. Instead of:

True positive
False positive
True negative
False negative

we now have a k × k confusion matrix. In a native multiclass confusion matrix, classes are not literally relabeled positive/negative; that’s an interpretation used to compute class-wise metrics. Now, apart from asking: how often were...
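The excerpt above can be made concrete with a minimal sketch (plain Python, made-up labels): a k × k confusion matrix indexed by true and predicted class, plus a helper that recovers the binary positive/negative interpretation for one class at a time. The function names are hypothetical, not from the post.

```python
def confusion_matrix(y_true, y_pred, k):
    """Rows = true class, columns = predicted class."""
    m = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

def one_vs_rest(m, c):
    """Reinterpret class c as 'positive' and every other class as 'negative'."""
    tp = m[c][c]                                     # correctly predicted c
    fn = sum(m[c]) - tp                              # true c, predicted other
    fp = sum(m[r][c] for r in range(len(m))) - tp    # other, predicted c
    tn = sum(sum(row) for row in m) - tp - fn - fp   # everything else
    return tp, fp, fn, tn

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
m = confusion_matrix(y_true, y_pred, 3)
print(one_vs_rest(m, 1))  # class 1 as "positive": (2, 1, 0, 3)
```

Class-wise metrics like per-class precision and recall are then computed from each of these one-vs-rest views, which is exactly why the positive/negative framing is an interpretation rather than a property of the matrix itself.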

5. Turning Probabilities into Decisions

In the last post, we talked about how NLP systems moved from generating language to making decisions. We saw how classifiers produce probabilities, how loss functions tell us how wrong we are, and how optimization nudges models toward better behavior. But there’s a step in this pipeline that often gets overlooked. At some point, a probability becomes a decision. And that step is where most failures happen.

Probabilities Are Not Decisions

A classifier rarely outputs a label directly. It outputs something like: 0.73. That number doesn’t mean “spam.” It means: “Given the model, the data, and the assumptions baked into training, this input looks spam-like with probability 0.73.” To turn that into a decision, we introduce a threshold:

If probability ≥ threshold -> positive class
Otherwise -> negative class

The most common threshold is 0.5, because it’s convenient. And that convenience hides trade-offs.

Thresholds Encode Values

Changing the threshold changes behavior. Lower the...
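The threshold step described above is tiny in code, which is part of why it gets overlooked. A minimal sketch (the function name and numbers are illustrative, not from the post):

```python
def decide(p_spam, threshold=0.5):
    """Turn a spam probability into a label.

    threshold=0.5 is a convention, not a law: raising it trades missed
    spam for fewer false alarms, lowering it does the opposite.
    """
    return "spam" if p_spam >= threshold else "not spam"

p = 0.73
print(decide(p))         # spam      (0.73 >= 0.5)
print(decide(p, 0.9))    # not spam  (stricter cutoff: fewer false alarms)
print(decide(p, 0.3))    # spam      (laxer cutoff: catches more spam)
```

Note that the model's output never changed; only the one-line comparison did, which is exactly where the trade-offs live.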

4. From Predicting Words to Making Decisions

So far, we’ve focused on language models as generators. They read text. They predict what comes next. They assign probabilities to words. But for a long time, most NLP systems didn’t generate language at all. They made decisions. Is this email spam or not? Is this review positive or negative? Does this document belong to topic A or topic B? This shift from predicting words to predicting labels is where early NLP systems spent most of their time. And it’s where ideas like logistic regression, cross-entropy loss, and gradient descent became foundational.

From Sequences to Labels

In language modeling, the output is a distribution over possible next tokens. In classification, the output is simpler:

A class label
Or a probability over a small number of classes

Instead of asking “What word comes next?” we ask “Which category does this input belong to?” At first glance, this sounds easier. Fewer outputs. Clear answers. In practice, it introduces a differen...
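The three ideas named above fit in a few lines. A toy sketch (plain Python, one feature, made-up numbers): logistic regression turns a score into a probability, cross-entropy measures how wrong that probability is, and gradient descent nudges the weight to reduce the loss.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(p, y):
    """Loss for predicted probability p when the true label is y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# One weight, one training example with feature x = 2.0 and label y = 1.
w, x, y, lr = 0.0, 2.0, 1, 0.1
for _ in range(100):
    p = sigmoid(w * x)
    grad = (p - y) * x   # d(cross_entropy)/dw for logistic regression
    w -= lr * grad       # one gradient descent step

print(cross_entropy(sigmoid(w * x), y))  # loss has shrunk toward 0
```

The gradient `(p - y) * x` is the standard closed form for logistic regression with cross-entropy loss; the loop simply repeats the nudge until the predicted probability approaches the label.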

3. How Do We Know a Language Model Is Any Good?

So far in this series, we’ve talked about how text becomes tokens and why early language models struggled with data sparsity. Now comes the obvious next question: how do we actually know whether a language model is good or bad? This turns out to be much harder than it sounds. Unlike image classification, where a model either labels an image correctly or it doesn’t, language models don’t usually have one “right” answer. For most inputs, there are many perfectly reasonable continuations. So evaluation in NLP is less about absolute correctness and more about how surprised the model is by real language, and how useful it is in practice.

Intrinsic vs Extrinsic Evaluation

Broadly, there are two ways to evaluate language models. Intrinsic evaluation measures the quality of the language model directly. You ask: how well does the model predict real text? Extrinsic evaluation measures performance on a downstream task. You ask: does this model help me do something useful, like classifica...
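"How surprised the model is" has a standard intrinsic measure: perplexity, the exponentiated average negative log-probability the model assigns to the actual tokens. A minimal sketch (the per-token probabilities below are invented for illustration, not from any model):

```python
import math

def perplexity(token_probs):
    """token_probs: the probability the model assigned to each real token.

    Lower perplexity = the model was less surprised by the text.
    """
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

confident = [0.9, 0.8, 0.95, 0.85]   # model rarely surprised
surprised = [0.1, 0.05, 0.2, 0.08]   # model frequently surprised
print(perplexity(confident) < perplexity(surprised))  # True
```

A useful intuition: a model that assigns probability 1/2 to every token has perplexity exactly 2, as if it were choosing uniformly between two options at each step.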