Learn to predict discrete categories with machine learning. Master logistic regression for binary classification, understand the sigmoid function and cross-entropy loss, extend to multiclass problems with softmax, and use feature engineering to handle non-linear patterns.
Classification is the task of predicting which category an input belongs to. Unlike regression, which outputs continuous values, classification outputs discrete class labels: spam or not spam, cat or dog or bird, malignant or benign.
The foundational algorithm is logistic regression, which despite its name is a classification method. It works by computing a linear combination of features, then passing the result through the sigmoid function to produce a probability between 0 and 1. If this probability exceeds a threshold (typically 0.5), we predict the positive class.
The key to understanding logistic regression is the loss function. Mean squared error is a poor fit because composing it with the sigmoid produces a non-convex loss surface. Instead, we use cross-entropy (log loss), which heavily penalizes confident wrong predictions and creates a smooth, convex optimization landscape.
The decision boundary in logistic regression is a hyperplane—the set of points where the predicted probability equals 0.5. Points on one side are classified as positive, points on the other as negative. This boundary is linear in the original feature space, but we can create non-linear boundaries through feature engineering.
For problems with more than two classes, we extend to multiclass classification using techniques like one-vs-rest (train K binary classifiers) or softmax regression (directly model probabilities for all K classes).
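As a minimal sketch of the softmax step mentioned above, the following turns K raw class scores into K probabilities. The scores and the three-class setup are hypothetical values chosen for illustration, not taken from the chapter.

```python
import math

def softmax(scores):
    """Turn K real-valued scores into K probabilities that sum to 1."""
    # Subtracting the max score leaves the result unchanged but avoids overflow in exp.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes (cat, dog, bird).
probs = softmax([2.0, 1.0, 0.1])

# The predicted class is the index with the highest probability.
predicted_class = max(range(len(probs)), key=lambda k: probs[k])
```

With two classes and scores [z, 0], the first softmax output equals the sigmoid of z, which is why softmax regression is the natural extension of the binary case.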
This chapter covers:
The core task — separate inputs into two classes using sigmoid probabilities and decision thresholds.
Discriminative vs. generative classifiers for binary problems
The workhorse classifier — maximum likelihood estimation with cross-entropy loss and linear decision boundaries.
A probabilistic approach using Bayes' theorem with conditional independence — fast, interpretable, strong baseline.
Extend binary methods to K classes with one-vs-rest, one-vs-one, or softmax regression.
Handling imbalanced data and engineering better features
Handle skewed class distributions with resampling, class weights, threshold tuning, and proper metrics.
Transform raw inputs into informative features — encoding, scaling, selection, and dimensionality reduction.
Binary classification assigns inputs to one of two classes. The goal is to find a decision boundary that separates the classes.
Key Challenge: We need probabilities, not raw scores. The sigmoid function maps any value to (0, 1).
Regression: continuous output (price, temperature). Classification: discrete categories (yes/no, A/B/C). Classification can output probabilities or hard labels.
Regression models E[y | x], a continuous conditional expectation. Classification models P(y | x) for discrete y, a probability distribution over categories. Using MSE for classification creates a loss surface with flat regions (gradient ≈ 0) where the sigmoid saturates, stalling gradient descent. Cross-entropy fixes this because, with p = σ(z), the loss −[y log p + (1 − y) log(1 − p)] has gradient p − y with respect to z, which remains large for a wrong prediction even when p is near 0 or 1.
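The contrast between the two losses can be checked numerically for a single example. This sketch assumes a squared-error loss (σ(z) − y)² and the standard binary cross-entropy, both differentiated with respect to the score z. The function names and the test point are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_grad(z, y):
    """Derivative with respect to z of (sigmoid(z) - y)**2."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def cross_entropy_grad(z, y):
    """Derivative with respect to z of -[y*log(p) + (1-y)*log(1-p)], p = sigmoid(z)."""
    return sigmoid(z) - y

# A confidently wrong prediction: the true label is 1 but the score is very negative.
z, y = -8.0, 1
g_mse = mse_grad(z, y)            # tiny: the p*(1-p) factor has saturated
g_ce = cross_entropy_grad(z, y)   # close to -1: still a strong learning signal
```

The extra factor p(1 − p) in the squared-error gradient is what vanishes when the sigmoid saturates. Cross-entropy cancels that factor, so the update stays proportional to the prediction error.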
Predict house price vs predict if house sells within 30 days. Which is regression, which is classification?
The sigmoid σ(z) = 1 / (1 + e^(−z)) maps any real number to (0, 1). Its output is interpreted as P(y=1|x). Derivative: σ'(z) = σ(z)(1-σ(z)), which is convenient for gradient computation.
The sigmoid σ(z) = 1 / (1 + e^(−z)) maps ℝ → (0, 1). Its derivative is σ'(z) = σ(z)(1 − σ(z)), which peaks at z = 0 (value 1/4) and vanishes exponentially as |z| → ∞. The sigmoid is the inverse of the canonical link function (the logit) for Bernoulli-distributed responses in generalized linear models. It arises naturally from the log-odds: if log(p / (1 − p)) = wᵀx + b (a linear function of features), then solving for p gives exactly the sigmoid, p = σ(wᵀx + b). The inverse is the logit function: logit(p) = log(p / (1 − p)).
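The relationships between the sigmoid, its derivative, and the logit can be sketched in a few lines. This is a plain illustrative version, and the function names are not from the chapter. The direct formula overflows for very negative z, so production code would use a numerically stable variant.

```python
import math

def sigmoid(z):
    """Plain sigmoid; fine for moderate z, overflows in exp for very negative z."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Uses the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def logit(p):
    """Log-odds of a probability p in (0, 1); the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

peak = sigmoid_derivative(0.0)      # maximum slope, reached at z = 0
tail = sigmoid_derivative(10.0)     # slope far from zero is nearly flat
round_trip = logit(sigmoid(1.5))    # recovers the original score up to rounding
```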
Calculate σ(0) and σ(2).
σ(0) = 0.5, σ(+∞) → 1, σ(-∞) → 0. Symmetric: σ(-z) = 1 - σ(z). Saturates at extremes (gradients become small).
The identity σ(−z) = 1 − σ(z) means the sigmoid is symmetric about the point (0, 0.5). This implies that if the linear score z flips sign, the predicted probabilities for class 1 and class 0 swap. Saturation at the extremes (σ(z) → 1 for z ≫ 0 and σ(z) → 0 for z ≪ 0) means the function is approximately linear only in a narrow band around z = 0, where σ(z) ≈ 0.5 + z/4. Outside this band, changes in z barely affect the output.
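Both properties are easy to confirm numerically. The sketch below uses a numerically stable sigmoid (an implementation choice made here, not something the chapter prescribes) and compares how much the output moves for the same step in z near zero and deep in the tail.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: never calls exp on a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Symmetry: flipping the sign of the score swaps the two class probabilities.
for z in (0.5, 2.0, 6.0):
    assert abs(sigmoid(-z) - (1.0 - sigmoid(z))) < 1e-12

# Saturation: the same step of 1.0 in z moves the output far less in the tail.
change_near_zero = sigmoid(0.5) - sigmoid(-0.5)
change_in_tail = sigmoid(8.5) - sigmoid(7.5)
```

The small change in the tail is the same effect that shrinks gradients when a unit saturates.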
If σ(z) = 0.73, what is σ(-z)?
Default threshold t=0.5, but adjust based on costs. Lower t = more positive predictions (higher recall, lower precision).
The decision rule ŷ = 1 if σ(z) ≥ t is equivalent to ŷ = 1 if z ≥ log(t / (1 − t)) (the logit of t). At t = 0.5, the boundary is z = 0. Lowering t to 0.1 shifts the boundary to z = log(1/9) ≈ −2.2, dramatically expanding the positive prediction region. With calibrated probabilities, the optimal threshold minimizes expected cost: t* = C_FP / (C_FP + C_FN), where C_FP and C_FN are the costs of false positives and false negatives respectively.
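A short sketch of thresholding in practice follows. It assumes the standard cost-based rule t* = C_FP / (C_FP + C_FN), which holds when the predicted probabilities are well calibrated. The probabilities, costs, and function names are hypothetical, chosen only for illustration.

```python
import math

def logit(t):
    """Score z at which sigmoid(z) equals the threshold t."""
    return math.log(t / (1.0 - t))

def predict(probabilities, t=0.5):
    """Hard labels: 1 where the probability reaches the threshold t, else 0."""
    return [1 if p >= t else 0 for p in probabilities]

def cost_optimal_threshold(cost_fp, cost_fn):
    """Threshold minimizing expected cost, assuming calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)

# Hypothetical probabilities and costs.
probs = [0.05, 0.2, 0.55, 0.9]
labels_default = predict(probs)           # threshold 0.5
labels_lenient = predict(probs, t=0.1)    # lower threshold, more positives
score_boundary = logit(0.1)               # same rule expressed on the score z
t_star = cost_optimal_threshold(cost_fp=1.0, cost_fn=9.0)
```

When a false negative is nine times as costly as a false positive, the rule gives a threshold of 0.1, so the model predicts positive far more readily.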
Probabilities: [0.3, 0.6, 0.45, 0.8]. Predictions at t=0.5? At t=0.4?
The boundary where P(y=1) = P(y=0) = 0.5. A hyperplane in feature space. Points on one side are class 1, other side class 0.
The decision boundary wᵀx + b = 0 is a hyperplane in ℝᵈ with normal vector w. The signed distance from any point x to this hyperplane is (wᵀx + b) / ‖w‖, which is proportional to the log-odds. Points farther from the boundary have higher confidence. The margin (distance between the nearest points of each class and the boundary) determines how robust the classifier is to small perturbations in the input.
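The signed distance to a linear boundary can be computed directly from the weights. In this sketch the weight vector w, bias b, and the two points are hypothetical, and the formula assumed is (w·x + b) / ‖w‖.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return score / norm

def classify(w, b, x):
    """Class 1 on the positive side of the boundary, class 0 otherwise."""
    return 1 if signed_distance(w, b, x) > 0 else 0

# Hypothetical boundary 3*x1 + 4*x2 - 5 = 0, so the norm of w is 5.
w, b = [3.0, 4.0], -5.0
d_pos = signed_distance(w, b, [3.0, 4.0])   # (9 + 16 - 5) / 5
d_neg = signed_distance(w, b, [0.0, 0.0])   # -5 / 5
```

The sign of the distance gives the class, and its magnitude grows with the model's confidence.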
Boundary: 2x₁ + 3x₂ - 6 = 0. Is point (1, 2) class 0 or 1?
Use precision-recall tradeoff. Medical diagnosis: lower threshold (catch more disease, accept false alarms). Spam filter: higher threshold (avoid false spam labels).
The ROC curve plots true positive rate against false positive rate as the threshold varies from 0 to 1. The area under this curve (AUC) measures discrimination ability independent of threshold. The precision-recall curve is more informative for imbalanced data: precision and recall directly reflect performance on the minority class. The F1 score, 2 · precision · recall / (precision + recall), summarizes the tradeoff as a single number.
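These metrics can be sketched without any library. The labels and scores below are hypothetical, and the AUC is computed through its pairwise-ranking interpretation (the fraction of positive-negative pairs ordered correctly, ties counted as half) rather than by integrating the curve.

```python
def precision_recall_f1(y_true, probs, t):
    """Precision, recall, and F1 for hard labels obtained at threshold t."""
    tp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= t and y == 0)
    fn = sum(1 for y, p in zip(y_true, probs) if p < t and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(y_true, probs):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]

strict = precision_recall_f1(y_true, probs, t=0.5)
lenient = precision_recall_f1(y_true, probs, t=0.38)
auc = roc_auc(y_true, probs)
```

Lowering the threshold here raises recall because one more true positive is caught, while the AUC is unchanged since it depends only on how the scores are ranked.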
Cancer detection: missing cancer costs 1K. How to set threshold?
A medical test has P(positive|disease) = 0.95 and P(positive|healthy) = 0.05. If 1% of people have the disease, what is P(disease|positive)?