Week 5-6

Chapter 3: Classification

Learn to predict discrete categories with machine learning. Master logistic regression for binary classification, understand the sigmoid function and cross-entropy loss, extend to multiclass problems with softmax, and use feature engineering to handle non-linear patterns.

Chapter Overview

Classification is the task of predicting which category an input belongs to. Unlike regression, which outputs continuous values, classification outputs discrete class labels: spam or not spam, cat or dog or bird, malignant or benign.

The foundational algorithm is logistic regression, which despite its name is a classification method. It works by computing a linear combination of features, then passing the result through the sigmoid function to produce a probability between 0 and 1. If this probability exceeds a threshold (typically 0.5), we predict the positive class.

The key to understanding logistic regression is the loss function. Mean squared error is a poor choice here: composed with the sigmoid, it produces a loss surface that is non-convex in the weights, with flat regions wherever the sigmoid saturates. Instead, we use cross-entropy (log loss), which heavily penalizes confident wrong predictions and is convex in the weights, giving a smooth optimization landscape.
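To make the "confident wrong predictions" point concrete, here is a minimal sketch of binary cross-entropy in plain Python. The function name `binary_cross_entropy` and the clipping constant `eps` are illustrative choices, not part of the chapter.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean log loss for labels in {0, 1} and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# One positive example, predicted confidently right and confidently wrong
loss_right = binary_cross_entropy([1], [0.99])  # about 0.01
loss_wrong = binary_cross_entropy([1], [0.01])  # about 4.6
```

The wrong prediction costs several hundred times more than the right one, which is the asymmetry the loss is designed to create.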

The decision boundary in logistic regression is a hyperplane—the set of points where the predicted probability equals 0.5. Points on one side are classified as positive, points on the other as negative. This boundary is linear in the original feature space, but we can create non-linear boundaries through feature engineering.
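As a rough illustration of how feature engineering bends the boundary, the sketch below scores points with a model that is linear in the squared features $x_1^2$ and $x_2^2$. The weights and bias are hypothetical values picked by hand (not learned) so that the boundary $z = 0$ is the unit circle in the original space.

```python
def squared_features(x1, x2):
    """Map the raw inputs to engineered features [x1^2, x2^2]."""
    return [x1 * x1, x2 * x2]

def linear_score(features, weights, bias):
    """z = w^T f + b, linear in the engineered features."""
    return sum(w * f for w, f in zip(weights, features)) + bias

# Hypothetical parameters: z = x1^2 + x2^2 - 1, so z = 0 is the unit circle
weights, bias = [1.0, 1.0], -1.0

inside = linear_score(squared_features(0.5, 0.5), weights, bias)   # -0.5, class 0
outside = linear_score(squared_features(1.5, 0.0), weights, bias)  # 1.25, class 1
```

The model is still a hyperplane in the engineered feature space; it only looks curved when drawn in terms of $x_1$ and $x_2$.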

For problems with more than two classes, we extend to multiclass classification using techniques like one-vs-rest (train K binary classifiers) or softmax regression (directly model probabilities for all K classes).
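A minimal sketch of the softmax step, assuming the K class scores have already been computed by some linear model. The scores below are made-up numbers for illustration; subtracting the maximum score is a common trick to avoid overflow and does not change the result.

```python
import math

def softmax(scores):
    """Turn K real-valued scores into K probabilities that sum to 1."""
    m = max(scores)                        # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])           # illustrative scores for 3 classes
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The predicted class is simply the index with the largest probability, which is also the index with the largest raw score.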

This chapter covers:

Binary Classification: The sigmoid function, decision thresholds, and linear decision boundaries
Logistic Regression: Maximum likelihood, cross-entropy loss, and why it works
Multiclass: One-vs-rest, one-vs-one, and softmax approaches
Naive Bayes: A probabilistic classifier based on Bayes' theorem
Class Imbalance: Handling datasets where classes have very different frequencies
Feature Engineering: Polynomial features and transformations to handle non-linear patterns

Chapter Roadmap

Binary Classification

The core task — separate inputs into two classes using sigmoid probabilities and decision thresholds.

Classification vs Regression, Sigmoid Function, Sigmoid Properties, Decision Threshold, Linear Decision Boundary, Threshold Tuning

Two algorithmic approaches

Discriminative vs. generative classifiers for binary problems

Logistic Regression

The workhorse classifier — maximum likelihood estimation with cross-entropy loss and linear decision boundaries.

Logistic Model, Cross-Entropy Loss, MLE Connection, Gradient for Logistic Regression, Odds and Log-Odds, Regularized Logistic Regression, Weight Interpretation

Naive Bayes

A probabilistic approach using Bayes' theorem with conditional independence — fast, interpretable, strong baseline.

Bayes' Theorem, Naive Independence Assumption, Classification Rule, Gaussian Naive Bayes, Multinomial Naive Bayes, Laplace Smoothing

Beyond two classes

Multiclass Classification

Extend binary methods to K classes with one-vs-rest, one-vs-one, or softmax regression.

One-vs-Rest (OvR), One-vs-One (OvO), Softmax (Multinomial), Softmax Properties, Cross-Entropy (Multiclass), Multiclass vs Multilabel

Real-world challenges

Handling imbalanced data and engineering better features

Class Imbalance

Handle skewed class distributions with resampling, class weights, threshold tuning, and proper metrics.

The Problem, Resampling: Oversampling, Resampling: Undersampling, Class Weights, Threshold Adjustment, Metrics for Imbalanced Data

Feature Engineering

Transform raw inputs into informative features — encoding, scaling, selection, and dimensionality reduction.

Polynomial Features, Feature Scaling, One-Hot Encoding, Label Encoding, Target Encoding, Feature Selection: Filter Methods, Feature Selection: Wrapper Methods, Dimensionality Reduction

Binary classification assigns inputs to one of two classes. The goal is to find a decision boundary that separates the classes.

Key Challenge: We need probabilities, not raw scores. The sigmoid function maps any value to (0, 1).

In this topic

1. Classification vs Regression
2. Sigmoid Function
3. Sigmoid Properties
4. Decision Threshold
5. Linear Decision Boundary
6. Threshold Tuning

1 of 6

Classification vs Regression

Regression: continuous output (price, temperature). Classification: discrete categories (yes/no, A/B/C). Classification can output probabilities or hard labels.

Mathematical Intuition

Regression models $E[y|\mathbf{x}] \in \mathbb{R}$, a continuous conditional expectation. Classification models $P(y = k | \mathbf{x})$ for discrete $k \in \{0, 1, \ldots, K-1\}$, a probability distribution over categories. Using MSE for classification creates a loss surface with flat regions (gradient $\approx 0$) where the sigmoid saturates, stalling gradient descent even when the prediction is badly wrong. Cross-entropy fixes this: for a positive example, the loss $-\log(\sigma(z))$ has gradient $\sigma(z) - 1$ with respect to $z$, which stays close to $-1$ when $\sigma(z)$ is near 0 (a confident wrong prediction) and shrinks toward 0 only as $\sigma(z)$ approaches 1 (a correct one).
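The gradient argument can be checked numerically. The sketch below compares the two gradients with respect to $z$ for a positive example that the model gets confidently wrong. It assumes the squared-error loss is written as $\frac{1}{2}(\sigma(z) - y)^2$; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)^2; includes the factor sigmoid'(z)."""
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

def grad_cross_entropy(z, y):
    """d/dz of the log loss; the sigmoid'(z) factor cancels out."""
    return sigmoid(z) - y

z, y = -10.0, 1                 # true label 1, model predicts ~0.00005
g_mse = grad_mse(z, y)          # tiny: learning stalls
g_ce = grad_cross_entropy(z, y) # close to -1: strong corrective signal
```

With MSE the extra factor $\sigma(z)(1 - \sigma(z))$ crushes the gradient in exactly the region where the model most needs correcting.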

Example:

Predict house price vs predict if house sells within 30 days. Which is regression, which is classification?

2 of 6

Sigmoid Function

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Maps any real number to (0, 1). Output is interpreted as P(y=1|x). Derivative: σ'(z) = σ(z)(1-σ(z)), which can be computed directly from the sigmoid's own output and keeps gradient computation cheap.

Mathematical Intuition

The sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$ maps $\mathbb{R} \to (0, 1)$. Its derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which peaks at $z = 0$ (value $0.25$) and vanishes exponentially as $|z| \to \infty$. The sigmoid is the canonical link function for Bernoulli-distributed responses in generalized linear models. It arises naturally from the log-odds: if $\log \frac{P(y=1)}{P(y=0)} = z$ (a linear function of features), then solving for $P(y=1)$ gives exactly the sigmoid. The inverse is the logit function: $\sigma^{-1}(p) = \log(p/(1-p))$.

Example:

Calculate σ(0) and σ(2).
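One way to work this example is with a small sigmoid helper. The sketch below uses a branch on the sign of $z$ so that `math.exp` never overflows for large negative inputs; that branching is an implementation choice, not something the chapter prescribes.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def logit(p):
    """Inverse of the sigmoid: the log-odds of p."""
    return math.log(p / (1.0 - p))

s0 = sigmoid(0.0)  # 0.5
s2 = sigmoid(2.0)  # about 0.881
```

Here `sigmoid_grad(0.0)` gives the peak derivative value 0.25, and `logit(sigmoid(2.0))` recovers 2.0 up to floating-point error.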

3 of 6

Sigmoid Properties

σ(0) = 0.5, σ(+∞) → 1, σ(-∞) → 0. Symmetric: σ(-z) = 1 - σ(z). Saturates at extremes (gradients become small).

Mathematical Intuition

The symmetry $\sigma(-z) = 1 - \sigma(z)$ means the sigmoid is symmetric about the point $(0, 0.5)$. This implies that if the linear score $z$ flips sign, the predicted probabilities for class 1 and class 0 swap. Saturation at the extremes ($\sigma(z) \approx 0$ for $z \ll 0$ and $\sigma(z) \approx 1$ for $z \gg 0$) means the function is approximately linear only in a narrow band around $z = 0$, where $\sigma(z) \approx 0.5 + 0.25z$. Outside this band, changes in $z$ barely affect the output.

Example:

If σ(z) = 0.73, what is σ(-z)?
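The symmetry property answers this directly: $\sigma(-z) = 1 - 0.73 = 0.27$. A short numerical check, which also probes how far the linear approximation $0.5 + 0.25z$ holds, might look like this (the test points 0.2 and 3.0 are arbitrary illustrative choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

# Recover the z that gives sigma(z) = 0.73, then flip its sign
z = logit(0.73)
flipped = sigmoid(-z)  # 0.27 up to floating-point error

# Error of the linear approximation near 0 and in the saturated region
err_near = abs(sigmoid(0.2) - (0.5 + 0.25 * 0.2))  # well under 0.001
err_far = abs(sigmoid(3.0) - (0.5 + 0.25 * 3.0))   # roughly 0.3
```

The approximation is tight close to zero and badly off by $z = 3$, where the true sigmoid has already flattened out.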

4 of 6

Decision Threshold

$\hat{y} = 1 \text{ if } \sigma(z) \geq t, \text{ else } 0$

Default threshold t=0.5, but adjust based on costs. Lower t = more positive predictions (higher recall, lower precision).

Mathematical Intuition

The decision rule $\hat{y} = 1$ if $\sigma(z) \geq t$ is equivalent to $\hat{y} = 1$ if $z \geq \log\frac{t}{1-t}$ (the logit of $t$). At $t = 0.5$, the boundary is $z = 0$. Lowering $t$ to 0.1 shifts the boundary to $z = \log(1/9) \approx -2.2$, dramatically expanding the positive prediction region. The threshold that minimizes expected cost is $t^* = \frac{C_{FP}}{C_{FP} + C_{FN}}$, where $C_{FP}$ and $C_{FN}$ are the costs of false positives and false negatives respectively. This assumes the predicted probabilities are well calibrated and that correct decisions cost nothing.

Example:

Probabilities: [0.3, 0.6, 0.45, 0.8]. Predictions at t=0.5? At t=0.4?
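The example can be worked with a few lines of Python. The helper name `predict_labels` is illustrative; it applies the rule $\hat{y} = 1$ if the probability is at least $t$.

```python
import math

def predict_labels(probs, t=0.5):
    """Hard labels from probabilities: 1 if p >= t, else 0."""
    return [1 if p >= t else 0 for p in probs]

def logit(p):
    return math.log(p / (1.0 - p))

probs = [0.3, 0.6, 0.45, 0.8]
at_050 = predict_labels(probs, 0.5)  # [0, 1, 0, 1]
at_040 = predict_labels(probs, 0.4)  # [0, 1, 1, 1]

# The same threshold expressed on the raw score z instead of the probability
z_cut = logit(0.1)                   # about -2.197
```

Lowering the threshold from 0.5 to 0.4 flips only the third example (0.45) to positive, which is the "more positive predictions" effect in miniature.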

5 of 6

Linear Decision Boundary

$z = \mathbf{w}^T \mathbf{x} + b = 0$

The boundary where P(y=1) = P(y=0) = 0.5. A hyperplane in feature space. Points on one side are class 1, other side class 0.

Mathematical Intuition

The decision boundary $\{\mathbf{x} : \mathbf{w}^T\mathbf{x} + b = 0\}$ is a hyperplane in $\mathbb{R}^p$ with normal vector $\mathbf{w}$. The signed distance from any point $\mathbf{x}_0$ to this hyperplane is $\frac{\mathbf{w}^T \mathbf{x}_0 + b}{\|\mathbf{w}\|}$, which is proportional to the log-odds. Points farther from the boundary have higher confidence. The margin (distance between the nearest points of each class and the boundary) determines how robust the classifier is to small perturbations in the input.

Example:

Boundary: 2x₁ + 3x₂ - 6 = 0. Is point (1, 2) class 0 or 1?
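A small sketch for working this example, reading the boundary as $\mathbf{w} = (2, 3)$ and $b = -6$, and assuming the convention from this topic that $z \geq 0$ maps to class 1:

```python
import math

def score(w, b, x):
    """Linear score z = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane w^T x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return score(w, b, x) / norm

w, b = [2.0, 3.0], -6.0
x = [1.0, 2.0]

z = score(w, b, x)                # 2*1 + 3*2 - 6 = 2.0
label = 1 if z >= 0 else 0        # 1
dist = signed_distance(w, b, x)   # 2 / sqrt(13), about 0.555
p = 1.0 / (1.0 + math.exp(-z))    # about 0.881
```

The point lies on the positive side of the line, about 0.55 units from it, so it is assigned class 1 with probability roughly 0.88.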

6 of 6

Threshold Tuning

Choose the threshold from the precision-recall tradeoff. Medical diagnosis: lower threshold (catch more disease, accept false alarms). Spam filter: higher threshold (avoid false spam labels).

Mathematical Intuition

The ROC curve plots true positive rate $TPR = TP/(TP+FN)$ against false positive rate $FPR = FP/(FP+TN)$ as the threshold varies from 0 to 1. The area under this curve (AUC) measures discrimination ability independent of threshold. The precision-recall curve is more informative for imbalanced data: precision $= TP/(TP+FP)$ and recall $= TP/(TP+FN)$ directly reflect performance on the minority class. The $F_1$ score $= 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ summarizes the tradeoff as a single number.
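To see the tradeoff on concrete numbers, the sketch below computes precision, recall, and $F_1$ at two thresholds. The labels and probabilities are a made-up toy dataset, and the helper names are illustrative.

```python
def confusion_counts(y_true, probs, t):
    """Return (TP, FP, FN, TN) for hard predictions at threshold t."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= t else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(y_true, probs, t):
    tp, fp, fn, _ = confusion_counts(y_true, probs, t)
    precision = tp / (tp + fp) if tp + fp else 0.0  # guard empty denominators
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data, invented for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probs = [0.9, 0.4, 0.65, 0.3, 0.2, 0.55, 0.8, 0.1]

at_050 = precision_recall_f1(y_true, probs, 0.5)   # (0.75, 0.75, 0.75)
at_025 = precision_recall_f1(y_true, probs, 0.25)  # precision 2/3, recall 1.0, F1 0.8
```

Dropping the threshold to 0.25 recovers the one missed positive (recall rises to 1.0) at the price of an extra false positive (precision falls to about 0.67).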

Example:

Cancer detection: missing cancer costs \$1M, false alarm costs \$1K. How to set threshold?
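One hedged way to approach this example is the expected-cost rule $t^* = \frac{C_{FP}}{C_{FP} + C_{FN}}$. The sketch assumes the model's probabilities are calibrated and that correct decisions cost nothing; the variable and function names are illustrative.

```python
C_FN = 1_000_000.0  # cost of missing a cancer (false negative)
C_FP = 1_000.0      # cost of a false alarm (false positive)

# Predict positive when p * C_FN >= (1 - p) * C_FP, i.e. p >= t_star
t_star = C_FP / (C_FP + C_FN)  # 1/1001, about 0.001

def expected_costs(p):
    """Expected cost of each decision for a patient with P(cancer) = p."""
    cost_if_flagged = (1.0 - p) * C_FP  # only pay if it was a false alarm
    cost_if_cleared = p * C_FN          # only pay if cancer was missed
    return cost_if_flagged, cost_if_cleared

flagged, cleared = expected_costs(0.01)  # about 990 vs 10000
```

Even at a 1% predicted probability, flagging the patient is roughly ten times cheaper in expectation than clearing them, so the threshold sits far below the default 0.5.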

Theory Exercise

Problem:

A medical test has P(positive|disease) = 0.95 and P(positive|healthy) = 0.05. If 1% of people have the disease, what is P(disease|positive)?

Hints:

This is Bayes' theorem
P(disease) = 0.01, P(healthy) = 0.99
Calculate P(positive) using total probability
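Following the hints, the computation can be sketched in a few lines. Variable names are illustrative, and the numbers come straight from the problem statement.

```python
p_disease = 0.01
p_healthy = 1.0 - p_disease
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# Total probability of a positive test
p_positive = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive  # about 0.161
```

Despite the test's 95% sensitivity, only about 16% of positive results correspond to real disease, because healthy people vastly outnumber sick ones.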
