Master Support Vector Machines, one of the most elegant algorithms in machine learning. Learn how SVMs find optimal separating hyperplanes with maximum margin, handle non-separable data with soft margins, and tackle non-linear problems through the powerful kernel trick.
Support Vector Machines represent a beautiful intersection of geometry, optimization, and learning theory. The core idea is elegantly simple: among all hyperplanes that separate two classes, choose the one with the maximum margin to the nearest training points.
This maximum margin principle has deep theoretical justification—it maximizes the geometric separation between classes, leading to better generalization. The points that lie exactly on the margin boundaries are called support vectors, and remarkably, only these points determine the decision boundary.
Real-world data is rarely perfectly separable, so soft margin SVMs allow some violations by introducing slack variables. The regularization parameter C controls the trade-off: high C penalizes misclassifications heavily (narrow margin, potential overfitting), while low C allows more errors (wide margin, potential underfitting).
The true power of SVMs comes from the kernel trick, which enables learning non-linear decision boundaries without explicitly computing high-dimensional feature mappings. By using kernel functions that compute inner products in transformed spaces, SVMs can learn complex patterns while remaining computationally tractable.
This chapter covers:
Click any topic to jump in
The geometric insight — find the separating hyperplane with the widest gap to nearest points.
Two extensions that make SVMs practical
Handle overlapping classes with slack variables and the C parameter for real-world data.
Learn non-linear boundaries by computing inner products in high-dimensional spaces implicitly.
Classification pipelines and regression adaptation
Feature scaling, hyperparameter tuning, multi-class strategies, and when to choose SVMs.
Adapt the SVM margin framework to regression with the epsilon-insensitive loss tube.
SVM finds the hyperplane that maximizes the distance to the nearest training points (support vectors). This margin maximization leads to better generalization.
Key insight: Only support vectors (points on the margin) matter; other points don't affect the decision boundary.
Decision boundary in n-dimensional space. w is the normal vector (perpendicular to hyperplane). b shifts the hyperplane from origin.
The hyperplane partitions into two half-spaces. The normal vector is perpendicular to the hyperplane, and is the signed distance from the origin. The sign of determines which side of the hyperplane a point falls on. In 2D this is a line, in 3D a plane, and in D a -dimensional affine subspace — the geometry is the same regardless of dimensionality.
w=[2,1], b=-3. Is point (2,1) on positive or negative side?
Positive if correctly classified. Can be made arbitrarily large by scaling w. Not meaningful alone.
The functional margin is positive when the point is correctly classified and its magnitude reflects confidence. However, replacing with doubles the functional margin without changing the hyperplane — the margin is not scale-invariant. This is why we need to normalize by to get a geometrically meaningful quantity, or equivalently constrain the optimization so that .
y=+1, wᵀx+b=5. Functional margin? If we double w and b?
Actual distance between hyperplane and nearest points. Scale-invariant. This is what we maximize.
The geometric margin is the perpendicular distance between the two margin boundaries . Maximizing is equivalent to minimizing , transforming a geometric problem into a convex quadratic program. The factor of is a convention that simplifies the gradient to instead of . Larger margins correspond to smoother decision boundaries with better generalization — this is formalized by VC dimension theory.
||w|| = 2. What's the margin width?
Training points exactly on the margin boundary (|wᵀx + b| = 1). Only these affect the solution—other points are irrelevant.
Support vectors are the training points where the constraint is active (equality holds). By the KKT complementarity conditions, non-support vectors have dual variables and contribute nothing to . This sparsity means the solution depends only on the hardest-to-classify points. Removing any non-support vector does not change the optimal hyperplane — the SVM is effectively determined by a small subset of the data.
1000 training points. SVM has 15 support vectors. How many points actually matter?
s.t.
Requires perfectly separable data. Minimizing ||w||² is equivalent to maximizing margin. Quadratic programming problem.
The hard margin optimization s.t. is a convex quadratic program with linear constraints. Strong duality holds (Slater's condition), so the dual gives the same solution. The dual depends only on inner products — this is the foundation for the kernel trick.
Data has one mislabeled point making classes overlap. Hard margin SVM?
Larger margin → more robust to small perturbations → better generalization. Theoretical guarantees from VC dimension theory.
VC theory provides the formal justification: the generalization error is bounded by where is the VC dimension. For a margin- classifier in a ball of radius , , independent of the ambient dimension. Maximizing directly minimizes this bound, explaining why maximum margin classifiers generalize well even in high-dimensional spaces.
Margin A: 0.1 units. Margin B: 2 units. New point is 0.5 units from boundary. Which classifies correctly?
In a trained SVM, you have 3 support vectors out of 100 training points. What happens to the decision boundary if you remove a non-support-vector point?