Essential mathematics for AI and ML: linear algebra, probability and statistics, information theory, calculus for optimization, matrix decompositions, and dimensionality reduction with PCA.
Mathematics is the foundation of all AI and machine learning. Every algorithm you encounter -- from image filters to neural networks -- relies on concepts from linear algebra, calculus, probability, statistics, and information theory. Understanding these foundations transforms you from someone who applies algorithms to someone who truly understands them.
Linear algebra provides the language: images are matrices of numbers, model parameters are vectors, and predictions are computed through matrix multiplication. When you see Y = XW + b, you are looking at how a neural network layer transforms its inputs. Understanding how matrices transform space -- through rotations, scaling, and projections -- gives you geometric intuition for what models actually do, whether in computer vision or machine learning.
Probability and statistics help you reason about uncertainty. Models don't make perfect predictions -- they estimate probabilities. Understanding distributions, expectations, and Bayes' theorem explains why certain loss functions work and how models quantify confidence. Maximum likelihood estimation connects probability to optimization, forming the basis of most training procedures.
Information theory, developed by Claude Shannon, provides the mathematical framework for measuring information content. Concepts like entropy, cross-entropy, and KL divergence are fundamental to understanding classification loss functions, generative models, and representation learning.
Calculus, specifically gradient-based optimization, is how models learn. The derivative tells us which direction improves predictions, and optimization algorithms follow this gradient to find the best parameters. The chain rule enables backpropagation in neural networks. Understanding convexity helps explain when optimization is easy versus hard.
Matrix decompositions like Singular Value Decomposition (SVD) are workhorses of both CV and ML. They power image compression, recommender systems, and dimensionality reduction, and they help reveal the geometry of learned representations.
Finally, eigendecomposition underlies dimensionality reduction. Principal Component Analysis (PCA) finds directions of maximum variance by computing eigenvectors of the covariance matrix, while eigenvalues of the Hessian reveal curvature of loss landscapes. In computer vision, eigenvalues power Harris corner detection, and eigenvectors form the basis of Eigenfaces.
This chapter provides the mathematical toolkit you will use throughout your journey:
Vector arithmetic, dot products, norms, and matrix ops — the atomic units of all ML computation.
Linear transformations, rotation, projection, determinants — how matrices encode geometric operations.
Probabilistic and information-theoretic foundations
Distributions, Bayes' theorem, MLE — the language of uncertainty in ML.
Entropy, cross-entropy, KL divergence — measuring information and defining loss functions.
Gradients, chain rule, Hessians, gradient descent — the engine behind neural network training.
Decompositions and dimensionality reduction
SVD, LU, QR, Cholesky — factoring matrices to solve systems and compress data.
Eigenvectors, spectral theorem, PCA — finding directions of maximum variance for dimensionality reduction.
Vectors are the language of AI. An image pixel location is a vector. A gradient direction is a vector. A deep learning feature embedding is a high-dimensional vector. In ML, your data is stored in matrices, model parameters are vectors, and operations like prediction are matrix multiplications.
In this topic, we explore how vectors represent both position and direction, and learn the operations that let us combine, compare, and transform them. The dot product is especially important -- it measures similarity between vectors, which underlies everything from template matching and attention mechanisms to linear regression predictions. We also cover core matrix operations like multiplication, transpose, rank, and inverse that form the computational backbone of all ML algorithms.
A vector is a quantity with both magnitude (length) and direction. In computer vision, vectors represent quantities such as pixel locations, gradient directions, and feature embeddings.
Vectors can have any number of dimensions. 2D vectors describe points in images, 3D vectors describe points in space, and feature vectors can have hundreds or thousands of dimensions.
Row vs Column Vectors: By convention, we typically use column vectors in linear algebra. A column vector with n entries is an n × 1 matrix, while a row vector is 1 × n. The transpose operation converts between them.
Geometric vs Algebraic View: Geometrically, a vector is an arrow from origin to a point. Algebraically, it's an ordered list of numbers. Both views are useful—the geometric view helps with intuition, while the algebraic view enables computation.
A vector in ℝⁿ is both a point in n-dimensional space and an arrow from the origin to that point. In ML, a 768-dim BERT embedding and a 2D pixel coordinate are both vectors; the algebra is identical, only the dimension changes.
Express the pixel at position (120, 80) in a 640x480 image as a vector, and find its transpose.
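The pixel exercise can be sketched in NumPy as below, assuming the position is stored in (x, y) order. A 2-D array is used so that the column and row orientations are explicit.

```python
import numpy as np

# Pixel position (x, y) = (120, 80) as a 2x1 column vector.
p = np.array([[120], [80]])

# Transposing a column vector gives a 1x2 row vector.
p_T = p.T

print(p.shape)    # (2, 1)
print(p_T.shape)  # (1, 2)
print(p_T)        # [[120  80]]
```

Note that a 1-D NumPy array has no row or column orientation, so `.T` leaves it unchanged; the 2-D shape is what makes the transpose visible here.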
Vector addition combines two vectors component-wise. Geometrically, place the tail of b at the tip of a; the result a + b points from a's tail to b's tip. Equivalently, it is the diagonal of the parallelogram spanned by a and b (the parallelogram rule).
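Properties: vector addition is commutative (a + b = b + a) and associative ((a + b) + c = a + (b + c)); the zero vector is its identity, and every vector a has an additive inverse −a.
Applications: accumulating gradients across mini-batches and summing the per-frame displacements of a tracked object.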
Adding vectors places them tip-to-tail. In gradient accumulation, each mini-batch contributes a gradient vector; the sum is the total update direction. Commutativity (a + b = b + a) means the order of accumulation doesn't matter.
A tracked object moves by d₁ in frame 1 and by d₂ in frame 2. Find the total displacement d₁ + d₂.
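The displacement exercise can be sketched in NumPy as below. The two displacement vectors are hypothetical values chosen for illustration, not the ones from the original exercise.

```python
import numpy as np

# Hypothetical per-frame displacements in pixels (illustrative values only).
d1 = np.array([3, -1])
d2 = np.array([2, 4])

# Vector addition is component-wise: (3 + 2, -1 + 4).
total = d1 + d2
print(total)  # [5 3]

# Commutativity: the order of accumulation does not change the result.
assert np.array_equal(d1 + d2, d2 + d1)
```

The same pattern applies to gradient accumulation: summing per-batch gradient arrays gives the same total regardless of the order in which batches arrive.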
The dot product (inner product) measures how much two vectors point in the same direction. It returns a scalar (single number): a · b = a₁b₁ + a₂b₂ + … + aₙbₙ = ‖a‖ ‖b‖ cos θ, where θ is the angle between a and b.
Projection Formula: The projection of a onto b is: proj_b(a) = ((a · b) / ‖b‖²) b.
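Applications in CV/ML: template matching, attention mechanisms, linear regression predictions, and cosine similarity between embeddings.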
The dot product projects one vector onto another. When a · b = 0, the vectors are orthogonal and neither has any component along the other. This is why orthogonal features are attractive in ML: they carry no linear redundancy.
Compute the cosine similarity between two image embeddings a and b.
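A small sketch of the dot product, projection, and cosine similarity follows. The two 3-dimensional "embeddings" are hypothetical stand-ins; real embeddings typically have hundreds of dimensions, but the arithmetic is the same.

```python
import numpy as np

# Hypothetical toy embeddings (illustrative values, not from the text).
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 0.0, 1.0])

# Dot product: 1*2 + 2*0 + 2*1 = 4.0
dot = a @ b

# Cosine similarity: dot product divided by the product of the L2 norms.
cos_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))

# Projection of a onto b: ((a . b) / ||b||^2) * b
proj_a_on_b = (dot / (b @ b)) * b

# The residual of a after removing its projection is orthogonal to b.
residual = a - proj_a_on_b

print(dot)          # 4.0
print(cos_sim)      # roughly 0.596
print(proj_a_on_b)  # components 1.6, 0.0, 0.8
```

Cosine similarity depends only on direction, so rescaling either vector by a positive constant leaves it unchanged.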
The magnitude or L2 norm (Euclidean norm) is the vector's length: ‖v‖₂ = √(v₁² + v₂² + … + vₙ²). A unit vector has magnitude 1, and any nonzero vector can be normalized to one: v̂ = v / ‖v‖₂.
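Other Important Norms: the L1 (Manhattan) norm ‖v‖₁ = |v₁| + |v₂| + … + |vₙ|, and the L∞ (max) norm ‖v‖∞ = max |vᵢ|.
Why Normalize? Normalizing vectors is essential when you care about direction but not scale, for example when comparing embeddings by cosine similarity.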
The L2 norm is Euclidean distance from the origin. L1 encourages sparsity because its unit ball has corners on the axes, pushing small weights to exactly zero — the geometric reason Lasso produces sparse solutions.
Normalize the gradient vector g to a unit vector.
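The norms and the normalization step can be sketched as follows. The gradient vector here is a hypothetical value (a 3-4-5 triangle keeps the numbers clean), not the one from the original exercise.

```python
import numpy as np

# Hypothetical gradient vector (illustrative values only).
g = np.array([3.0, 4.0])

l2 = np.linalg.norm(g)                 # sqrt(3^2 + 4^2) = 5.0
l1 = np.linalg.norm(g, ord=1)          # |3| + |4| = 7.0
linf = np.linalg.norm(g, ord=np.inf)   # max(|3|, |4|) = 4.0

# Unit vector: same direction as g, length 1.
g_hat = g / l2

print(l2, l1, linf)  # 5.0 7.0 4.0
print(g_hat)         # [0.6 0.8]
```

In practice the division should be guarded against a zero norm, for example by adding a small epsilon to the denominator.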
Matrix multiplication is the core operation in ML. A neural network layer's forward pass is Y = XW + b, written Y = X @ W + b in NumPy. Note that matrix multiplication is associative but NOT commutative: AB ≠ BA in general.
Each element Cᵢⱼ of the product C = AB is the dot product between row i of A and column j of B. A neural network layer computes every output neuron as a dot product of the input with a learned weight vector.
Multiply A = [[1,2],[3,4]] by B = [[5,6],[7,8]]
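The multiplication exercise can be checked with NumPy's `@` operator. The sketch also confirms the row-times-column reading of a single entry and shows that reversing the order gives a different result.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

C = A @ B
print(C)
# [[19 22]
#  [43 50]]

# Entry C[0, 0] is the dot product of row 0 of A with column 0 of B:
# 1*5 + 2*7 = 19
assert C[0, 0] == A[0, :] @ B[:, 0]

# Matrix multiplication is not commutative in general.
D = B @ A
print(D)
# [[23 34]
#  [31 46]]
```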
The transpose flips rows and columns. Key property: (AB)ᵀ = BᵀAᵀ. It is used constantly in gradient calculations and in the normal equation.
Transposing swaps rows and columns: (Aᵀ)ᵢⱼ = Aⱼᵢ. The identity (AB)ᵀ = BᵀAᵀ reverses multiplication order; the same reversal appears when backpropagation propagates gradients backward through layers.
Find the transpose of A = [[1,2,3],[4,5,6]]
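A short NumPy check of the transpose exercise and of the identity (AB)ᵀ = BᵀAᵀ. The matrix `B` below is a hypothetical 3×2 matrix introduced only so the product is defined.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)

A_T = A.T                             # shape (3, 2)
print(A_T)
# [[1 4]
#  [2 5]
#  [3 6]]

# Hypothetical 3x2 matrix, chosen only to make A @ B well defined.
B = np.array([[1, 0], [0, 1], [2, -1]])

# Transposing a product reverses the order of the factors.
assert np.array_equal((A @ B).T, B.T @ A.T)
```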
Rank is the dimension of the column space (equivalently, of the row space). A rank-deficient matrix has linearly dependent rows or columns. A square matrix is invertible exactly when it has full rank.
Rank counts the number of linearly independent rows (or columns). A rank-deficient weight matrix means some neurons compute redundant features. In practice, the weight updates needed to adapt a trained network often have low effective rank, which is part of the motivation for LoRA fine-tuning.
Find rank of A = [[1,2],[2,4]]
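The rank exercise can be checked with `np.linalg.matrix_rank`, which estimates rank numerically from the singular values. The identity matrix is included only as a full-rank comparison.

```python
import numpy as np

A = np.array([[1, 2], [2, 4]])

# The second row is 2x the first, so only one row is independent.
rank_A = np.linalg.matrix_rank(A)
print(rank_A)  # 1

# For comparison, the 2x2 identity has full rank.
rank_I = np.linalg.matrix_rank(np.eye(2))
print(rank_I)  # 2
```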
Only square, full-rank matrices are invertible. The inverse is used in the closed-form solution to linear regression: w = (XᵀX)⁻¹Xᵀy. The pseudo-inverse handles non-invertible cases.
If Ax = b then x = A⁻¹b, but only when A is square and full rank. The normal equation w = (XᵀX)⁻¹Xᵀy gives closed-form linear regression, but it fails when features are collinear (XᵀX is singular).
Find inverse of A = [[2,1],[5,3]]
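A sketch of the inverse exercise, plus the pseudo-inverse on a singular matrix. The singular matrix `S` is a hypothetical example added here to illustrate the non-invertible case.

```python
import numpy as np

A = np.array([[2.0, 1.0], [5.0, 3.0]])

# det(A) = 2*3 - 1*5 = 1, so A is invertible.
# By the 2x2 formula, the inverse is [[3, -1], [-5, 2]].
A_inv = np.linalg.inv(A)
assert np.allclose(A_inv, [[3.0, -1.0], [-5.0, 2.0]])
assert np.allclose(A @ A_inv, np.eye(2))

# Hypothetical singular matrix (rank 1): it has no inverse,
# but the Moore-Penrose pseudo-inverse is still defined.
S = np.array([[1.0, 2.0], [2.0, 4.0]])
S_pinv = np.linalg.pinv(S)

# Defining property of the pseudo-inverse: S @ S_pinv @ S == S.
assert np.allclose(S @ S_pinv @ S, S)
```

For solving linear systems or least-squares problems, `np.linalg.solve` and `np.linalg.lstsq` are usually preferable to forming an explicit inverse.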
Given vectors a = (3, 4) and b = (1, 2), calculate: (a) a + b, (b) a · b, (c) the angle between them.
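One way to check the three parts of this exercise numerically, as a minimal NumPy sketch:

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# (a) Component-wise sum: (3 + 1, 4 + 2) = (4, 6)
s = a + b

# (b) Dot product: 3*1 + 4*2 = 11
dot = a @ b

# (c) Angle from cos(theta) = (a . b) / (||a|| ||b||) = 11 / (5 * sqrt(5))
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
theta_deg = np.degrees(np.arccos(cos_theta))  # about 10.3 degrees

print(s, dot, round(theta_deg, 1))
```

The small angle reflects that the two vectors point in nearly the same direction, which is also why the cosine is close to 1.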