Assume that we are given a matrix . Each row of the matrix is considered to be an observation represented by a data vector that measures features of some phenomenon. We can think of Principal Component Analysis (PCA) as trying to trying to solve two related problems.
- Compression: How do we represent the data matrix succinctly? In other words, is there an efficient representation of that uses less space while not sacrificing too much accuracy?
- Prediction: Which (linear) combinations of the features best explain or influence the data?
Mathematically, the compression problem can be formulated as trying to identify a low rank approximation of that minimizes its Frobenius norm. Assume that the rank of is given by Then for some given we want to find such that is the solution to the optimization problem
As it happens, the solution to the above optimization problem will allow us also to identify the combination of the predictors that best capture the variability in the data. It also happens to have a nice analytical solution.
Low Rank Matrix Approximation Theorem
Represent using its singular value decomposition. Then,
where we assume that the singular values are sorted in descending order. For some define the matrix
Then the low rank matrix approximation theorem tells us that minimizes the squared Frobenius norm among all matrices with rank less than or equal to . Moreover, we know that the following properties hold:
- is unique if and only if .
- The minimum is given by .
Represent the data matrix as where is a column vector that represents the data corresponding to observation where . Then,
where the second equality follows from the fact that the columns of are orthonormal. Looking at the above a bit more closely, we see that the -th row of can be represented as
The above equation gives us precisely the lower dimensional representation of the observation . In fact, the above equation represents the observation as the sum of its projections along the basis vectors , which can be interpreted as the reduced set of features that we use for representing our observations. In short, we represent each observation by its projection onto the subspace spanned by the basis vectors . Since the projections lie in a lower dimensional subspace (), we obtain a compressed representation of the original data.
At the outset we stated that PCA identifies the (linear) combinations of features that best explain or influence the data. Let’s make this notion slightly more precise. Consider a random vector with mean zero that takes values in according to some unknown distribution . Assume that the covariance matrix of exists and is given by . Think of each data vector as a realization of that we get to observe. We have such data vectors which constitutes our data matrix .
Identifying the direction along which varies the most can be formulated as identifying the vector that maximizes . This can be written down as the optimization problem,
The Rayleigh-Ritz Theorem tells us that the maximum value of the above optimization is the largest eigenvalue, say , of the covariance matrix . This value is obtained when is the normalized eigenvector, say , corresponding to the largest eigenvalue. Indeed, we can obtain the second largest eigenvalue by adding the constraint that should be perpendicular to . We can obtain the other eigenvalues and eigenvectors in a similar fashion.
In the singular value decomposition of , the column vectors of correspond to the eigenvectors of , which is an approximation of scaled by . In fact, by the law of large numbers,
Thus the directions that PCA chooses are precisely the directions along which the data is most variable.