Tags: machine-learning, pca

Simple explanation of PCA to reduce the dimensionality of a dataset


I know that PCA does not tell you which features of a dataset are the most significant, but which combinations of features keep the most variance.

How could you use the fact that PCA rotates the dataset so that it has the most variance along the first dimension, the second most along the second, and so on, to reduce the dimensionality of the dataset?

More specifically, how are the first N eigenvectors used to transform the feature vectors into a lower-dimensional representation that keeps most of the variance?


Solution

  • Let X be an N x d matrix where each row X_{n,:} is a vector from the dataset.

    Then, assuming the columns of X have been mean-centered, X'X is (proportional to) the covariance matrix, and an eigendecomposition gives X'X = UDU', where U is a d x d matrix of eigenvectors with U'U = I and D is a d x d diagonal matrix of eigenvalues.

    The form of the eigendecomposition means that U'X'XU = U'UDU'U = D. Since U'X'XU = (XU)'(XU) is exactly the (scaled) covariance matrix of the transformed data, this means that if you transform your dataset by U, the new dataset XU has a diagonal covariance matrix: the transformed features are uncorrelated.

    If the eigenvalues are ordered from largest to smallest, this also means that the variance of the first transformed feature (proportional to U_1'X'XU_1 = \sum_n (\sum_d X_{n,d} U_{d,1})^2, where U_1 is the first column of U) will be larger than that of the second, the second larger than the third, and so on.

    If we order the transformed features from largest to smallest variance, and then simply discard the features with small variance (provided the retained variances are much larger than the discarded ones), we haven't lost much information. That is the concept; a short numerical sketch follows below.
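To make the recipe concrete, here is a minimal NumPy sketch of the steps above. The toy data, the choice k = 2, and all variable names are just for illustration, not taken from the question:

```python
import numpy as np

# Toy dataset with correlated features (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# 1. Mean-center so that X'X is proportional to the covariance matrix.
Xc = X - X.mean(axis=0)

# 2. Eigendecomposition of X'X = U D U'.
cov = Xc.T @ Xc
eigvals, U = np.linalg.eigh(cov)   # eigh: for symmetric matrices, ascending order

# 3. Reorder eigenvalues (and matching eigenvectors) from largest to smallest.
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# 4. Keep only the first k eigenvectors and project: each row of Z is the
#    lower-dimensional representation of the corresponding row of X.
k = 2
Z = Xc @ U[:, :k]

# Fraction of the total variance kept by the first k transformed features.
print("variance kept:", eigvals[:k].sum() / eigvals.sum())
```

`np.linalg.eigh` is used here because X'X is symmetric; in practice an SVD of the centered data matrix gives the same projection more stably, but the eigendecomposition matches the derivation above.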