matlab, machine-learning, pca

Principal component analysis


I have to write a classifier (a Gaussian mixture model) for human action recognition. I have 4 video datasets; I use 3 of them as the training set and 1 as the testing set. Before applying the GMM to the training set, I run PCA on it:

pca_coeff = princomp(training_data);
score = training_data * pca_coeff;
training_data = score(:,1:min(size(score,2),numDimension));

During the testing step, what should I do? Should I run a new princomp on the testing data:

new_pca_coeff=princomp(testing_data);
score = testing_data * new_pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));

or should I use the pca_coeff that I computed from the training data?

score = testing_data * pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));

Solution

  • The classifier is trained on data in the space defined by the principal components of the training data. It doesn't make sense to evaluate it in a different space, so you should apply the same transformation to the testing data that you applied to the training data: don't compute a new pca_coeff.
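
    For concreteness, here is a minimal sketch of that pipeline (assuming numDimension is defined as in your code; note that princomp centres its input internally, so strictly the training mean should also be subtracted before projecting either set):

    mu = mean(training_data);            % training mean; princomp centres internally
    pca_coeff = princomp(training_data); % principal directions from training data only

    % project training data onto the training components
    score = bsxfun(@minus, training_data, mu) * pca_coeff;
    training_data = score(:, 1:min(size(score,2), numDimension));

    % project testing data with the SAME coefficients and the SAME mean
    score = bsxfun(@minus, testing_data, mu) * pca_coeff;
    testing_data = score(:, 1:min(size(score,2), numDimension));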

    Incidentally, if your testing data is drawn independently from the same distribution as the training data, then for large enough training and test sets, the principal components should be approximately the same.

    One method for choosing how many principal components to use involves examining the eigenvalues from the PCA decomposition. You can get these from the princomp function like this:

    [pca_coeff, score, eigenvalues] = princomp(data);
    

    The eigenvalues variable will then be an array, sorted in decreasing order, where each element describes the amount of variance accounted for by the corresponding principal component. If you do:

    plot(eigenvalues);
    

    you should see that the first eigenvalue is the largest, and that they decrease rapidly (this is called a "Scree Plot", and should look like this: http://www.ats.ucla.edu/stat/SPSS/output/spss_output_pca_5.gif, though yours may have up to 800 points instead of 12).

    Principal components with small corresponding eigenvalues are unlikely to be useful, since the variance of the data in those dimensions is so small. Many people choose a threshold value and then select all principal components whose eigenvalues are above that threshold. An informal way of picking the threshold is to look at the Scree plot and choose it to be just after the point where the line 'levels out'; in the image I linked earlier, a good value might be ~0.8, selecting 3 or 4 principal components, as in the sketch below.
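
    That selection rule might look like this (the 0.8 threshold is just the value eyeballed from the example plot, not a general constant):

    threshold = 0.8;                        % eyeballed from the Scree plot
    num_pcs = sum(eigenvalues > threshold); % eigenvalues are sorted in decreasing order,
                                            % so this count is also a valid column index
    reduced_data = score(:, 1:num_pcs);     % keep only the selected components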

    IIRC, you could do something like:

    proportion_of_variance = sum(eigenvalues(1:k)) / sum(eigenvalues);
    

    to calculate "the proportion of variance described by the low-dimensional data"; the example below shows how to turn that into a rule for choosing k.
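
    For example, to pick the smallest k that retains a given fraction of the total variance (95% here is an arbitrary choice):

    retained = cumsum(eigenvalues) / sum(eigenvalues); % variance kept by the first k PCs
    k = find(retained >= 0.95, 1);                     % smallest k retaining at least 95%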

    However, since you are using the principal components for a classification task, you can't really be sure that any particular number of PCs is optimal; the variance of a feature doesn't necessarily tell you anything about how useful it will be for classification. An alternative to choosing PCs with the Scree plot is simply to try classification with various numbers of principal components and see empirically which number works best; a sketch of such a search follows.
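
    A sketch of that empirical search (train_score and test_score are assumed to hold the full PCA projections of the training and testing data, and evaluate_classifier is a hypothetical helper that trains the GMM and returns test accuracy):

    % evaluate_classifier is a hypothetical helper that trains your GMM on the
    % reduced training data and returns accuracy on the reduced testing data
    best_k = 0;
    best_acc = -Inf;
    for k = 1:min(size(train_score,2), 50)  % candidate numbers of PCs
        acc = evaluate_classifier(train_score(:,1:k), test_score(:,1:k));
        if acc > best_acc
            best_acc = acc;
            best_k = k;
        end
    end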