python machine-learning pca

PCA Explained Variance Analysis


I'm very new to PCA. I have 10 X variables for my model. These are the X variable labels:

x = ['Day', 'Month', 'Year', 'Rolling Average', 'Holiday Effect', 'Day of the Week', 'Week of the Year', 'Weekend Effect', 'Last Day of the Month', 'Quarter']

This is the graph I generated from the explained variance, with the x-axis being the principal component. These are the plotted values:

[  3.47567089e-01   1.72406623e-01   1.68663799e-01   8.86739892e-02
   4.06427375e-02   2.75054035e-02   2.26578769e-02   5.72892368e-03
   2.49272688e-03   6.37160140e-05]

I need to know whether I have a good selection of features, and how I can tell which feature contributes the most.

from sklearn import decomposition

# Fit PCA on the normalized features, keeping all components
pca = decomposition.PCA()
pca.fit(X_norm)
# Explained variance of each principal component
scores = pca.explained_variance_

Solution

  • Though I do NOT know the dataset, I recommend that you scale your features before using PCA: PCA maximizes variance along the component axes, so features on larger numeric scales would otherwise dominate the components. I think X_norm in your code refers to the scaled data.
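
    For reference, a minimal sketch of that preprocessing step (X here stands in for your raw feature matrix, which I don't have):

        from sklearn.preprocessing import StandardScaler
        from sklearn import decomposition

        # Standardize each feature to zero mean and unit variance so that
        # no feature dominates the components purely because of its scale
        X_norm = StandardScaler().fit_transform(X)

        pca = decomposition.PCA()
        pca.fit(X_norm)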

    By using PCA, we aim to reduce dimensionality. We start with a feature space that includes all of your X variables and end up with a projection of that space onto a (typically lower-dimensional) feature subspace.

    In practice, when your features are correlated, PCA can capture that shared information in fewer dimensions.

    Think about it this way: if I'm holding a piece of paper covered in dots on my desk, do I need a 3rd dimension to represent that dataset? Probably not, since all the dots lie on the paper and could be represented in 2D space.
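
    A tiny synthetic sketch of that intuition (not your data): points lying on a tilted plane in 3D come out of PCA as essentially two-dimensional.

        import numpy as np
        from sklearn import decomposition

        rng = np.random.default_rng(0)
        ab = rng.normal(size=(200, 2))
        # The third coordinate is a linear combination of the first two,
        # so the cloud of points actually lives on a 2D plane
        points = np.column_stack([ab[:, 0], ab[:, 1],
                                  0.5 * ab[:, 0] - 0.3 * ab[:, 1]])

        pca = decomposition.PCA()
        pca.fit(points)
        print(pca.explained_variance_ratio_)  # third value is ~0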

    When you are deciding how many principal components to keep from your new feature space, you can look at the explained variance: it tells you how much of the information (variance) each principal component carries.

    When I look at the principal components in your data, I see that ~85% of the variance can be attributed to the first 6 principal components.
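
    You can check that with a running total of the values you posted (if they came from explained_variance_ rather than explained_variance_ratio_, divide by their sum first to get fractions):

        import numpy as np

        # Explained-variance values from the question, in component order
        scores = np.array([3.47567089e-01, 1.72406623e-01, 1.68663799e-01,
                           8.86739892e-02, 4.06427375e-02, 2.75054035e-02,
                           2.26578769e-02, 5.72892368e-03, 2.49272688e-03,
                           6.37160140e-05])

        # Cumulative variance covered by the first k components
        print(np.cumsum(scores))  # 6th entry is ~0.85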

    You can also set n_components explicitly. For example, if you use n_components=2, your transformed dataset will have 2 features.
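
    A short sketch of that, again assuming X_norm is your scaled feature matrix:

        from sklearn import decomposition

        # Keep only the first 2 principal components
        pca = decomposition.PCA(n_components=2)
        X_reduced = pca.fit_transform(X_norm)  # shape: (n_samples, 2)

        # Each row of pca.components_ shows how strongly every original
        # feature loads on that component, which is one way to see which
        # features contribute the most
        print(pca.components_)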