Tags: python, machine-learning, scikit-learn, pca

How to map the results of Principal Component Analysis back to the actual features that were fed into the model?


When I run the code below, I see 'pca.explained_variance_ratio_' and a bar chart, which shows the proportion of variance explained by each principal component.

import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)

pca = PCA(n_components=11)
principalComponents = pca.fit_transform(df)


# Plotting the variances for each PC
PC = range(1, pca.n_components_+1)
plt.bar(PC, pca.explained_variance_ratio_, color='gold')
plt.xlabel('Principal Components')
plt.ylabel('Variance %')
plt.xticks(PC)

[bar chart: explained variance ratio for each principal component]

How can I map PC1 and PC2 back to the original features in the data frame?


Solution

  • Each PC is a linear combination of your variables. For example, PC1 will be (where X1, X2, ..., Xp represent the original variables):

    PC1 = w11*X1 + w12*X2 + ... + w1p*Xp

    So w11, w12, ... would be your loadings, and they represent each feature's influence on the associated PC. Basically, they show how strongly each feature correlates with that PC (there is a direct check of this at the end of this answer). See this post or maybe something like this blog for an explanation with sklearn.
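    To make the linear-combination point concrete, here is a minimal self-contained sketch (my addition, on random data rather than mtcars) showing that sklearn's PC scores are exactly the centered data multiplied by those weight vectors:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))                # any numeric matrix
    pca = PCA().fit(X)

    # Each PC score is a weighted sum of the centered variables;
    # the weights w11, w12, ... are the rows of pca.components_.
    scores = pca.transform(X)
    manual = (X - pca.mean_) @ pca.components_.T
    print(np.allclose(scores, manual))          # True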

    In your example, you did not scale your data before PCA, so the loadings would mostly reflect raw feature scale: the column with the largest variance would dominate the first component.
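    As a quick check of that claim (my addition, reusing the df and imports from the question), fit PCA on the raw frame and sort the absolute PC1 weights; disp, whose variance (roughly 15,000) dwarfs every other column's, should come out on top:

    # PC1 weights of an unscaled fit, largest magnitude first.
    # Expectation: disp dominates because of its sheer variance.
    raw = PCA(n_components=2).fit(df)
    print(pd.Series(raw.components_[0], index=df.columns).abs().sort_values(ascending=False))

    So let's do this properly: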

    import statsmodels.api as sm
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt
    
    mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
    df = pd.DataFrame(mtcars)
    X = StandardScaler().fit_transform(df)
    pca = PCA(n_components=11)
    pca.fit(X)
    
    # Plotting the variances for each PC
    PC = range(1, pca.n_components_+1)
    plt.bar(PC, pca.explained_variance_ratio_, color='gold')
    plt.xlabel('Principal Components')
    plt.ylabel('Variance %')
    plt.xticks(PC)
    

    [bar chart: explained variance ratio for each principal component, scaled data]
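    If you want the numbers behind that chart (my addition), the per-component ratios and their running total are:

    # Per-PC and cumulative proportion of explained variance.
    evr = pd.DataFrame({
        "ratio": pca.explained_variance_ratio_,
        "cumulative": pca.explained_variance_ratio_.cumsum(),
    }, index=["PC" + str(i + 1) for i in range(pca.n_components_)])
    print(evr.round(3))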

    These are the loadings:

    PCnames = ['PC' + str(i + 1) for i in range(pca.n_components_)]
    # pca.components_ has shape (n_components, n_features): one row per PC.
    # Transpose it so that rows are the original features and columns the PCs.
    Loadings = pd.DataFrame(pca.components_.T, index=df.columns, columns=PCnames)

    Loadings.iloc[:, :2]
    
                PC1        PC2
        mpg      0.362531   0.016124
        cyl     -0.373916   0.043744
        disp    -0.368185  -0.049324
        hp      -0.330057   0.248784
        drat     0.294151   0.274694
        wt      -0.346103  -0.143038
        qsec     0.200456  -0.463375
        vs       0.306511  -0.231647
        am       0.234943  -0.429418
        gear     0.206925  -0.462349
        carb    -0.214018  -0.413571
    

    Roughly, this means observations with a high mpg will have higher scores on the first component (PC1), while observations with a high cyl, disp or wt will have lower scores. And you can check each feature's influence:

    Loadings["PC1"].sort_values().plot.barh()
    

    [horizontal bar chart: features sorted by their PC1 loading]
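    Finally, the correlation interpretation from the top of this answer can be verified directly (my addition, reusing X, pca and Loadings from above): with standardized data, each feature's correlation with PC1 equals its PC1 weight times the standard deviation of the PC1 scores.

    # Empirical correlation of every feature with the PC1 scores.
    pc1 = pca.transform(X)[:, 0]
    corr = pd.Series(
        [np.corrcoef(X[:, j], pc1)[0, 1] for j in range(X.shape[1])],
        index=df.columns,
    )
    # Same numbers as weight * std of the scores (np.std's default ddof=0
    # matches StandardScaler, so the normalizations agree).
    print(np.allclose(corr, Loadings["PC1"] * pc1.std()))   # True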