When I run the code below, I get `pca.explained_variance_ratio_` and a bar plot showing the proportion of variance explained by each principal component.
import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
pca = PCA(n_components=11)
principalComponents = pca.fit_transform(df)
# Plotting the variances for each PC
PC = range(1, pca.n_components_+1)
plt.bar(PC, pca.explained_variance_ratio_, color='gold')
plt.xlabel('Principal Components')
plt.ylabel('Variance %')
plt.xticks(PC)
How can I map PC1 and PC2 back to the original features in the data frame?
Each PC is a linear combination of your variables. For example, PC1 is (where X1, X2, ... are your original features):

PC1 = w11*X1 + w12*X2 + ... + w1p*Xp

So w11, w12, ... would be your loadings, and they represent each feature's influence on the associated PC. Basically, they show how strongly each feature correlates with that PC. See this post or maybe something like this blog for an explanation with sklearn.

In your example, you did not scale your data before PCA, so the variable with the largest magnitude would dominate the loadings. So let's do this properly:
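As a quick sanity check of the linear-combination view (a sketch on made-up data, not mtcars), the scores sklearn returns are exactly the dot products of the mean-centered features with the rows of `pca.components_`:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data, just to illustrate what a PC score is
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

pca = PCA(n_components=2).fit(X)

# Each row of pca.components_ holds the weights (loadings) w11, w12, ...
# for one PC; an observation's PC1 score is the dot product of its
# mean-centered feature vector with that weight vector.
scores_manual = (X - X.mean(axis=0)) @ pca.components_.T
scores_sklearn = pca.transform(X)

print(np.allclose(scores_manual, scores_sklearn))  # True
```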
import statsmodels.api as sm
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
X = StandardScaler().fit_transform(df)
pca = PCA(n_components=11)
pca.fit(X)
# Plotting the variances for each PC
PC = range(1, pca.n_components_+1)
plt.bar(PC, pca.explained_variance_ratio_, color='gold')
plt.xlabel('Principal Components')
plt.ylabel('Variance %')
plt.xticks(PC)
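If the point of that bar plot is to decide how many PCs to keep, the cumulative ratio is often handier. A minimal sketch (on made-up data, since the idea doesn't depend on mtcars):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))

pca = PCA().fit(X)

# explained_variance_ratio_ sums to 1 when all components are kept;
# the cumulative sum shows how much variance the first k PCs capture
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.90)) + 1  # PCs needed for 90% of the variance
print(k, cumulative[-1])
```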
These are the loadings:
PCnames = ['PC' + str(i + 1) for i in range(pca.n_components_)]
# pca.components_ has shape (n_components, n_features), so transpose it
# to get a (feature x PC) table of loadings
Loadings = pd.DataFrame(pca.components_.T, columns=PCnames, index=df.columns)
Loadings.iloc[:, :2]
PC1 PC2
mpg 0.362531 -0.373916
cyl 0.016124 0.043744
disp -0.225744 -0.175311
hp -0.022540 -0.002592
drat 0.102845 0.058484
wt -0.108797 0.168554
qsec 0.367724 0.057278
vs 0.754091 0.230825
am -0.235702 -0.054035
gear 0.139285 -0.846419
carb 0.124896 0.140695
Roughly, this means observations with a high vs
will have higher scores on the first component (PC1), while those with a high am
will have lower scores. And you can check each feature's influence:
Loadings["PC1"].sort_values().plot.barh()
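To convince yourself that the sign of a loading matches the direction of a feature's association with the PC scores, you can check the correlations directly. A sketch on made-up standardized data (not mtcars):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
raw = rng.normal(size=(80, 3))
raw[:, 1] += raw[:, 0]          # make two features correlated
X = StandardScaler().fit_transform(raw)

pca = PCA().fit(X)
scores = pca.transform(X)

# For centered data, cov(feature_j, PC1 score) = loading_1j * var(PC1),
# so each feature's correlation with PC1 carries the sign of its loading.
for j in range(X.shape[1]):
    r = np.corrcoef(X[:, j], scores[:, 0])[0, 1]
    assert np.sign(r) == np.sign(pca.components_[0, j])
print("loading signs match correlation signs")
```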