I have a data frame called `data_principal_components` with dimensions (306x21154), i.e. 306 observations and 21154 features. Using PCA, I want to project the data onto 10 dimensions. As far as I understand, the following code does this; the resulting matrix `projected` has dimensions (306x10).
```
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
# Sample data:
# Define the dimensions of the DataFrame
num_rows = 306
num_cols = 21154
# Generate random numbers from a normal distribution
data = np.random.randn(num_rows, num_cols)
# Create a DataFrame from the random data
data_principal_components = pd.DataFrame(data)
pca = PCA(10)
projected = pca.fit_transform(data_principal_components)
```
To better understand how the code works, I wanted to reproduce the result of `pca.fit_transform()` manually. Based on my research, I found the following steps:
```
pc_components = pca.components_  # the eigenvectors, shape (10, 21154)
pc_components = pc_components.transpose()  # transpose so the shape is (21154, 10)
eigenvalues = pca.explained_variance_  # the eigenvalues, shape (10,)
```
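To sanity-check that `pca.components_` really holds eigenvectors and `pca.explained_variance_` the matching eigenvalues, one can compare them against the covariance matrix directly. This is a sketch on a small random matrix (the 306x21154 case works the same way, just slower):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((306, 50))  # small feature count to keep the check fast

pca = PCA(10).fit(X)
eigvecs = pca.components_.T        # shape (50, 10)
eigvals = pca.explained_variance_  # shape (10,)

# Covariance of the data; np.cov uses ddof=1, matching sklearn's scaling
cov = np.cov(X, rowvar=False)

# Each column v should satisfy cov @ v ≈ lambda * v
for v, lam in zip(eigvecs.T, eigvals):
    assert np.allclose(cov @ v, lam * v)
```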
Now, as I understand, one can calculate the loadings using the following code based on the formula $\text{loadings} = \text{eigenvectors} \times \sqrt{\text{eigenvalues}}$ :
```
# Create an empty DataFrame
df = pd.DataFrame()
# Scale each eigenvector by the square root of its eigenvalue
for i in range(len(eigenvalues)):
    result = pc_components[:, i] * np.sqrt(eigenvalues[i])
    df[f'Result_{i+1}'] = result  # assign the result as a new column
loadings = df
```
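As an aside, the loop above can be collapsed into a single broadcasted expression. A sketch with a freshly fitted `PCA` object (broadcasting scales column i of `components_.T` by the square root of eigenvalue i):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data_principal_components = pd.DataFrame(rng.standard_normal((306, 100)))

pca = PCA(10)
pca.fit(data_principal_components)

# Broadcasting: (100, 10) * (10,) scales each column by its sqrt-eigenvalue
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

# Same result as the column-by-column loop
loop_loadings = np.column_stack(
    [pca.components_.T[:, i] * np.sqrt(pca.explained_variance_[i]) for i in range(10)]
)
assert np.allclose(loadings, loop_loadings)
```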
After obtaining the loadings with dimensions (21154x10), I wanted to use them to obtain the projected values with $ \text{Actual values} \times \text{loadings}$ resulting in dimensions (306x21154) $\times$ (21154x10) = (306x10):
```
test = np.dot(data_principal_components, loadings)
```
However, when I compare `test` to `projected`, the values differ substantially. Where am I wrong?
EDIT
I found this way to extract the loadings. However, I still want to derive them semi-manually; can someone help?
```
pca = PCA(10)  # project down to 10 dimensions
projected = pca.fit_transform(data_principal_components)
loadings = pd.DataFrame(pca.components_.T,
                        columns=[f'PC{i+1}' for i in range(10)],
                        index=data_principal_components.columns)
loadings
```
You ask for two different things here:
I wanted to reproduce the result of pca.fit_transform() manually

When you ask about the result of `pca.fit_transform()`, you're asking about the principal component (PC) scores.
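That relationship can be demonstrated directly. A minimal sketch on a random matrix, showing that the scores are the *centered* data projected onto the eigenvectors (the centering is the step that the question's `test = np.dot(...)` calculation omits):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((306, 100))

pca = PCA(10)
scores = pca.fit_transform(X)

# PC scores = centered data projected onto the eigenvectors
manual = (X - X.mean(axis=0)) @ pca.components_.T

assert np.allclose(scores, manual)
```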
I found this way to extract the loadings. However, I still want to derive them semi-manually, can someone help?
Here you ask about the loadings.
In a nutshell, using singular value decomposition you can factor your (centered) data matrix as X = U S V.T, where U holds the left singular vectors, S is the diagonal matrix of singular values, and the columns of V (the rows of V.T) are the eigenvectors of the covariance matrix, i.e. the principal directions.

So this allows you to answer your second question: you obtain the loadings by applying an SVD to your data matrix:
```
data = data - data.mean(0)  # don't forget to center your data
U, S, VT = np.linalg.svd(data, full_matrices=False)  # economy SVD; full matrices would be huge here
loadings = VT.T  # your loadings
```
Now, the result of `pca.fit_transform()` is just the projection of your centered data onto these loadings:

```
PC_scores = np.dot(data, loadings[:, :10])  # use the first 10 components only
```
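Putting it all together, here is a sketch that checks the SVD route against `fit_transform` on a random matrix. The sign of each SVD component is arbitrary (sklearn applies its own sign convention), so the comparison is on absolute values:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((306, 100))

Xc = X - X.mean(axis=0)
U, S, VT = np.linalg.svd(Xc, full_matrices=False)
scores_svd = Xc @ VT.T[:, :10]  # equivalently U[:, :10] * S[:10]

scores_pca = PCA(10).fit_transform(X)

# Each component may come out with a flipped sign, hence the abs comparison
assert np.allclose(np.abs(scores_svd), np.abs(scores_pca))
```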