Tags: python, scikit-learn, pca

PCA in Python: Reproducing pca.fit_transform() results using pca.fit()?


I have a data frame called data_principal_components with dimensions (306x21154), so 306 observations and 21154 features. Using PCA, I want to project the data into 10 dimensions.

As far as I understand, the following code does this. The resulting matrix (projected) has a dimension of (306x10).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Sample data:
# Define the dimensions of the DataFrame
num_rows = 306
num_cols = 21154

# Generate random numbers from a normal distribution
data = np.random.randn(num_rows, num_cols)

# Create a DataFrame from the random data
data_principal_components = pd.DataFrame(data)

pca = PCA(10)  
projected = pca.fit_transform(data_principal_components)
```

To better understand how the code works, I wanted to reproduce the result of pca.fit_transform() manually.

Based on my research, I found the following steps:

```python
pc_components = pca.components_  # the eigenvectors, shape (10, 21154)
pc_components = pc_components.transpose()  # transpose to shape (21154, 10)
eigenvalues = pca.explained_variance_  # the eigenvalues, shape (10,)
```
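As a quick sanity check (my addition, using small random data rather than the 306x21154 frame), the rows of `pca.components_` are orthonormal eigenvectors, so the transposed matrix times itself gives the identity:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))

pca = PCA(5).fit(X)
pc_components = pca.components_.T  # shape (20, 5)

# Orthonormal columns: pc_components.T @ pc_components is (close to) the identity
print(np.allclose(pc_components.T @ pc_components, np.eye(5)))  # True
```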

Now, as I understand it, one can calculate the loadings with the following code, based on the formula $\text{loadings} = \text{eigenvectors} \times \sqrt{\text{eigenvalues}}$:

```python
# Create an empty DataFrame
df = pd.DataFrame()

# Scale each eigenvector by the square root of its eigenvalue
for i in range(len(eigenvalues)):
    result = pc_components[:, i] * np.sqrt(eigenvalues[i])
    df[f'Result_{i+1}'] = result  # Assign result as a new column in the DataFrame

loadings = df
```
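Incidentally, the column-by-column loop can be collapsed into a single broadcasted expression; a minimal sketch on small random data (not the 306x21154 frame):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))

pca = PCA(5).fit(X)

# Broadcast sqrt(eigenvalues) across the columns of the eigenvector matrix
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # shape (20, 5)
```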

After obtaining the loadings with dimensions (21154x10), I wanted to use them to obtain the projected values via $\text{data} \times \text{loadings}$, i.e. (306x21154) $\times$ (21154x10) = (306x10):

```python
test = np.dot(data_principal_components, loadings)
```

However, when I compare test to projected, the values differ substantially. Where am I wrong?
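(For reference, `pca.fit_transform()` projects the *centered* data onto the eigenvectors themselves, `pca.components_.T`, not onto the eigenvalue-scaled loadings; a minimal sketch on small random data:)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))

pca = PCA(5)
projected = pca.fit_transform(X)

# Center with the fitted mean, then project onto the eigenvectors
manual = (X - pca.mean_) @ pca.components_.T

print(np.allclose(projected, manual))  # True
```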

EDIT

I found this way to extract the loadings. However, I still want to derive them semi-manually. Can someone help?

```python
pca = PCA(10)  # project onto 10 dimensions
projected = pca.fit_transform(data_principal_components)

loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i}' for i in range(1, 11)],
    index=data_principal_components.columns,
)
loadings
```

Solution

  • You ask for two different things here:

    I wanted to reproduce the result of pca.fit_transform()

    When you ask about the result of pca.fit_transform(), you're asking about the principal component (PC) scores.

I found this way to extract the loadings. However, I still want to derive them semi-manually, can someone help?

    Here you ask about the loadings.

In a nutshell, using singular value decomposition (SVD), you can decompose your data matrix as X = U S V.T, where U holds the left singular vectors, S the singular values, and the columns of V are the loadings.

    So this allows you to answer your second question: You obtain the loadings by applying a SVD to your data matrix:

```python
data = data - data.mean(0)  # don't forget to center your data
U, S, VT = np.linalg.svd(data, full_matrices=False)  # thin SVD
loadings = VT.T  # your loadings
```

    Now, the result of pca.fit_transform() is just the projection of your centered data onto these loadings:

```python
PC_scores = np.dot(data, loadings[:, :10])  # use the first 10 components only
```
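One caveat worth noting: the signs of individual components from a raw `np.linalg.svd` can differ from scikit-learn's (which flips them deterministically), so a like-for-like check compares absolute values; a sketch on small random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))

pca = PCA(10)
projected = pca.fit_transform(X)

Xc = X - X.mean(0)
U, S, VT = np.linalg.svd(Xc, full_matrices=False)

# PC scores are U * S (equivalently Xc @ VT.T); compare magnitudes
# because raw SVD signs may be flipped relative to sklearn's
scores = (U * S)[:, :10]
print(np.allclose(np.abs(projected), np.abs(scores)))  # True
```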