[SOLVED] Inconsistent PCA Results with Different Multiprocessing Settings in sklearn

I'm encountering an issue with PCA in sklearn while using multiprocessing. Specifically, the reconstruction error in PCA varies significantly based on the number of processes set in Pool. For instance, using Pool(processes=4) yields a small error (np.abs(tmp_matrix-X_train).max()<1e-2), but increasing to Pool(processes=5) or higher results in a substantial error, with np.abs(tmp_matrix-X_train).max() averaging around 10 for each column. This behavior is observed while using the Intel sklearnex package.

I've tested various combinations and observed the following patterns:

Stable (low error): 20 cpu+processes=1, 80 cpu+processes=1, 80 cpu+processes=4, 120 cpu+processes=5
Unstable (high error): 80 cpu+processes=5, 100 cpu+processes=5, 120 cpu+processes=5 (yes, 120+5 is unstable)

Here's the relevant portion of my code:

import numpy as np
from sklearnex import patch_sklearn
patch_sklearn()  # must run before importing from sklearn
from sklearn.decomposition import PCA
from functools import partial
from multiprocessing import Pool

def config_selection_single(df_entry: tuple, _some_arguments_including_data_object):
    # some pre-processing code
    for some_iteration_condition:
        # some data processing and transformation to bound data non-NaN and between [-1e20,1e20]
        for another_iteration_condition:
            z_mean = X[train_cond][:].mean()
            z_std = X[train_cond][:].std() + 1e-10
            X_train = (X[train_cond][:] - z_mean) / z_std  # X_train has shape ~ 2e4 x 50

            pca = PCA(n_components=20, svd_solver='full')
            p_model = pca.fit(X_train)
            Q = p_model.transform(X_train)
            tmp_matrix = p_model.inverse_transform(Q)
            # Sanity check: transform() should match a manual projection onto the components.
            if not np.allclose(Q, X_train.dot(p_model.components_.transpose())):
                print("reconstruction error is huge!")
                print(np.abs(tmp_matrix - X_train).max())  # max reconstruction error

# Bind the extra argument by keyword so that pool.map supplies df_entry as the first argument.
config_selection_partial = partial(
    config_selection_single,
    _some_arguments_including_data_object=_some_arguments_including_data_object,
)
with Pool(processes=4) as pool:  # 4 is good, 5 and 6 are bad
    pool.map(config_selection_partial, list(my_df.items()))

Unfortunately, I could not find a small demo dataset that reproduces the issue.
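For anyone who wants to attempt a repro, a harness could be structured roughly like the sketch below. This is not my actual pipeline: the synthetic low-rank data, the function name max_reconstruction_error, and the sizes are all assumptions made for illustration. It uses plain scikit-learn as written, so the patch_sklearn() call would need to be added at the top to exercise sklearnex. The idea is to compute the same reconstruction error serially and under several pool sizes, then compare the numbers.

```python
from multiprocessing import Pool

import numpy as np
from sklearn.decomposition import PCA


def max_reconstruction_error(seed: int, n_samples: int = 2000,
                             n_features: int = 50, n_components: int = 20) -> float:
    """Fit PCA on synthetic standardized data; return the max abs reconstruction error."""
    rng = np.random.default_rng(seed)
    # Hypothetical data: a rank-20 signal plus small noise, so that 20
    # components should reconstruct it almost exactly.
    latent = rng.normal(size=(n_samples, n_components))
    mixing = rng.normal(size=(n_components, n_features))
    X = latent @ mixing + 1e-3 * rng.normal(size=(n_samples, n_features))
    X_train = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-10)

    pca = PCA(n_components=n_components, svd_solver="full").fit(X_train)
    reconstructed = pca.inverse_transform(pca.transform(X_train))
    return float(np.abs(reconstructed - X_train).max())


if __name__ == "__main__":
    seeds = list(range(8))
    serial = np.array([max_reconstruction_error(s) for s in seeds])
    for n_proc in (1, 4, 5):
        with Pool(processes=n_proc) as pool:
            parallel = np.array(pool.map(max_reconstruction_error, seeds))
        # A large gap here would mean the pool size changes the PCA result.
        print(n_proc, np.abs(parallel - serial).max())
```

Seeding each task separately keeps the input identical across runs, so any difference between the serial and parallel numbers comes from the execution environment rather than from the data.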

Any insights on why the number of processes affects PCA precision?

Solution

[To answer my own question] It turned out to be a bug in scikit-learn-intelex==2023.1.1. After I updated to scikit-learn-intelex==2024.0.1, the results look good.
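To avoid silently running on an affected install, the patch can be applied only when the installed version is at least the one that worked for me. The sketch below is one way to do that with the standard library. The helper names release_tuple and intelex_version_ok are made up for illustration, and the threshold only reflects the two versions I tested; I did not check any release in between.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Version that gave good results in this post; an assumption, not an official fix version.
KNOWN_GOOD = (2024, 0, 1)


def release_tuple(version_string: str) -> tuple:
    """Turn '2024.0.1' into (2024, 0, 1), dropping any non-numeric suffix."""
    numbers = []
    for piece in version_string.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def intelex_version_ok(minimum: tuple = KNOWN_GOOD) -> bool:
    """Return True if scikit-learn-intelex is installed at `minimum` or newer."""
    try:
        installed = version("scikit-learn-intelex")
    except PackageNotFoundError:
        return False
    return release_tuple(installed) >= minimum


if intelex_version_ok():
    from sklearnex import patch_sklearn
    patch_sklearn()  # patch before importing anything from sklearn
```

If the check fails, the code simply falls back to stock scikit-learn, which is slower but did not show the problem.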