[python] [scikit-learn] [sparse-matrix] [pca] [arpack]

PCA with arpack returns different values when the order of observations changes, but why?


I have recently noticed that when I change the order of the observations in a sparse array, scikit-learn PCA with svd_solver="arpack" returns different floating-point numbers. Is this expected behavior? Thanks in advance! Here is an example:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
import scipy.sparse as sp

# Simulate a sparse gene expression array containing two datasets, dataset1 and dataset2, each with 5000 observations
n_obs = 10000 # Total number of observations
n_features = 2000 # Total number of features

# Simulate a concatenated dataset with the order of dataset1 + dataset2
rng = np.random.default_rng(42)
X1 = rng.uniform(0, 5, size=(n_obs, n_features)).astype(np.float32)

# Change the order of the datasets so that it is in the form of dataset2 + dataset1
X2 = np.concatenate((X1[5000:], X1[:5000]), axis=0)


X1_sparse = sp.csr_matrix(X1)
X2_sparse = sp.csr_matrix(X2)
print("Matrix 1 shape:", X1_sparse.shape, "nnz:", X1_sparse.nnz)
print("Matrix 2 shape:", X2_sparse.shape, "nnz:", X2_sparse.nnz)

# PCA with ARPACK on X1
pca1 = PCA(n_components=20, svd_solver="arpack", random_state=321)
pcs1 = pca1.fit_transform(X1_sparse.toarray())

# PCA with ARPACK on X2
pca2 = PCA(n_components=20, svd_solver="arpack", random_state=321)
pcs2 = pca2.fit_transform(X2_sparse.toarray())

# Wrap in DataFrames with fake cell IDs
obs_names = [f"cell{i}" for i in range(n_obs)]
pcs1 = pd.DataFrame(pcs1, index=obs_names)
pcs2 = pd.DataFrame(pcs2, index=obs_names[5000:] + obs_names[0:5000])
pcs2 = pcs2.loc[pcs1.index]

# Compare results
print("\nAre components numerically close (1e-6)?",
      np.allclose(np.abs(pcs1), np.abs(pcs2), atol=1e-6))

I was expecting the two PCA arrays to be exactly the same. For example, pcs1.iloc[0,0] returns -2.9909344 while pcs2.iloc[0,0] returns -2.9909678; these differ by 3.34e-5.

The maximum absolute difference is about 0.00677, and the mean absolute difference is about 0.00056.


Solution

  • In short: yes, this is to be expected. However, the differences should be small, as they are caused by the precision limits of floating-point arithmetic.


    In more detail:

    sklearn.decomposition.PCA centers the data by subtracting the per-feature mean. From the documentation:

    mean_ : ndarray of shape (n_features,)

    Per-feature empirical mean, estimated from the training set.

    Equal to X.mean(axis=0).

    Under the hood, this mean is computed by summing floats. However, floating-point addition is not associative, as explained here: the order of the summands changes how intermediate results are rounded. So if you change the order of the samples, the accumulated rounding error during the summation can produce a slightly different mean.
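    A minimal sketch of both effects (the values here are illustrative, not taken from the question):

    import numpy as np

    # Floating-point addition is not associative:
    a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
    print((a + b) + c)  # 1.0
    print(a + (b + c))  # 0.0 -- (b + c) rounds back to -1e8 in float32

    # As a consequence, the empirical mean can depend on the row order:
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 5, size=100_000).astype(np.float32)
    perm = rng.permutation(x.size)
    print(x.mean(), x[perm].mean())  # typically differ in the last digits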
    On top of that, ARPACK, being an iterative solver, is inherently sensitive to small input differences: it can amplify them over its iterations, as the sketch below suggests.
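    You can observe this directly with scipy.sparse.linalg.svds, the routine scikit-learn's arpack solver relies on (a sketch; the exact magnitude of the drift varies by platform):

    import numpy as np
    from scipy.sparse.linalg import svds

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 5, size=(2000, 300)).astype(np.float32)
    Xc = X - X.mean(axis=0)  # center once, identically for both orders
    perm = rng.permutation(Xc.shape[0])

    # Singular values are mathematically invariant to row permutations...
    s1 = svds(Xc, k=10, return_singular_vectors=False)
    s2 = svds(Xc[perm], k=10, return_singular_vectors=False)

    # ...but the iterative solver typically does not reproduce them bit-for-bit
    print(np.max(np.abs(np.sort(s1) - np.sort(s2))))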

    When these two factors are combined, mathematically identical inputs can produce results that agree only up to a limited precision, which is exactly what you observed.

    What you can do to achieve higher precision for "randomly" ordered data (one option is sketched below):
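    For instance, one common mitigation (my illustration; whether it fits depends on your memory budget) is to run the whole pipeline in float64 instead of float32, which pushes the order-dependent error down by several orders of magnitude. A scaled-down sketch of the question's setup:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(42)
    X1 = rng.uniform(0, 5, size=(2000, 500)).astype(np.float64)  # float64; smaller dims for speed
    X2 = np.concatenate((X1[1000:], X1[:1000]), axis=0)

    pcs1 = PCA(n_components=20, svd_solver="arpack", random_state=321).fit_transform(X1)
    pcs2 = PCA(n_components=20, svd_solver="arpack", random_state=321).fit_transform(X2)

    # Undo the row swap before comparing; compare magnitudes to ignore sign flips
    pcs2 = np.concatenate((pcs2[1000:], pcs2[:1000]), axis=0)
    print(np.max(np.abs(np.abs(pcs1) - np.abs(pcs2))))  # typically far below the float32 discrepancy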