Tags: python, scikit-learn, svd

Why are scikit-learn TruncatedSVD's explained variance ratios not in descending order?


Why are sklearn.decomposition.TruncatedSVD's explained variance ratios not ordered like its singular values?

My code is below:

import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.array([[1,1,1,1,0,0,0,0,0,0,0,0,0,0],
              [0,0,1,1,1,1,1,1,1,0,0,0,0,0],
              [0,0,0,0,0,0,1,1,1,1,1,1,0,0],
              [0,0,0,0,0,0,0,0,0,0,1,1,1,1]])
svd = TruncatedSVD(n_components=4)
svd.fit(X)
print(svd.explained_variance_ratio_)
print(svd.singular_values_)

and the results:

[0.17693405 0.46600983 0.21738089 0.13967523]
[3.1918354  2.39740372 1.83127499 1.30808033]

I heard that a singular value indicates how much of the data a component can explain, so I expected the explained variance ratios to follow the same order as the singular values. But the ratios are not in descending order.

Why does this happen?


Solution

  • I heard that a singular value indicates how much of the data a component can explain

    This holds for PCA, but it is not exactly true for (truncated) SVD; quoting from a relevant GitHub thread from 2014, back when an explained_variance_ratio_ attribute was not even available for TruncatedSVD (emphasis mine):

    preserving the variance is not the exact objective function of truncated SVD without centering

    So, the singular values themselves are indeed sorted in descending order, but this does not necessarily hold for the corresponding explained variance ratios when the data are not centered.
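
    To see where these numbers come from: as far as I can tell from the scikit-learn source, TruncatedSVD computes explained_variance_ratio_ as the variance of each column of the transformed data divided by the total variance of the raw input. Here is a minimal sketch reproducing the ratios from the question (assuming the X defined there):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    svd = TruncatedSVD(n_components=4)
    X_t = svd.fit_transform(X)               # X_t = U * Sigma, shape (4, 4)

    # per-component variance of the projection / total variance of the raw input
    exp_var = np.var(X_t, axis=0)
    ratio = exp_var / np.var(X, axis=0).sum()

    print(ratio)                              # should match svd.explained_variance_ratio_
    print(svd.singular_values_)               # sorted descending, unlike the ratios

    Because X is not centered, the columns of X_t have nonzero means, so their variances are not simply singular_value**2 / n_samples, and the descending order of the singular values need not carry over to the ratios.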

    But if we do center the data first, the explained variance ratios do come out sorted in descending order, in correspondence with the singular values:

    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import TruncatedSVD
    
    sc = StandardScaler()
    Xs = sc.fit_transform(X) # X data from the question here
    
    svd = TruncatedSVD(n_components=4)
    svd.fit(Xs)
    
    print(svd.explained_variance_ratio_)
    print(svd.singular_values_)
    

    Result:

    [4.60479851e-01 3.77856541e-01 1.61663608e-01 8.13905807e-66]
    [5.07807756e+00 4.59999633e+00 3.00884730e+00 8.21430014e-17]
    

    For the mathematical & computational differences between centered and non-centered data in PCA & SVD calculations, see How does centering make a difference in PCA (for SVD and eigen decomposition)?
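
    Also note that StandardScaler both centers and scales; as far as I can tell, centering alone is what matters here. A minimal sketch with the same X, centered but not scaled:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    Xc = X - X.mean(axis=0)                   # center the columns only, no scaling

    svd_c = TruncatedSVD(n_components=4)
    svd_c.fit(Xc)

    print(svd_c.explained_variance_ratio_)    # should now come out in descending order
    print(svd_c.singular_values_)

    On centered data each projected component has zero mean, so its variance is just singular_value**2 / n_samples, and the descending order of the singular values carries over to the ratios.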


    Regarding the use of TruncatedSVD itself, here is user ogrisel (scikit-learn contributor) in a relevant answer to Difference between scikit-learn implementations of PCA and TruncatedSVD:

    In practice TruncatedSVD is useful on large sparse datasets which cannot be centered without making the memory usage explode.

    So, it's not entirely clear why you have chosen to use TruncatedSVD here; but unless your dataset is so large that centering it would cause memory issues, you should probably just use PCA instead.
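
    For completeness, a minimal sketch of that alternative on the original X from the question; PCA centers the data internally, so the ratios should come out already sorted, consistent with the singular values:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=4)
    pca.fit(X)                                # centering is handled internally

    print(pca.explained_variance_ratio_)      # descending order
    print(pca.singular_values_)               # descending order as well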