scikit-learnsvddimension-reduction

Why scikit-learn truncatedSVD uses 'randomized' algorithm as default?


I used with truncatedSVD with 30000 by 40000 size of term-document matrix to reducing the dimension to 3000 dimension, when using 'randomized', variance ratio is about 0.5 (n_iter=10) when using 'arpack', variance ratio is about 0.9

Variance ratio of 'randomized' algorithm is lower than one of 'arpack'.

So why scikit-learn truncatedSVD uses 'randomized' algorithm as default?


Solution

  • Speed!

    According to the docs, sklearn.decomposition.TruncatedSVD can use a randomized algorithm due to Halko, Martinson, and Tropp (2009). This paper claims that their algorithm is considerably faster.

    For a dense matrix, it runs in O(m*n*log(k)) time, whereas the classical algorithm takes O(m*n*k) time, where m and n are the dimensions of the matrix from which you want the kth largest components. The randomized algorithm is also easier to efficiently parallelize and makes fewer passes over the data.

    Table 7.1 of the paper (on page 45) shows the performance of a few algorithms as a function of matrix size and # of components, and the randomized algorithm is often an order of magnitude faster.

    The accuracy of the output is also claimed to be pretty good (Figure 7.5), though there are some modifications and constants that might affect it and I haven't gone through the sklearn code to see what they did/did not do.