python scikit-learn ram cosine-similarity

Can I force sklearn to use float32 instead of float64?

I am building a product recommender that will use the description of products to find similar products and recommend them. I am using CountVectorizer over the description to find semantically similar descriptions, rank them and suggest those similar.

The problem comes when calculating the cosine similarity matrix. My initial dataframe has 47,046 rows, so I'm running into RAM issues both on my local PC and in my Colab notebook.

Checking the count matrix that CountVectorizer produces, I see that it is output as int64:

<47046x3607 sparse matrix of type '<class 'numpy.int64'>'
    with 699336 stored elements in Compressed Sparse Row format>

There is no issue in casting it to int32 with count_matrix = count_matrix.astype(np.int32), but when running cosine_similarity from sklearn the output is still float64 instead of float32 (I confirmed this by testing with a smaller dataset that can be processed fine).

Is there any way to force the use of float32? Or a way to actually solve the high RAM usage with matrices altogether?

Solution

Is there any way to force the use of float32?

You could cast the input sparse matrix to float32. In my testing, this causes the output array to be float32.
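
For a sense of how much the cast saves, the size of the dense similarity matrix can be estimated from the row count in the question. This is only back-of-the-envelope arithmetic: it counts the output array alone and ignores any temporaries sklearn allocates along the way.

```python
n_rows = 47_046  # number of product descriptions in the question

# A dense n x n similarity matrix stores n * n values of `itemsize` bytes each.
sizes_gb = {}
for name, itemsize in (("float64", 8), ("float32", 4)):
    sizes_gb[name] = n_rows * n_rows * itemsize / 1e9
    print(f"{name}: {sizes_gb[name]:.2f} GB")
```

So float32 halves the footprint, from roughly 17.7 GB to roughly 8.9 GB, which may still be too large for a machine with limited RAM.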

Here's a test program I wrote.

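import scipy.sparse
import scipy.stats
import sklearn.metrics
import numpy as np

# Random integer "counts" in [0, 10) to mimic CountVectorizer output
rvs = scipy.stats.randint(low=0, high=10)

# Same shape as the matrix in the question, stored as int64
A = scipy.sparse.random(47046, 3607, density=0.0005, data_rvs=rvs.rvs, dtype=np.int64)
print("starting dtype", A.dtype)
print("output dtype", sklearn.metrics.pairwise.cosine_similarity(A, A).dtype)

# Cast the input to float32 and check the output dtype again
A = A.astype(np.float32)
print("starting dtype", A.dtype)
print("output dtype", sklearn.metrics.pairwise.cosine_similarity(A, A).dtype)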

Output:

starting dtype int64
output dtype float64
starting dtype float32
output dtype float32
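
On the second part of the question (reducing RAM usage altogether), one option that may be worth trying is the dense_output parameter of cosine_similarity, which keeps the result sparse when both inputs are sparse. The sketch below uses a small random stand-in matrix; its shape and density are illustrative assumptions, not the real data. How much memory this saves depends on how many pairs of descriptions share at least one term, so it is not guaranteed to help on the actual dataset.

```python
import numpy as np
import scipy.sparse
from sklearn.metrics.pairwise import cosine_similarity

# Small stand-in for the count matrix (shape and density are illustrative).
A = scipy.sparse.random(1000, 300, density=0.005, format="csr", dtype=np.float32)

# dense_output=False returns a sparse matrix when the inputs are sparse,
# so pairs of rows with no term in common are not stored at all.
S = cosine_similarity(A, dense_output=False)

print("sparse output:", scipy.sparse.issparse(S))
print("output dtype:", S.dtype)
print("stored elements:", S.nnz, "of", S.shape[0] * S.shape[1])
```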