python machine-learning dimensionality-reduction

Speeding up UMAP


I have a situation similar to one discussed in an old thread, where the number of features was 1.2M (mine is 10M) with only hundreds of observations. Among the metrics I tried, euclidean performed poorly, while cosine and correlation were much better. I also noticed that more than 100% CPU was used only at the very end, and only for a few seconds, even though my system has 256 cores. For the most part a single core was in use, presumably for the metric computation. I would have preferred UMAP itself to scale well, but I tried to address the issue through NumPy (which I believe can use multiple cores for computation).

I tried the following approach:

Original function:

import umap
from sklearn.preprocessing import StandardScaler

metric = 'cosine' # alternatively 'correlation'
scaler = StandardScaler()
ip_std = scaler.fit_transform(ip_mat)

# Start UMAP
reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
umap_embed = reducer.fit_transform(ip_std)

Modified version:

import umap
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine
from umap.umap_ import nearest_neighbors

# Start precomputed_knn

scaler = StandardScaler()
eiip_std = scaler.fit_transform(eiip_mat)

dist_cosine = 1 - pairwise_distances(eiip_std, metric="cosine")
precomputed_knn = nearest_neighbors(dist_cosine, metric="cosine", \
                                      metric_kwds=None, angular=False, \
                                      n_neighbors=n_neighbors, random_state=42)
# Start UMAP
reducer = umap.UMAP(n_components=n_components, precomputed_knn=precomputed_knn)
umap_embed = reducer.fit_transform(eiip_std)

return umap_embed

While I got no errors, the output did not look like the cosine result at all; it matched the default Euclidean one.

Could you please point to the mistake in the above code and suggest any improvements?

Thanks in advance


Solution

  • I think you need to set metric='cosine' on the reducer as well; if it is left out, UMAP uses its default metric, 'euclidean'. The reducer in your code becomes:

    reducer = umap.UMAP(n_components=n_components, metric='cosine', precomputed_knn=precomputed_knn)
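  • A separate thing to check in the question's code: sklearn's cosine distance is already 1 - cosine similarity, so `1 - pairwise_distances(..., metric="cosine")` is a similarity matrix, and `nearest_neighbors` then computes cosine distances between the rows of that matrix rather than between the observations. With only hundreds of observations, one possible alternative is to build the full cosine distance matrix with NumPy (the matrix product is typically backed by a multithreaded BLAS) and pass it to UMAP with metric='precomputed'. The following is only a sketch: the helper names `cosine_distance_matrix` and `embed_with_precomputed_distances` are made up for illustration, and the defaults for `n_components` and `n_neighbors` are placeholders.

```python
import numpy as np


def cosine_distance_matrix(X):
    """Return the (n_samples, n_samples) cosine distance matrix of the rows of X."""
    X = np.asarray(X, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # keep all-zero rows as zeros, avoid division by 0
    Xn = X / norms                     # unit-length rows (this makes a copy of X)
    sim = Xn @ Xn.T                    # cosine similarity via one matrix product
    dist = 1.0 - sim                   # cosine distance = 1 - cosine similarity
    np.clip(dist, 0.0, 2.0, out=dist)  # remove tiny negative values from rounding
    np.fill_diagonal(dist, 0.0)        # each observation is at distance 0 from itself
    return dist


def embed_with_precomputed_distances(dist, n_components=2, n_neighbors=15):
    """Hypothetical helper: run UMAP on a precomputed square distance matrix."""
    import umap  # imported here so the distance helper works without umap-learn

    reducer = umap.UMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,  # must be smaller than the number of observations
        metric="precomputed",     # tells UMAP that the input is a distance matrix
    )
    return reducer.fit_transform(dist)
```

    Usage would be along the lines of `umap_embed = embed_with_precomputed_distances(cosine_distance_matrix(eiip_std))`. A reducer fitted on a precomputed matrix is tied to those observations, so I would not expect it to be usable for transforming new data.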