I have a situation similar to the one discussed in an old thread where the number of features was 1.2M (mine is 10M) but there were only hundreds of observations. Among the metrics I tried, euclidean performed poorly, but cosine and correlation were much better. I also noticed that CPU usage went above 100% only at the very end, and even then for barely a few seconds, although my system has 256 cores. For most of the run only a single core was used, presumably for the metric computation. While I would have preferred UMAP to scale well on its own, I tried addressing the issue through NumPy (which, I believe, can use multiple cores for computation).
I tried the following approach:
Original function:
import umap
from sklearn.preprocessing import StandardScaler

metric = 'cosine'  # alternatively 'correlation'

# Standardize the input matrix
scaler = StandardScaler()
eiip_std = scaler.fit_transform(eiip_mat)

# Run UMAP with the chosen metric
reducer = umap.UMAP(n_components=n_components, n_neighbors=n_neighbors, metric=metric)
umap_embed = reducer.fit_transform(eiip_std)
Modified version:
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances
from umap.umap_ import nearest_neighbors

# Standardize the input matrix
scaler = StandardScaler()
eiip_std = scaler.fit_transform(eiip_mat)

# Precompute the k-nearest-neighbor graph from cosine distances
dist_cosine = 1 - pairwise_distances(eiip_std, metric="cosine")
precomputed_knn = nearest_neighbors(dist_cosine, metric="cosine",
                                    metric_kwds=None, angular=False,
                                    n_neighbors=n_neighbors, random_state=42)

# Run UMAP on the precomputed neighbors
reducer = umap.UMAP(n_components=n_components, precomputed_knn=precomputed_knn)
umap_embed = reducer.fit_transform(eiip_std)
While I got no errors, the output was not based on the cosine metric at all but on the default euclidean.
Could you please point out the mistake in the above code and suggest any improvements? Thanks in advance.
I think you need to set the metric to 'cosine', so that the reducer in your code becomes:
reducer = umap.UMAP(n_components=n_components, metric='cosine', precomputed_knn=precomputed_knn)
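For completeness, here is how the whole modified snippet might look with that change applied (a sketch, keeping the variable names from the question). One caveat worth double-checking: as far as I can tell, umap.umap_.nearest_neighbors expects the data matrix itself rather than a precomputed distance matrix, so I pass the standardized data straight to it here:
import umap
from sklearn.preprocessing import StandardScaler
from umap.umap_ import nearest_neighbors

# eiip_mat, n_neighbors, n_components as defined in the question
scaler = StandardScaler()
eiip_std = scaler.fit_transform(eiip_mat)

# Compute the k-nearest-neighbor graph on the data with the cosine metric
precomputed_knn = nearest_neighbors(eiip_std, n_neighbors=n_neighbors,
                                    metric="cosine", metric_kwds=None,
                                    angular=False, random_state=42)

# Pass the same metric to UMAP so the embedding matches the precomputed knn
reducer = umap.UMAP(n_components=n_components, metric='cosine',
                    precomputed_knn=precomputed_knn)
umap_embed = reducer.fit_transform(eiip_std)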