Tags: python, hdbscan, tsne, clusterize

Clustering issue, can't find good params for HDBSCAN


I made a torch model which says whether two cropped anime face images are similar or not (trained with a contrastive loss on cosine similarity over pairs of faces). I get the embeddings from my model for each image in the test dataset:

import numpy as np
import torch

# assumes a CUDA device when available, CPU otherwise
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def calculate_embeddings(model, loader):
    model.eval()

    embeddings = []
    with torch.no_grad():
        for batch_idx, images in enumerate(loader):
            images = images.to(device)
            batch_embeddings = model(images)
            embeddings.extend(batch_embeddings.cpu().numpy())

            print('STEP %d/%d, Batch size: %d' % (batch_idx + 1, len(loader), images.size(0)), end='\t\r')

    return np.array(embeddings)

embeddings = calculate_embeddings(model, clustering_loader)

Then I normalize them and perform the clustering:

from sklearn.preprocessing import normalize
import hdbscan

# L2-normalization makes Euclidean distance monotonic in cosine distance,
# which matches how the model was trained
norm_embeddings = normalize(embeddings, norm='l2')
clusterer = hdbscan.HDBSCAN(min_cluster_size=6, min_samples=6)
clusters = clusterer.fit_predict(norm_embeddings)

Here is a t-SNE representation of what I want to get (correct clusters based on my dataset): Ideal TSNE

Here is what I get: My TSNE

It would be fine to have some noise points, but not 14226 of them. My dataset contains 18969 cropped faces. Can you suggest parameters for HDBSCAN? I tried increasing min_samples, but then I got one huge cluster and some smaller ones. Or maybe I should use other clusterers? I tried DBSCAN, but it produced even worse results than HDBSCAN. Or is my model not trained enough?
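
For reference, the noise figure can be counted directly from clusters; a minimal sketch (HDBSCAN labels noise points as -1):

import numpy as np

# count clusters and noise points; HDBSCAN labels noise as -1
n_noise = int(np.sum(clusters == -1))
n_clusters = len(np.unique(clusters)) - (1 if n_noise > 0 else 0)
print('clusters: %d, noise points: %d' % (n_clusters, n_noise))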


Solution

  • It seems to be a problem related to fine-tuning the DBSCAN parameters (mainly the "epsilon" radius to look around each point and the "min_samples" threshold to decide whether a point is a core point). DBSCAN is usually hard to fine-tune, as there is no easy-to-follow methodology for doing it. See the sketch of these two knobs just below.
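
    To make the two parameters concrete, here is a minimal sketch of running DBSCAN on the normalized embeddings; the eps and min_samples values are illustrative placeholders, not tuned recommendations:

    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import normalize

    X = normalize(embeddings, norm='l2')

    # eps: neighborhood radius around each point;
    # min_samples: neighbors required for a point to be a core point.
    # Both values below are placeholders to tune, not recommendations.
    db = DBSCAN(eps=0.3, min_samples=6, metric='euclidean')
    labels = db.fit_predict(X)  # -1 marks noise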

    I suggest a workaround with a simple method: find the most suitable number of clusters in your dataset with the elbow method and KMeans.

    
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import normalize


    # create list to hold SSE (inertia) values for each candidate k
    sse = []
    X = normalize(embeddings, norm='l2')
    clusters_range = range(150, 250)

    for n in clusters_range:
        kmeans = KMeans(n_clusters=n, n_init=10, random_state=42)
        kmeans.fit(X)
        sse.append(kmeans.inertia_)

    # visualize results
    plt.plot(clusters_range, sse)
    plt.xticks(clusters_range)
    plt.xlabel("Number of Clusters")
    plt.ylabel("SSE")
    plt.show()
    

    You should see that the elbow (the optimal number of clusters) falls near 237, as you mentioned.
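
    Once you have located the elbow, refit KMeans with that k to get the final assignments; a short sketch, assuming the elbow does land at 237 and reusing X from above:

    # refit with the chosen number of clusters and keep the labels
    final_kmeans = KMeans(n_clusters=237, n_init=10, random_state=42)
    final_labels = final_kmeans.fit_predict(X)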

    This question presents some good examples too: Finding the optimal number of clusters using the elbow method and K-Means clustering

    From my point of view, if the elbow falls near 237, say somewhere between 230 and 240, your embedding worked well, as the vectors group similarly to the original data. I would also inspect some clusters to confirm the similarity between their elements.
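
    For that inspection, a minimal sketch that lists the members of one cluster so you can eyeball the corresponding face crops; it assumes final_labels from the snippet above, and cluster_id is an arbitrary choice:

    import numpy as np

    cluster_id = 0  # hypothetical: pick any cluster to inspect
    members = np.where(final_labels == cluster_id)[0]
    print('cluster %d has %d faces, e.g. indices:' % (cluster_id, len(members)), members[:10])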