python scikit-learn hierarchical-clustering

sklearn agglomerative clustering input data


I have a similarity matrix between four users, and I want to do agglomerative clustering on it. The code is like this:

import time
import numpy as np
from sklearn.cluster import AgglomerativeClustering

lena = np.matrix('1 1 0 0;1 1 0 0;0 0 1 0.2;0 0 0.2 1')
X = np.reshape(lena, (-1, 1))

print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 3  # number of regions

ward = AgglomerativeClustering(n_clusters=n_clusters,
                               linkage='complete').fit(X)
print(ward)
label = np.reshape(ward.labels_, lena.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
print(label)

The printed result of label looks like this:

[[1 1 0 0]
 [1 1 0 0]
 [0 0 1 2]
 [0 0 2 1]]

Does this mean it gives a list of possible cluster results, and we can choose one of them, like [0, 0, 2, 1]? If that is wrong, could you tell me how to run the agglomerative algorithm based on a similarity matrix? If it's right, my similarity matrix is huge, so how can I choose the optimal clustering result from such a huge list? Thanks


Solution

  • I think the problem here is that you fit your model with the wrong data.

    # This will return a 4x4 matrix (similarity matrix)
    lena = np.matrix('1 1 0 0;1 1 0 0;0 0 1 0.2;0 0 0.2 1')
    
    # However, this will return a 16x1 matrix
    X = np.reshape(lena, (-1, 1))
    

    The result you actually get is this:

     ward.labels_
     >> array([1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 2, 0, 0, 2, 1])
    

    This is the label of each element of the X vector, which doesn't make sense for your problem.
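
    To answer the original question more directly: if you want to stay with agglomerative clustering, you can pass the algorithm a precomputed distance matrix instead of reshaping the similarities. Below is a minimal sketch, assuming the similarities lie in [0, 1] so that 1 - similarity can be used as a distance (older scikit-learn versions use affinity='precomputed' instead of metric='precomputed'):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    similarity = np.array([[1, 1, 0, 0],
                           [1, 1, 0, 0],
                           [0, 0, 1, 0.2],
                           [0, 0, 0.2, 1]])

    # Turn the similarity matrix into a distance matrix
    # (assumes similarities are in [0, 1])
    distance = 1 - similarity

    # metric='precomputed' tells the estimator that the input is already a
    # distance matrix; 'ward' linkage is not allowed with precomputed
    # distances, so keep linkage='complete'
    model = AgglomerativeClustering(n_clusters=3,
                                    metric='precomputed',
                                    linkage='complete')
    labels = model.fit_predict(distance)
    print(labels)  # one label per user, e.g. [0 0 1 2] (numbering may vary)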

    If I understood your problem correctly, you want to cluster your users by the distance (similarity) between them. In that case, I would suggest using spectral clustering, like this:

    import numpy as np
    from sklearn.cluster import SpectralClustering
    
    lena = np.matrix('1 1 0 0;1 1 0 0;0 0 1 0.2;0 0 0.2 1')
    
    n_clusters = 3
    SpectralClustering(n_clusters).fit_predict(lena)
    
    >> array([1, 1, 0, 2], dtype=int32)
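
    Note that, by default, SpectralClustering uses affinity='rbf', i.e. it treats the rows of the input as raw feature vectors and builds a new affinity matrix from them. Since lena already is a similarity (affinity) matrix, a slightly more direct variant (a sketch using the same data as above) is to pass it as a precomputed affinity:

    import numpy as np
    from sklearn.cluster import SpectralClustering

    similarity = np.array([[1, 1, 0, 0],
                           [1, 1, 0, 0],
                           [0, 0, 1, 0.2],
                           [0, 0, 0.2, 1]])

    # affinity='precomputed' makes the estimator use this matrix directly
    labels = SpectralClustering(n_clusters=3,
                                affinity='precomputed').fit_predict(similarity)
    print(labels)  # one cluster label per user; exact numbering may vary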