Tags: python, scikit-learn, hierarchical-clustering, cosine-similarity, distance-matrix

Hierarchical clustering with a precomputed cosine similarity matrix using scikit-learn produces an error


We want to use cosine similarity with hierarchical clustering and we have cosine similarities already calculated. In the sklearn.cluster.AgglomerativeClustering documentation it says:

A distance matrix (instead of a similarity matrix) is needed as input for the fit method.

So, we converted cosine similarities to distances as

distance = 1 - similarity
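As a minimal sketch of that conversion (the similarity values here are made up for illustration and are not our real data), the step looks roughly like this with NumPy. The clipping and diagonal reset are an extra precaution I am assuming is useful against floating-point rounding, not something the documentation asks for:

```python
import numpy as np

# Hypothetical 3x3 cosine similarity matrix (illustrative values only).
similarity = np.array([
    [1.0, 0.7, 0.6],
    [0.7, 1.0, 0.3],
    [0.6, 0.3, 1.0],
])

# Convert similarity to distance: identical items get distance 0.
distance = 1.0 - similarity

# Precaution (an assumption, not a documented requirement): remove tiny
# negative values and force an exactly zero diagonal after rounding.
distance = np.clip(distance, 0.0, None)
np.fill_diagonal(distance, 0.0)

print(distance)
```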

Our Python code raises an error at the fit() call at the end. (I am not showing the real value of X in the code, since it is very large.) X is just a cosine similarity matrix with its values converted to distances as written above. (Notice that the diagonal is all 0.) Here is the code:

import pandas as pd
import numpy as np 
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0, 0.3, 0.4], [0.3, 0, 0.7], [0.4, 0.7, 0]])

cluster = AgglomerativeClustering(affinity='precomputed')  
cluster.fit(X)

The error is:

runfile('/Users/stackoverflowuser/Desktop/4.2/Pr/untitled0.py', wdir='/Users/stackoverflowuser/Desktop/4.2/Pr')
Traceback (most recent call last):

  File "<ipython-input-1-b8b98765b168>", line 1, in <module>
    runfile('/Users/stackoverflowuser/Desktop/4.2/Pr/untitled0.py', wdir='/Users/stackoverflowuser/Desktop/4.2/Pr')

  File "/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 704, in runfile
    execfile(filename, namespace)

  File "/anaconda2/lib/python2.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 100, in execfile
    builtins.execfile(filename, *where)

  File "/Users/stackoverflowuser/Desktop/4.2/Pr/untitled0.py", line 84, in <module>
    cluster.fit(X)

  File "/anaconda2/lib/python2.7/site-packages/sklearn/cluster/hierarchical.py", line 795, in fit
    (self.affinity, ))

ValueError: precomputed was provided as affinity. Ward can only work with euclidean distances.

Is there anything else I can provide? Thanks in advance.


Solution

  • According to sklearn's documentation:

    If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix (instead of a similarity matrix) is needed as input for the fit method.

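    The code in the question does not set linkage, so it falls back to the default, "ward", which is why fit() rejects the precomputed matrix. Set linkage explicitly to one of "complete", "average" or "single".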

    Answer taken from: https://datascience.stackexchange.com/questions/51970/hierarchical-clustering-with-precomputed-cosine-similarity-matrix-using-scikit-l/
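
A minimal sketch of that fix, reusing the small matrix from the question. The choice of linkage="average" and n_clusters=2 is only an assumption for illustration. Newer scikit-learn releases appear to have renamed the affinity parameter to metric, so the sketch checks which name the installed version accepts instead of hard-coding one:

```python
import inspect

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Small precomputed cosine distance matrix (1 - similarity), zero diagonal.
X = np.array([
    [0.0, 0.3, 0.4],
    [0.3, 0.0, 0.7],
    [0.4, 0.7, 0.0],
])

# Older scikit-learn versions call this parameter `affinity`, newer ones
# `metric`. Use whichever name the installed constructor exposes.
params = inspect.signature(AgglomerativeClustering.__init__).parameters
distance_kwarg = "metric" if "metric" in params else "affinity"

cluster = AgglomerativeClustering(
    n_clusters=2,        # illustrative choice
    linkage="average",   # any linkage other than the default "ward"
    **{distance_kwarg: "precomputed"},
)

# fit_predict returns one cluster label per row of X.
labels = cluster.fit_predict(X)
print(labels)
```

With this matrix, items 0 and 1 are the closest pair (distance 0.3), so they should share a label while item 2 gets the other one. Which of "complete", "average" or "single" suits real data depends on the cluster shapes you expect.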