Tags: python, cluster-analysis, distance, doc2vec, hdbscan

What is the appropriate distance metric when clustering paragraph/doc2vec vectors?


My intent is to cluster document vectors from doc2vec using HDBSCAN. I want to find tiny clusters of semantic and textual duplicates.

To do this I am using gensim to generate document vectors. The elements of the resulting docvecs are all in the range [-1,1].

To compare two documents I want to compare the angular similarity. I do this by calculating the cosine similarity of the vectors, which works fine.

But to cluster the documents, HDBSCAN requires a distance matrix, not a similarity matrix. The native conversion from cosine similarity to cosine distance in sklearn is 1 - similarity. However, it is my understanding that this formula can break the triangle inequality, preventing it from being a true distance metric. Looking at other people's code for similar tasks, most seem to be using sklearn.metrics.pairwise.pairwise_distances(data, metric='cosine'), which defines cosine distance as 1 - similarity anyway, and it appears to produce reasonable results.
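For reference, a minimal sketch of this approach, using made-up toy vectors in place of real doc2vec output (the hdbscan call is shown in a comment, since the library may not be installed):

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# toy document vectors standing in for gensim doc2vec output (hypothetical data)
rng = np.random.default_rng(0)
docvecs = rng.normal(size=(6, 50))

# cosine distance = 1 - cosine similarity; values lie in [0, 2]
dist = pairwise_distances(docvecs, metric='cosine').astype(np.float64)

# HDBSCAN accepts a precomputed distance matrix:
#   import hdbscan
#   clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
#   labels = clusterer.fit_predict(dist)
```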

I am wondering whether this is correct, or whether I should instead use angular distance, calculated as np.arccos(cosine similarity) / pi. I have also seen people use Euclidean distance on L2-normalized document vectors; for nearest-neighbor ranking this seems to be equivalent to cosine distance.
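The angular-distance conversion I mean can be sketched like this (toy data again; the np.clip guards against floating-point similarity values that land just outside [-1, 1]):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy document vectors (hypothetical data)
rng = np.random.default_rng(1)
docvecs = rng.normal(size=(5, 20))

sim = cosine_similarity(docvecs)

# angular distance: arccos of similarity, scaled to [0, 1]; this satisfies
# the triangle inequality, unlike 1 - similarity
angular = np.arccos(np.clip(sim, -1.0, 1.0)) / np.pi
```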

What is the most appropriate method for calculating distances between document vectors for clustering? :)


Solution

  • I believe in practice cosine-distance is used, despite the fact that there are corner-cases where it's not a proper metric.

    You mention that "elements of the resulting docvecs are all in the range [-1,1]". That isn't usually guaranteed to be the case – though it would be if you've already unit-normalized all the raw doc-vectors.

    If you have done that unit-normalization, or want to, then after such normalization euclidean-distance will always give the same ranked-order of nearest-neighbors as cosine-distance. The absolute values, and relative proportions between them, will vary a little – but all "X is closer to Y than Z" tests will be identical to those based on cosine-distance. So clustering quality should be nearly identical to using cosine-distance directly.
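A quick numerical check of that equivalence, with made-up vectors: for unit vectors, ||u - v||^2 = 2 - 2*cos(u, v), so Euclidean distance is a monotone function of cosine distance and the nearest-neighbor orderings coincide.

```python
import numpy as np

# toy vectors (hypothetical data), unit-normalized as the answer assumes
rng = np.random.default_rng(2)
vecs = rng.normal(size=(8, 30))
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

cos_dist = 1.0 - unit @ unit.T
eucl = np.linalg.norm(unit[:, None, :] - unit[None, :, :], axis=-1)

# monotone relation: euclidean = sqrt(2 * cosine distance), so the ranked
# order of neighbors for any query row is identical under both distances
order_cos = np.argsort(cos_dist[0])
order_euc = np.argsort(eucl[0])
```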