machine-learning scikit-learn k-means euclidean-distance wmd

In the K-Means clustering algorithm (sklearn), how to override Euclidean distance with some other distance


I have a set of documents and I want to group the related ones. Currently I'm using Google's news vector file (GoogleNews-vectors-negative300.bin) to get the word vectors, and I use the WMD (Word Mover's Distance) algorithm to compute the distance between two documents. Now I want to integrate this with K-means clustering. Basically, I want to override the distance calculation function in KMeans. How can I do that? Any suggestions are most welcome. Thanks in advance.


Solution

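  • Although it is possible in theory to implement k-means with other distance measures, it is not advised: the algorithm may no longer converge. The centroid update takes the mean of each cluster, which minimizes squared Euclidean distance but not an arbitrary distance, so the assignment and update steps can stop working toward the same objective. A more detailed discussion can be found e.g. on StackExchange. That's why scikit-learn's KMeans does not support other distance metrics.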

    I'd suggest using e.g. hierarchical clustering, where you can plug in an arbitrary distance function.
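
As a rough sketch of the hierarchical approach, assuming SciPy is available: compute all pairwise distances with your own function, then hand the resulting matrix to `scipy.cluster.hierarchy.linkage`. The `toy_distance` function and the sample documents below are hypothetical stand-ins (a simple Jaccard distance on word sets), used only so the example runs without the word-vector file; in practice you would pass your WMD function instead.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def pairwise_distance_matrix(items, distance):
    """Build a symmetric distance matrix from an arbitrary callable."""
    n = len(items)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i, j] = matrix[j, i] = distance(items[i], items[j])
    return matrix


def toy_distance(doc_a, doc_b):
    """Hypothetical stand-in for WMD: Jaccard distance between word sets."""
    a, b = set(doc_a.split()), set(doc_b.split())
    return 1.0 - len(a & b) / len(a | b)


# Made-up sample documents, for illustration only.
docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stocks fell sharply on monday",
    "stocks rose sharply on monday",
]

dist = pairwise_distance_matrix(docs, toy_distance)

# linkage expects the condensed (upper-triangular) form of the matrix.
# "average" linkage works with any distance; "ward" assumes Euclidean.
tree = linkage(squareform(dist), method="average")

# Cut the tree so that at most two clusters remain.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Computing the full matrix costs one distance call per pair of documents, which can be slow for WMD on a large collection, so it may be worth caching the matrix once it has been built.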