scikit-learn, label, cluster-analysis, mean-shift

Are the label outputs of clustering algorithms ordered in a particular way? (Python, scikit-learn)


I'm using mean shift clustering (https://scikit-learn.org/stable/modules/clustering.html#mean-shift), in which the cluster labels are obtained from this source: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html

However, it's not clear how the cluster labels (0, 1, ...) are generated. Apparently, label 0 is the cluster with the most elements. Is this a general rule?

How do the other algorithms work? Is the labeling "random" in some sense, or do the underlying algorithms detect the largest cluster and assign it label 0?

Thanks!

PS: It's easy to reorder the labels according to this rule; my question is more theoretical.


Solution

  • In many cases, the cluster order depends on the initialization. If you provide initial values (for example, initial centers), their order is preserved in the labels.

    If you do not provide initial values, the order is usually based on the data order. For example, the first item is likely to belong to the first cluster (noise points aside in some algorithms, such as DBSCAN).

    Cluster size has an interesting effect here: assuming your data is randomly ordered (and not, for example, ordered by some synthetic data generation process), the first element is more likely to belong to the largest cluster, so that cluster is most likely to come first even when the order is "random".

    In sklearn's mean shift (which, in my opinion, contains an error in the final assignment rule), the authors apparently decided to sort the clusters by "intensity", but I don't remember any such rule in the original papers. See https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/cluster/mean_shift_.py#L222
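
To make the points about initialization and cluster size concrete, here is a minimal sketch. It uses KMeans rather than mean shift, because KMeans accepts explicit initial centers. The toy data, the variable names, and the `relabel_by_size` helper are assumptions made for illustration and are not part of scikit-learn. It shows that label 0 follows the first initial center even when that cluster is the smaller one, and one possible way to renumber labels by size afterwards.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data (assumed): 20 points near (0, 0) and 80 points near (10, 10).
small = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
large = rng.normal(loc=10.0, scale=0.5, size=(80, 2))
X = np.vstack([small, large])

# Explicit initial centers: label 0 corresponds to the first row of `init`,
# so the small cluster gets label 0 regardless of its size.
init = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = KMeans(n_clusters=2, init=init, n_init=1).fit_predict(X)


def relabel_by_size(labels):
    """Hypothetical helper: renumber clusters so 0 is the largest.

    Negative labels (e.g. DBSCAN noise, -1) are left as -1.
    Ties keep their original relative order (stable sort).
    """
    labels = np.asarray(labels)
    clusters, counts = np.unique(labels[labels >= 0], return_counts=True)
    order = clusters[np.argsort(-counts, kind="stable")]
    out = np.full_like(labels, -1)
    for new, old in enumerate(order):
        out[labels == old] = new
    return out


relabeled = relabel_by_size(labels)
print(np.bincount(labels))     # cluster sizes in the original label order
print(np.bincount(relabeled))  # cluster sizes after sorting by size
```

If you need a particular label order, it is safer to apply such a post-processing step than to rely on whatever order a given algorithm or library version happens to produce.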