r validation cluster-analysis label ccr

How can I match the cluster labels from different methods to the actual labels in R?


Basically, I simulate thousands of data sets and then cluster them with different clustering techniques, such as k-means and model-based clustering.

Then I validate each method's performance using the correct classification rate (CCR). However, I run into the label-switching problem, so I can't get a realistic CCR. So my question is: is there a way to unify the labels in R for multivariate data sets?

Here is a simple example:

  # Create the random data sets:

  data1 <- rnorm(5, 0, 0.5) # cluster 1

  data2 <- rnorm(5, 2, 0.5) # cluster 2

  data3 <- rnorm(5, 4, 0.5) # cluster 3

  alldata <- c(data1, data2, data3)

  # cluster the data using different methods:

  library(cluster)  # provides pam()

  km.method <- kmeans(alldata, centers = 3)$cluster
  # [1] 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2

  pam.method <- pam(alldata, 3)$clustering
  # [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3


  # As you can see, the two partitions are identical, but the labels differ!
  # How can I unify the labels across all methods so they match the true labels?

Solution

  • CCR is not an appropriate measure for clustering.

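    Clustering algorithms do not predict classes; they only assign arbitrary cluster IDs. Taken literally, the CCR is therefore 0 by definition, because no cluster ID is ever a class.

    Consider the Iris data set. The correct classes are the species. Clustering algorithms such as k-means produce arbitrary integer "labels" such as 1, 2, 3. None of these is a species name, and which number a cluster receives can change from run to run.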

    The proper way to evaluate a clustering is to use a cluster evaluation measure, such as the adjusted Rand index or normalized mutual information. These measure how well the two partitions overlap as sets of points, not whether the individual label values agree, so they are unaffected by label switching.
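
To illustrate the point above, here is a minimal sketch in base R. It reuses the two label vectors printed in the question and assumes the true labels are 1, 2, 3 in blocks of five, matching how the data were simulated. The helper `adjusted_rand_index` is a hypothetical name written here for illustration, not part of any package; existing implementations such as `adjustedRandIndex()` in the mclust package should give the same values.

```r
# Label vectors copied from the question's output; `truth` is an assumption
# based on the order in which the three groups were simulated.
truth      <- rep(1:3, each = 5)
km_labels  <- c(3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
pam_labels <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)

# Naive agreement on raw labels depends on the arbitrary numbering.
naive_km  <- mean(km_labels == truth)
naive_pam <- mean(pam_labels == truth)

# Hypothetical helper: adjusted Rand index computed from the contingency table.
adjusted_rand_index <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)                            # co-occurrence counts
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))              # pairs grouped together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))     # pairs grouped together in a
  sum_cols  <- sum(choose(colSums(tab), 2))     # pairs grouped together in b
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  if (max_index == expected) return(1)          # degenerate case: trivial partitions
  (sum_cells - expected) / (max_index - expected)
}

ari_km  <- adjusted_rand_index(truth, km_labels)
ari_pam <- adjusted_rand_index(truth, pam_labels)

print(c(naive_km = naive_km, naive_pam = naive_pam))
print(c(ari_km = ari_km, ari_pam = ari_pam))
```

With these vectors the naive agreement differs between the two methods even though they found the same partition, while the adjusted Rand index is the same for both. `table(truth, km_labels)` is also a quick way to see which cluster ID corresponds to which true group.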