rcluster-analysisunsupervised-learningdbscanhdbscan

HBSCAN membership probability


I'm working on a comparison between clustring algorithms and I want to know how HDBSCAN in R calculate the so called the membership 'probability' ?


Solution

  • In the dbscan package, the hdbscan() function does some validity checking of the object passed as input, and then calculates a distance matrix to its k nearest neighbors using the dbscan::kNNdist() function. The value of k is set to the argument minPts that is passed to the dbscan() function less 1.

     core_dist <- kNNdist(x, k = minPts - 1)
    

    It then uses core distance as the measure of density and calculates membership probabilities using the following algorithm (from the hdbscan.R source ):

      ## Generate membership 'probabilities' using core distance as the measure of density
      prob <- rep(0, length(cl))
      for (cid in sl){
        ccl <- res[[as.character(cid)]]
        max_f <- max(core_dist[which(cl == cid)])
        pr <- (max_f - core_dist[which(cl == cid)])/max_f
        prob[cl == cid] <- pr
      }
    

    For each cluster id in the salient clusters object sl, the algorithm calculates the maximum core distance, and then builds probabilities by subtracting each element's distance from the maximum distance, dividing the result by the maximum distance to convert it a proportion.

    These coverage probabilities are then inserted into the list that is output by the hdbscan() function as the membership_prob object.