machine-learning, cluster-analysis, dbscan

Best way to validate DBSCAN Clusters


I have used the ELKI implementation of DBSCAN to identify fire hot spot clusters from a fire data set, and the results look quite good. The data set is spatial, and the clusters are based on latitude and longitude. Essentially, the DBSCAN parameters pick out regions with a high concentration of fire points (defined by density); these are the fire hot spot regions.
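
For context, this kind of latitude/longitude clustering can be sketched roughly as follows with scikit-learn's DBSCAN and a haversine metric, as a stand-in for the ELKI run described above; the file name, column names, and the eps/min_samples values are placeholders, not the ones actually used in the question:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical input: a CSV with 'lat' and 'lon' columns (placeholder names).
fires = pd.read_csv("fires.csv")
coords = np.radians(fires[["lat", "lon"]].to_numpy())  # haversine expects radians

# eps is an angular distance: e.g. roughly 5 km on Earth (radius ~6371 km).
eps_km = 5.0
db = DBSCAN(eps=eps_km / 6371.0, min_samples=20,
            metric="haversine", algorithm="ball_tree").fit(coords)

labels = db.labels_  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"{n_clusters} hot spot clusters, {np.sum(labels == -1)} noise points")
```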

My question is, after experimenting with several different parameters and finding a pair that gives a reasonable clustering result, how does one validate the clusters?

Is there a suitable formal validation method for my use case? Or is this subjective depending on the application domain?


Solution

ELKI contains a number of evaluation functions for clusterings.

Use the -evaluator parameter to enable the ones from the evaluation.clustering.internal package.

Some of them will not run automatically because they have quadratic runtime cost - probably more than your clustering algorithm itself.

I do not trust these measures. They are designed for particular clustering algorithms and are mostly useful for choosing the k parameter of k-means, not much more than that. If you blindly optimize for these measures, you will end up with useless results most of the time. Also, these measures do not handle noise well, with either of the strategies we tried.
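
As an illustration outside ELKI, here is a rough sketch of how such an internal measure behaves on DBSCAN output, using scikit-learn's silhouette score on toy data; the noise points (label -1) have to be dealt with explicitly, which is part of the problem described above, and all parameter values are placeholders:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data standing in for the fire coordinates.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)  # placeholder parameters

# Internal measures such as the silhouette are not defined for noise points,
# so one common (but questionable) workaround is to drop them first.
mask = labels != -1
if len(set(labels[mask])) >= 2:
    print("silhouette (noise removed):", silhouette_score(X[mask], labels[mask]))
else:
    print("fewer than 2 clusters after removing noise; silhouette undefined")
```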

The cheapest are the label-based evaluators. These run automatically, but apparently your data does not have labels (or they are numeric, in which case you need to set the -parser.labelindex parameter accordingly). Personally, I prefer the Adjusted Rand Index for comparing the similarity of two clusterings. All of these indexes are sensitive to noise, so they do not work well with DBSCAN unless your reference clustering has the same concept of noise as DBSCAN.
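
Again as a sketch outside ELKI: if a reference labeling is available, the Adjusted Rand Index is straightforward to compute, but note that DBSCAN's noise label is scored as if it were just another cluster, which is why the reference needs a compatible notion of noise. The label arrays below are made up:

```python
from sklearn.metrics import adjusted_rand_score

# Made-up example labelings: -1 is DBSCAN's noise label.
reference = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
dbscan    = [0, 0, -1, 1, 1, 1, -1, 2, 2, 2]

# ARI treats -1 as an ordinary cluster id, i.e. all noise points
# are scored as if they formed one extra cluster of their own.
print(adjusted_rand_score(reference, dbscan))
```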

If you can afford it, a "subjective" evaluation is always best.

You want to solve a problem, not optimize a number. That is the whole point of "data science": being problem oriented and solving the problem, not obsessing over minimizing some arbitrary quality number. If the results do not work in reality, you have failed.