cluster-analysisdbscanelki

sample_weight option in the ELKI implementation of DBSCAN


My goal is to find outliers in a dataset that contains many near-duplicate points and I want to use ELKI implementation of DBSCAN for this task.

As I don't care about the clusters themselves just the outliers (which I assume are relatively far from the clusters), I want to speed up the runtime by aggregating/binning points on a grid and using the concept implemented in scikit-learn as sample_weight.

Can you please show minimum code to do similar analysis in ELKI?

Let's assume my dataset contains two columns of features (aggregated/binned points' coordinates on the x-y grid) and third column of sample_weights sample_weight_feature (number of original dataset points in the neighbourhood of the aggregated/binned points). In scikit-learn the answer I expect would be -- call function fit in the following way: fit(self, features, y=None, sample_weight=sample_weight_feature)


Solution

  • This is currently not implemented in ELKI, although it can easily be added by means of the GeneralizedDBSCAN class. Instead of counting the neighbors, you'd sum their weights.

    For this you need to modify the CorePredicate of GeneralizedDBSCAN to obtain a "WeightedCorePredicate". As long as you instantiate the objects from Java (and pass the relations directly to the classes) this should be fairly simple - you simply pass the relation of weights when instantiating your "WeightedCorePredicate". It only becomes difficult once you try to make it all available by command line to specify the input format, and how it selects the right relations and columns.

    It's not trivial though to make this command-line and minigui-usable, as you will need a second relation for the weights. From Java code, its fairly easy to do once you have understood the concept of using relations instead of arrays for everything. Roughly, for every neighbor you add the weights from the weight relation and compare it to a threshold instead of comparing the count to the "minpts" integer.

    Since this has recently been requested by another user, I would appreciate a pull request to contribute this to ELKI.

    As for the goal of outlier detection, I suggest to rather use a method designed for outlier detection instead. For example Local Outlier Factor, or even the simple k-nearest-neighbor detectors should work fine, and can be faster than DBSCAN. I am not convinced that your approach yields a lot of benefits - with the help of index structures, DBSCAN usually is quite fast; and likely your de-duplication approach is already as expensive as DBSCAN with a similar grid-based data index.