machine-learningcluster-analysisdata-miningelkioptics-algorithm

Obtaining cluster boundaries of ELKI OPTICSXi


I have a one-dimensional data set for which the histogram plot shows multiple local maxima, so I know that there are multiple regions in my one-dimensional space where the data is more dense. I want to determing boundaries for these dense regions that allow me to classify the dense region / cluster that a certain data point is in. For this I am using OPTICS, because it should be able to better deal with the different densities between the clusters compared to DBSCAN.

I am using ELKI (version 0.6.0) in Java code (I know it is disrecommended by the ELKI team to embed ELKI in Java, but I need to repeat my workflow for many datasets and therefore its better to automate this in my case). Code snippet below prints indices of the start and end items of the clusters. The ELKI documentation on OPTICSModel does not clearly define what these index numbers correspond to, but I assume these are the indices of the start and end data items in the augmented cluster-ordering of the database (like the ClusterOrderResult object that OPTICS.run()-created), as opposed to indices of the start and end data items of the database itself (unordered).

ListParameterization opticsParams = new ListParameterization();
opticsParams.addParameter(OPTICSXi.XI_ID, 0.01);
opticsParams.addParameter(OPTICS.MINPTS_ID, 100);
OPTICSXi<DoubleDistance> optics = ClassGenericsUtil.parameterizeOrAbort(OPTICSXi.class, opticsParams);

ArrayAdapterDatabaseConnection arrayAdapterDatabaseConnection = new ArrayAdapterDatabaseConnection(myListOfOneDimensionalFeatureVectors.toArray(new double[myListOfOneDimensionalFeatureVectors.size()][2]));
ListParameterization dbParams = new ListParameterization();
dbParams.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
dbParams.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
dbParams.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, arrayAdapterDatabaseConnection);

Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, dbParams);
db.initialize();

result = optics.run(db);
List<Cluster<OPTICSModel>> clusters = result.getAllClusters();
    for(Cluster<OPTICSModel> cluster : clusters){
        if(!cluster.isNoise())
            System.out.println(cluster.getModel().getStartIndex() + ", "+ cluster.getModel().getEndIndex() +";  ");
    }

Now I want to know where in my one-dimensional space my clusters start and end. Therefore I would like to retrieve the data items corresponding to the start and end indices that my code above already obtains. I assume that I would need a ClusterOrderResult-object for that from which I could then retrieve the obtained indices. In the documentation however it seems like it is not possible to retrieve such a thing from the Clustering result object that I obtained by calling optics.run(). As there seemed to be no way of obtaining this ordered databased, I naively tried obtaining the indices from my original input dataset instead by replacing the println in the code above with the println below:

System.out.println(myListOfOneDimensionalFeatureVectors.get(cluster.getModel().getStartIndex())[0] + ", "+ myListOfOneDimensionalFeatureVectors.get(cluster.getModel().getEndIndex())[0] +";  ";

As I allready expected however, the indices do not seem to belong to the original input file, as this regularly prints end boundaries with lower values in my one dimensional space than the end boundaries. Does anybode know any way to obtain the original 1-dimensional data values that correspond to the start and end indices found with OPTICS clustering? I want to use these values later in my code.


Solution

  • For purpose of automation, it does work very well to call ELKI from the command line. That is my preferred way, because this way each run is nicely isolated in its own JVM.

    You would then have easy access to this data from the output files.

    Why are you using an old version of ELKI? 0.6.5 versions are much nicer because of the removed generics. Although I've switched to the github version now.

    If you want direct access to the ClusterOrder object, it's attached to the clustering object as a child result. You should be able to get it using

    ClusterOrder clusterOrder = ResultUtil.filterResults(clustering, ClusterOrder.class).get(0);
    

    and its object ids via:

    ArrayDBIDs ids = DBIDUtil.ensureArray(clusterOrder.getDBIDs());
    

    (The ensureArray is overhead, but it's a noop then anyway - it's a cast-or-convert operation, and here it will be a cast; at least in my ELKI version the ids are always stored as ArrayDBIDs)

    Array iterators (DBIDArrayIter it = ids.iter()) can be moved to a position via seek(offset). So you should be able to use something like

    DBIDArrayIter it = ids.iter();
    NumberVector vec = relation.get(it.seek(model.getStartIndex()));
    

    The iterators in ELKI are odd for Java APIs, but very fast if you use a single iterator for all your accesses.

    So much for your ELKI question part. However, from a statistical point of view it does not make sense to use OPTICS on 1-dimensional data. On one-dimensional data, use proper kernel density estimation instead. OPTICS is a rough and crude method that makes sense when your data is too complicated to model using proper statistical tools. OPTICS uses a very primitive kernel density, and the xi method is a very naive extraction of clusters from the density plot... at least on one-dimensional data, statistics offers stronger tools. ELKI has an implementation called KNNKernelDensityMinimaClustering, but I have not used it yet. But kernel density estimation should be available in any statistical toolkit, so I would give this class a try.