cluster-analysisjavadbscanelki

How to add index in Database using ELKI java API for Custom POJO with String type fields


I am using DBSCAN to cluster some categorical data using a POJO. My class looks like this

public class Dimension {
    private String app;
    private String node;
    private String cluster;
 .............

All my fields are String instead of integer or Float because they are discrete/categorical value. Rest of my code is as follows.

    final SimpleTypeInformation<Dimension> dimensionTypeInformation = new SimpleTypeInformation<>(Dimension.class);
    PrimitiveDistanceFunction<Dimension> dimensionPrimitiveDistanceFunction = new PrimitiveDistanceFunction<Dimension>() {
        public double distance(Dimension d1, Dimension d2) {
            return simpleMatchingCoefficient(d1, d2);
        }
        public SimpleTypeInformation<? super Dimension> getInputTypeRestriction() {
            return dimensionTypeInformation;
        }
        public boolean isSymmetric() {
            return true;
        }
        public boolean isMetric() {
            return true;
        }
        public <T extends Dimension> DistanceQuery<T> instantiate(Relation<T> relation) {
            return new PrimitiveDistanceQuery<>(relation, this);
        }
    };
    DatabaseConnection dbc = new DimensionDatabaseConnection(dimensionList);
    Database db = new StaticArrayDatabase(dbc, null);
    db.initialize();
    DBSCAN<Dimension> dbscan = new DBSCAN<>(dimensionPrimitiveDistanceFunction, 0.6, 20);
    Result result = dbscan.run(db);

Now as expected this code works fine for small dataset but gets very very slow when my dataset gets bigger. So I want to add an index to speed up the process. But all the index that I could think of require me to implement NumberVector. But my class has only Strings not number. What index can I use in this case ? can I use the distance function, double simpleMatchingCoefficient(Dimension d1, Dimension d2) to create an IndexFactory ?

Thanks in advance.


Solution

  • There are (at least) three broad families of indexes:

    1. Coordinate based indexes, such as the k-d-tree and R-tree. These work well on dense, continuous variables
    2. Metric indexes, that require the distance function to satisfy the triangle inequality. These can work on any kind of data, but may still need a fairly smooth distribution of distance values (e.g., they will not help with the discrete metric, that is 0 of x=y and 1 otherwise).
    3. Inverted lookup indexes. They are mostly used for text search, and exploit that for each attribute only a small subset of the data is relevant. These work well for high-cardinality discrete attributes.

    In your case, I'd consider an inverted index. If you have a lot of attributes, a metric index may work, but I doubt that holds, because you use POJOs with strings to store your data.

    And of course, profile your code and check if you can improve the implementation of your distance function! E.g. string interning may help, it can reduce matching time of strings to equality testing rather than comparing each character...