I am using DBSCAN to cluster some categorical data using a POJO. My class looks like this
public class Dimension {
private String app;
private String node;
private String cluster;
.............
All my fields are String instead of integer or Float because they are discrete/categorical value. Rest of my code is as follows.
final SimpleTypeInformation<Dimension> dimensionTypeInformation = new SimpleTypeInformation<>(Dimension.class);
PrimitiveDistanceFunction<Dimension> dimensionPrimitiveDistanceFunction = new PrimitiveDistanceFunction<Dimension>() {
public double distance(Dimension d1, Dimension d2) {
return simpleMatchingCoefficient(d1, d2);
}
public SimpleTypeInformation<? super Dimension> getInputTypeRestriction() {
return dimensionTypeInformation;
}
public boolean isSymmetric() {
return true;
}
public boolean isMetric() {
return true;
}
public <T extends Dimension> DistanceQuery<T> instantiate(Relation<T> relation) {
return new PrimitiveDistanceQuery<>(relation, this);
}
};
DatabaseConnection dbc = new DimensionDatabaseConnection(dimensionList);
Database db = new StaticArrayDatabase(dbc, null);
db.initialize();
DBSCAN<Dimension> dbscan = new DBSCAN<>(dimensionPrimitiveDistanceFunction, 0.6, 20);
Result result = dbscan.run(db);
Now as expected this code works fine for small dataset but gets very very slow when my dataset gets bigger. So I want to add an index to speed up the process. But all the index that I could think of require me to implement NumberVector. But my class has only Strings not number. What index can I use in this case ? can I use the distance function, double simpleMatchingCoefficient(Dimension d1, Dimension d2) to create an IndexFactory ?
Thanks in advance.
There are (at least) three broad families of indexes:
In your case, I'd consider an inverted index. If you have a lot of attributes, a metric index may work, but I doubt that holds, because you use POJOs with strings to store your data.
And of course, profile your code and check if you can improve the implementation of your distance function! E.g. string interning may help, it can reduce matching time of strings to equality testing rather than comparing each character...