I have 2 questions about the following code:
How do I get the 'points' in each cluster of the result?
How can the result contain 3 clusters, one of which has size 0?
import java.util.List;

import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN;
import de.lmu.ifi.dbs.elki.data.Cluster;
import de.lmu.ifi.dbs.elki.data.Clustering;
import de.lmu.ifi.dbs.elki.data.DoubleVector;
import de.lmu.ifi.dbs.elki.data.model.Model;
import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase;
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection;
import de.lmu.ifi.dbs.elki.datasource.DatabaseConnection;
import de.lmu.ifi.dbs.elki.distance.distancefunction.geo.LatLngDistanceFunction;
import de.lmu.ifi.dbs.elki.math.geodesy.WGS84SpheroidEarthModel;

/**
 *
 * @author Paul Z. Wu Jan 14, 2018
 */
public class DBScan {
    public static void main(String[] args) {
        // Four (latitude, longitude) pairs
        final double[][] data = new double[][]{{48.774332, -78.532054}, {40.774032, -73.531154},
            {40.774232, -73.531084}, {48.774332, -78.531054}};
        DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
        DBSCAN<DoubleVector> scan = new DBSCAN<>(new LatLngDistanceFunction(WGS84SpheroidEarthModel.STATIC), 2000, 1);
        StaticArrayDatabase db = new StaticArrayDatabase(dbc, null);
        db.initialize();
        Clustering<Model> c = scan.run(db);
        System.out.println(c.getAllClusters().isEmpty());
        List<Cluster<Model>> list = c.getAllClusters();
        for (Cluster<Model> cl : list) {
            System.out.println("size=" + cl.size());
            System.out.println("...." + cl.getIDs() + "..." + cl.getModel() + " ");
            // How to get the original 'points' in this cluster? One of them should
            // contain {40.774032, -73.531154},{40.774232, -73.531084}
        }
    }
}
See tutorial.javaapi.PassingDataToELKI, line 73:
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
and see tutorial.javaapi.PassingDataToELKI, lines 102-104:
// clu is one cluster of the clustering result
for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
  // To get the vector use:
  NumberVector v = rel.get(it);
}
ELKI uses a "tidy data" architecture. Most algorithms expect a database relation (think: column, or table) of vectors, often with a fixed dimensionality (= a vector field). This is not unlike column stores, although there is nothing to be gained from compression on continuous dense data. For geodata, you could even specify this to have exactly 2 dimensions.
Labels would be stored in a second relation/table/column.
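To make the relation idea concrete, here is a minimal plain-Java sketch of the concept. It deliberately does not use the ELKI API: the class `RelationSketch`, the methods `relationOf` and `pointsOf`, the integer IDs, and the label strings are all hypothetical stand-ins for illustration. The point it shows is that a cluster only carries object IDs, and the vectors (or labels) are looked up in a separate relation keyed by those IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RelationSketch {
    /** Hypothetical stand-in for a relation: one "column" of values keyed by object ID. */
    public static <T> Map<Integer, T> relationOf(T[] values) {
        Map<Integer, T> rel = new LinkedHashMap<>();
        for (int id = 0; id < values.length; id++) {
            rel.put(id, values[id]);
        }
        return rel;
    }

    /** Resolve a cluster (just a list of IDs) to the vectors those IDs refer to. */
    public static List<double[]> pointsOf(List<Integer> clusterIds, Map<Integer, double[]> vectors) {
        List<double[]> points = new ArrayList<>();
        for (int id : clusterIds) {
            points.add(vectors.get(id));
        }
        return points;
    }

    public static void main(String[] args) {
        double[][] data = {{48.774332, -78.532054}, {40.774032, -73.531154},
            {40.774232, -73.531084}, {48.774332, -78.531054}};
        // First relation: the vectors.
        Map<Integer, double[]> vectors = relationOf(data);
        // Second relation: labels, stored separately but sharing the same IDs (made-up values).
        Map<Integer, String> labels = relationOf(new String[] {"p0", "p1", "p2", "p3"});

        // IDs picked by hand for illustration; this is not the output of DBSCAN.
        List<Integer> cluster = Arrays.asList(1, 2);
        List<double[]> points = pointsOf(cluster, vectors);
        for (int i = 0; i < cluster.size(); i++) {
            System.out.println(labels.get(cluster.get(i)) + " " + Arrays.toString(points.get(i)));
        }
    }
}
```

In ELKI itself the same lookup is done with the relation obtained from the database and the ID iterator of the cluster, as in the tutorial lines quoted above.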
Also see the GeoIndexing example to scale DBSCAN to larger data sets. I have used OPTICS on 23 million geo coordinates, but it takes a while to run, obviously (not days, though). I recommend enabling progress logging for large data sets, which even tries to estimate the remaining time.