dbscan elki

ELKI DBSCAN Ignore Columns


I have a CSV file with three columns: "lat, lon, item1". I have been able to load the data using the following code:

ListParameterization params = new ListParameterization();
// Filter that assigns fixed object IDs, starting at 1
List<ObjectFilter> filterlist = new ArrayList<>();
filterlist.add(new FixedDBIDsFilter(1));
// Parser that turns the numeric columns of each line into a DoubleVector
NumberVectorLabelParser<DoubleVector> parser = new NumberVectorLabelParser<>(DoubleVector.FACTORY);
FileBasedDatabaseConnection dbc = new FileBasedDatabaseConnection(filterlist, parser, is);
params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbc);
// Build the database from the connection and load the data
Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
db.initialize();

I have also run DBSCAN, retrieved the number of clusters, and can pull the data from the clusters.

ListParameterization params1 = new ListParameterization();
params1.addParameter(DBSCAN.Parameterizer.EPSILON_ID, 0.05);
params1.addParameter(DBSCAN.Parameterizer.MINPTS_ID, 2);
DBSCAN<DoubleVector> dbscan = ClassGenericsUtil.parameterizeOrAbort(DBSCAN.class, params1);
Clustering<Model> result = dbscan.run(db);

I can see that DBSCAN uses all three columns for the scan, because when I load only two columns (lat, lon) I get a different number of clusters.
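To illustrate the effect, here is a minimal sketch in plain Java (not ELKI code). The class name, the coordinates, and the item1 values are made up for illustration. It shows how two points that are within epsilon = 0.05 of each other in lat/lon fall outside that radius once the third column is included in a Euclidean distance:

```java
// Hypothetical illustration only; none of these names come from ELKI.
final class ExtraColumnSketch {
    // Euclidean distance over the first `dims` components of a and b.
    static double euclidean(double[] a, double[] b, int dims) {
        double sum = 0.0;
        for (int i = 0; i < dims; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up rows in the order lat, lon, item1.
        double[] p = {48.10, 11.50, 3.0};
        double[] q = {48.12, 11.52, 7.0};
        double eps = 0.05;

        // lat/lon only: distance is about 0.028, so the points are neighbors.
        System.out.println(euclidean(p, q, 2) <= eps); // prints "true"
        // All three columns: item1 dominates (distance is about 4.0).
        System.out.println(euclidean(p, q, 3) <= eps); // prints "false"
    }
}
```

With the third column included, the epsilon neighborhoods shrink or vanish, which is consistent with getting a different number of clusters.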

I would like to keep all columns in my database for later access but cluster only on the lat/lon columns. I believe that I need something to mark the other columns so that they are not used, but I cannot find the correct way to do it. I thought the following would work, but it did not:

params.addParameter(NumberVectorLabelParser.Parameterizer.LABEL_INDICES_ID, 2);

Can someone help me with this?


Solution

  • You need to pass this parameter to the NumberVectorLabelParser itself, via its long[] labelIndices parameter. This is currently a bit mask rather than an array of column indices: bit i marks column i as a label, so for column index 2 you want new long[]{4L} (that is, 1L << 2).

    You are currently passing the parameter to the database, which doesn't have this parameter.

    Alternatively, you could use DimensionSelectingLatLngDistanceFunction, since you shouldn't use Euclidean distance on latitude and longitude anyway.
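The bit mask mentioned above can be built with the standard library instead of being computed by hand. The following is a minimal sketch. The class and method names are hypothetical and not part of ELKI, and it assumes the mask uses the usual layout of 64 columns per long word with bit i standing for column i, which matches the 4L value given for column index 2.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical helper, not part of ELKI.
final class LabelMaskSketch {
    // Returns a long[] bit mask with one bit set per given column index
    // (assumed layout: bit i of word i / 64 stands for column i).
    static long[] labelMask(int... columns) {
        BitSet bits = new BitSet();
        for (int c : columns) {
            bits.set(c);
        }
        return bits.toLongArray();
    }

    public static void main(String[] args) {
        // Column index 2 (item1) should be treated as a label.
        System.out.println(Arrays.toString(labelMask(2))); // prints "[4]"
    }
}
```

The resulting array would then be passed as the parser's labelIndices argument. The exact constructor signature depends on the ELKI version, so check the Javadoc of the release you are using.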