Is it required that DBSCAN and its index have the same distance function? If it is not, what are the cases when it is needed to use different distance functions?
Scala code how I create DBSCAN and index:
import de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN
import de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN
import de.lmu.ifi.dbs.elki.data.model.Model
import de.lmu.ifi.dbs.elki.data.{Clustering, DoubleVector, NumberVector}
import de.lmu.ifi.dbs.elki.database.{Database, StaticArrayDatabase}
import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection
import de.lmu.ifi.dbs.elki.distance.distancefunction.NumberVectorDistanceFunction
import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.SquaredEuclideanDistanceFunction
import de.lmu.ifi.dbs.elki.index.tree.metrical.covertree.SimplifiedCoverTree
def createDatabase(data: Array[Array[Double]], distanceFunction: NumberVectorDistanceFunction[NumberVector]): Database = {
val indexFactory = new SimplifiedCoverTree.Factory[NumberVector](distanceFunction, 1.3, 20)
// Create a database
val db = new StaticArrayDatabase(new ArrayAdapterDatabaseConnection(data), java.util.Arrays.asList(indexFactory))
// Load the data into the database
db.initialize()
db
}
def dbscanClustering(data: Array[Array[Double]], distanceFunction: NumberVectorDistanceFunction[NumberVector]): Unit = {
// Use the same `distanceFunction` for the database and DBSCAN <- is it required??
val db = createDatabase(data, distanceFunction)
val dbscan = new DBSCAN[DoubleVector](distanceFunction, 0.01, 20)
val result: Clustering[Model] = dbscan.run(db)
println(s"Number of clusters: ${result.getAllClusters.size()}")
result.getAllClusters.asScala.zipWithIndex.foreach { case (cluster, idx) =>
println(s"# $idx: ${cluster.getNameAutomatic}")
println(s"Size: ${cluster.size()}")
println(s"Model: ${cluster.getModel}")
}
val inputData: Array[Array[Double]] = ???
dbscanClustering(inputData, SquaredEuclideanDistanceFunction)
The index can only be used for acceleration if it uses the same distance function. Some indexes can support multiple (but not arbitrary) distances, for example the R*-tree can support all spatial distance functions (to a varying success though).
Obviously if you build an index to accelerate Cosine distance, but you ask for Euclidean nearest neighbors, the index cannot and will not be used.
You do not need to use an index, but without your runtime will be O(n²); with an index it can be much faster (depending on parameters, dimensionality, etc. - in the worst case, the index is overhead).