Here I can see that the class clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN exists, but when I try to invoke it, I get an error:
java -cp elki.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication -algorithm clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN -algorithm.distancefunction EuclideanDistanceFunction -dbc.in infile.txt -dbscan.epsilon 1.0 -dbscan.minpts 1 -verbose -out OUTFOLDER
Class 'clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN' not found for given value. Must be a subclass / implementation of de.lmu.ifi.dbs.elki.algorithm.Algorithm
And this class is indeed absent from the list of available classes that was printed with the error message:
-> clustering.CanopyPreClustering
-> clustering.DBSCAN
-> clustering.affinitypropagation.AffinityPropagationClusteringAlgorithm
-> clustering.em.EM
-> clustering.gdbscan.GeneralizedDBSCAN
-> clustering.gdbscan.LSDBC
-> clustering.GriDBSCAN
-> clustering.hierarchical.extraction.HDBSCANHierarchyExtraction
-> clustering.hierarchical.extraction.SimplifiedHierarchyExtraction
-> clustering.hierarchical.extraction.ExtractFlatClusteringFromHierarchy
-> clustering.hierarchical.SLINK
-> clustering.hierarchical.AnderbergHierarchicalClustering
-> clustering.hierarchical.AGNES
-> clustering.hierarchical.CLINK
-> clustering.hierarchical.SLINKHDBSCANLinearMemory
-> clustering.hierarchical.HDBSCANLinearMemory
-> clustering.kmeans.KMeansSort
-> clustering.kmeans.KMeansCompare
-> clustering.kmeans.KMeansHamerly
-> clustering.kmeans.KMeansElkan
-> clustering.kmeans.KMeansLloyd
-> clustering.kmeans.parallel.ParallelLloydKMeans
-> clustering.kmeans.KMeansMacQueen
-> clustering.kmeans.KMediansLloyd
-> clustering.kmeans.KMedoidsPAM
-> clustering.kmeans.KMedoidsEM
-> clustering.kmeans.CLARA
-> clustering.kmeans.BestOfMultipleKMeans
-> clustering.kmeans.KMeansBisecting
-> clustering.kmeans.KMeansBatchedLloyd
-> clustering.kmeans.KMeansHybridLloydMacQueen
-> clustering.kmeans.SingleAssignmentKMeans
-> clustering.kmeans.XMeans
-> clustering.NaiveMeanShiftClustering
-> clustering.optics.DeLiClu
-> clustering.optics.OPTICSXi
-> clustering.optics.OPTICSHeap
-> clustering.optics.OPTICSList
-> clustering.optics.FastOPTICS
-> clustering.SNNClustering
-> clustering.biclustering.ChengAndChurch
-> clustering.correlation.CASH
-> clustering.correlation.COPAC
-> clustering.correlation.ERiC
-> clustering.correlation.FourC
-> clustering.correlation.HiCO
-> clustering.correlation.LMCLUS
-> clustering.correlation.ORCLUS
-> clustering.onedimensional.KNNKernelDensityMinimaClustering
-> clustering.subspace.CLIQUE
-> clustering.subspace.DiSH
-> clustering.subspace.DOC
-> clustering.subspace.HiSC
-> clustering.subspace.P3C
-> clustering.subspace.PreDeCon
-> clustering.subspace.PROCLUS
-> clustering.subspace.SUBCLU
-> clustering.meta.ExternalClustering
-> clustering.trivial.ByLabelClustering
-> clustering.trivial.ByLabelHierarchicalClustering
-> clustering.trivial.ByModelClustering
-> clustering.trivial.TrivialAllInOne
-> clustering.trivial.TrivialAllNoise
-> clustering.trivial.ByLabelOrAllInOneClustering
-> clustering.uncertain.FDBSCAN
-> clustering.uncertain.CKMeans
-> clustering.uncertain.UKMeans
-> clustering.uncertain.RepresentativeUncertainClustering
-> clustering.uncertain.CenterOfMassMetaClustering
I thought that perhaps this class is internal and is invoked by clustering.gdbscan.GeneralizedDBSCAN, but that runs on a single core for me. Maybe I need to add some command-line parameter to enable multiprocessing?
EDIT: Thanks to @erich-schubert, I can now see the time estimate. I used an M-tree index, as shown in the docs:
java -Xmx32000M -cp elki-bundle-0.7.2-SNAPSHOT.jar de.lmu.ifi.dbs.elki.application.KDDCLIApplication -algorithm clustering.gdbscan.parallel.ParallelGeneralizedDBSCAN -db.index tree.metrical.mtreevariants.mtree.MTreeFactory -treeindex.pagesize 4096 -mtree.distancefunction EuclideanDistanceFunction -algorithm.distancefunction EuclideanDistanceFunction -dbc.in dump_txt.txt -dbscan.epsilon 1.0 -dbscan.minpts 1 -verbose -out RES
I got a warning about an ignored parameter:
following parameters were not processed: [-treeindex.pagesize, 4096]
and a rather discouraging time estimate that keeps growing:
de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.load: 553728 ms
Relation does not have a dimensionality -- simulating M-tree as external index!
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.directory.capacity: 200
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.directory.minfill: 0
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.leaf.capacity: 333
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.leaf.minfill: 0
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.construction: 806160 ms
Index statistics before running algorithms:
de.lmu.ifi.dbs.elki.persistent.MemoryPageFile.reads: 22344677
de.lmu.ifi.dbs.elki.persistent.MemoryPageFile.writes: 3831053
de.lmu.ifi.dbs.elki.persistent.MemoryPageFile.numpages: 17472
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.mtree.MTreeIndex.height: 2
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.AbstractMTree$Statistics.distancecalcs: 1773733054
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.AbstractMTree$Statistics.knnqueries: 0
de.lmu.ifi.dbs.elki.index.tree.metrical.mtreevariants.AbstractMTree$Statistics.rangequeries: 0
DBSCAN clustering: 708 [ 0%] 33738 min remaining
My data is 3.5M 300-dimensional word2vec vectors (float). Can I optimize this somehow so that it runs in a reasonable time?
I use -dbscan.minpts 1 because all I have determined so far is the distance between vectors that corresponds to similarity.
EDIT2: An R-tree index is a bit faster:
DBSCAN clustering: 4423 [ 0%] 17248 min remaining
The parallel DBSCAN version is not in the 0.7.1 release, so you need to compile it yourself.
It currently does not include progress logging, and it is a rather naive parallelization. It works okay if the majority of the time is spent in neighbor search, because the neighbor searches run in parallel while the cluster labeling is synchronized. (If all your cores are loaded, the synchronization overhead should be fine.)
I just pushed a change that adds progress logging to Parallel GDBSCAN.
Make sure to add an index. For most data sets, indexes yield considerable speedups. With an index, however, the rather poor parallelization of this implementation will surface, and you will see more and more threads waiting for synchronization.
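To illustrate the kind of parallelization described above, here is a minimal sketch in plain Java. It is not ELKI's code: the class name ParallelDbscanSketch and all of its methods are hypothetical, and a linear scan stands in for an index. The neighbor searches run on parallel threads without any locking, and only the cluster labeling (a union-find merge) happens inside a synchronized block. The faster the neighbor search becomes, the larger the share of time threads spend waiting for that lock.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Illustrative sketch only; this is NOT ELKI's ParallelGeneralizedDBSCAN.
class ParallelDbscanSketch {
    static final int NOISE = -1;

    static int[] cluster(double[][] data, double eps, int minPts) {
        int n = data.length;
        int[] parent = IntStream.range(0, n).toArray(); // union-find forest
        boolean[] core = new boolean[n];
        int[] claimedBy = new int[n]; // first core point that reached a non-core point
        Arrays.fill(claimedBy, -1);
        Object lock = new Object();

        IntStream.range(0, n).parallel().forEach(i -> {
            // Expensive part: runs concurrently, no synchronization needed.
            int[] neighbors = rangeQuery(data, i, eps);
            if (neighbors.length < minPts) {
                return; // not a core point (the neighborhood includes i itself)
            }
            // Cheap part: labeling is serialized behind a single lock.
            synchronized (lock) {
                core[i] = true;
                for (int q : neighbors) {
                    if (core[q]) {
                        union(parent, i, q); // core-core edge joins clusters
                    } else if (claimedBy[q] < 0) {
                        claimedBy[q] = i; // border candidate, first claim wins
                    }
                }
            }
        });

        // Sequential pass: turn union-find roots into consecutive cluster ids.
        int[] labels = new int[n];
        int[] idOfRoot = new int[n];
        Arrays.fill(idOfRoot, -1);
        int next = 0;
        for (int i = 0; i < n; i++) {
            int rep = core[i] ? i : claimedBy[i];
            if (rep < 0) {
                labels[i] = NOISE;
                continue;
            }
            int root = find(parent, rep);
            if (idOfRoot[root] < 0) {
                idOfRoot[root] = next++;
            }
            labels[i] = idOfRoot[root];
        }
        return labels;
    }

    // Linear-scan epsilon range query; an index would replace this step.
    static int[] rangeQuery(double[][] data, int i, double eps) {
        double eps2 = eps * eps;
        return IntStream.range(0, data.length)
                .filter(j -> squaredDistance(data[i], data[j]) <= eps2)
                .toArray();
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    static void union(int[] parent, int a, int b) {
        int ra = find(parent, a);
        int rb = find(parent, b);
        if (ra != rb) {
            parent[rb] = ra;
        }
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0.5, 0}, {1, 0}, {10, 10}, {10.5, 10}, {50, 50}};
        System.out.println(Arrays.toString(cluster(data, 0.6, 2)));
    }
}
```

With minPts set to 1, as in the question, every point is a core point, so the result is simply the connected components of the graph that links points within epsilon of each other.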