I'm confused about the difference between the following parameters in HDBSCAN
Correct me if I'm wrong.
For min_samples, if it is set to 7, then clusters formed need to have 7 or more points.
For cluster_selection_epsilon, if it is set to 0.5 meters, then any clusters that are more than 0.5 meters apart will not be merged into one, meaning that each cluster will only include points that are 0.5 meters apart or less.
How is that different from min_cluster_size?
They technically do two different things.
min_samples = the minimum number of neighbours to a core point. The higher this is, the more points are going to be discarded as noise/outliers. This comes from the DBSCAN part of HDBSCAN.
min_cluster_size = the minimum size a final cluster can be. The higher this is, the bigger your clusters will be. This comes from the H (hierarchical) part of HDBSCAN.
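To make the min_samples definition a bit more concrete, here is a minimal sketch of the "core distance" idea: the distance from a point to its min_samples-th nearest neighbour. The function name core_distances and the toy points are made up for illustration, and this sketch does not count a point as its own neighbour (implementations differ on that convention), so treat it as a rough picture rather than what any particular library does internally.

```python
import math


def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest *other* point."""
    result = []
    for i, p in enumerate(points):
        # Sorted distances from p to every other point in the data set.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[min_samples - 1])
    return result


# Hypothetical data: five tightly packed points plus one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (5.0, 5.0),
]

for k in (1, 3, 5):
    rounded = [round(d, 3) for d in core_distances(points, k)]
    print(f"min_samples={k}: core distances = {rounded}")
```

The isolated point has a large core distance for every value of min_samples. Once min_samples reaches 5, even the packed points have to reach out to the isolated one to find enough neighbours, so they look sparse too. That is the sense in which a higher min_samples pushes more points towards being treated as noise.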
Increasing min_samples will increase the size of the clusters, but it does so by discarding data as outliers using DBSCAN.
Increasing min_cluster_size while keeping min_samples small, by comparison, keeps those outliers but instead merges any smaller clusters with their most similar neighbour until all clusters are above min_cluster_size.
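So:
A small min_samples and a small min_cluster_size: many small clusters, with few points discarded as noise.
A small min_samples and a large min_cluster_size: fewer, larger clusters, with the outliers kept and smaller clusters merged into their most similar neighbour.
A large min_samples and a large min_cluster_size: fewer, larger clusters, with more points discarded as noise.
(It's not possible to use min_samples larger than min_cluster_size, afaik)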