I have a geographic dataset, 'coordinates', in UTM coordinates that I am clustering with HDBSCAN, and I would like sklearn's GridSearchCV to evaluate various parameters using DBCV. While manually evaluating parameters for HDBSCAN, I got the following result, which is better than what sklearn's GridSearchCV finds:
clusters = hdbscan.HDBSCAN(min_cluster_size=75, min_samples=60,
                           cluster_selection_method='eom', gen_min_span_tree=True,
                           prediction_data=True).fit(coordinates)
Obtained DBCV Score: 0.2580606238793024
When using sklearn's GridSearchCV it chooses model parameters that obtain a lower DBCV value, even though the manually chosen parameters are in the dictionary of parameters. As an aside, while playing around with the RandomizedSearchCV I was able to obtain a DBCV value of 0.28 using a different range of parameters, but didn't write down what parameters were utilized.
Update: When I run RandomizedSearchCV and GridSearchCV, the 'best' model chosen is always the first item in the parameter grid or the first random sample. For example, in the code below, it always picks the first entries in min_samples and min_cluster_size. I suspect this is because scoring fails for every candidate. When I add error_score="raise", it raises a TypeError, which is likely related to the scorer expecting a y to compare against, but this is unsupervised clustering with no data labels.
TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import hdbscan
from sklearn.metrics import make_scorer
import logging # to further silence deprecation warnings
logging.captureWarnings(True)
# ### GridSearch CV Model Tuning ###
hdb = hdbscan.HDBSCAN(gen_min_span_tree=True).fit(coordinates)
# # specify parameters to sample from
grid = {'min_samples': [50, 55, 60, 65, 70, 75, 80, 90, 100, 110],
        'min_cluster_size': [40, 45, 50, 55, 60, 65, 75, 80, 85, 90, 95, 100],
        'cluster_selection_method': ['eom', 'leaf'],
        'metric': ['euclidean', 'manhattan']
        }
# validity_scorer = "hdbscan__hdbscan___HDBSCAN__validity_index"
validity_scorer = make_scorer(hdbscan.validity.validity_index, greater_is_better=True)
grid_search = GridSearchCV(hdb,
                           param_grid=grid,
                           scoring=validity_scorer)
grid_search.fit(coordinates)
print(f"Best Parameters {grid_search.best_params_}")
print(f"DBCV score :{grid_search.best_estimator_.relative_validity_}")
Best Parameters {'cluster_selection_method': 'eom', 'metric': 'euclidean', 'min_cluster_size': 40, 'min_samples': 50}
DBCV score :0.22213170637127946
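The TypeError above is consistent with how make_scorer works: it wraps a metric of the form metric(y_true, y_pred), so the resulting scorer needs labels, and none are available when the search is fitted with X only. scikit-learn also accepts a plain callable with the signature scorer(estimator, X, y=None) for scoring. The following is a minimal sketch of such a callable that reads the relative_validity_ attribute of a fitted HDBSCAN model. The names dbcv_scorer and FittedStub are hypothetical and introduced only for illustration; the stub stands in for a fitted model so the snippet runs without the dataset.

```python
import math


def dbcv_scorer(estimator, X, y=None):
    """Score a fitted clusterer by its own DBCV approximation.

    Uses the (estimator, X, y=None) scorer signature, so no y_true is needed.
    X and y are ignored: relative_validity_ describes the data the estimator
    was fitted on, not the data passed to the scorer.
    """
    score = getattr(estimator, "relative_validity_", None)
    if score is None or math.isnan(score):
        # Assumption: treat a missing or undefined score as the worst case,
        # since DBCV lies in [-1, 1].
        return -1.0
    return float(score)


class FittedStub:
    """Hypothetical stand-in for a fitted hdbscan.HDBSCAN(gen_min_span_tree=True)."""

    def __init__(self, relative_validity):
        self.relative_validity_ = relative_validity


# Demonstration on the stub; a real run would pass a fitted HDBSCAN model.
print(dbcv_scorer(FittedStub(0.258), X=None))
```

With hdbscan and scikit-learn available, the callable would be passed as scoring=dbcv_scorer to GridSearchCV or RandomizedSearchCV, with an unfitted hdbscan.HDBSCAN(gen_min_span_tree=True) as the estimator. One caveat, stated as an expectation rather than a verified result: with the default 5-fold split, each candidate is fitted on a training fold, so the score is the DBCV of that fold and may rank parameters differently from a fit on the full dataset.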
# Naive grid search implementation by Mueller and Guido, Introduction to Machine Learning with Python
best_score = 0
for min_cluster_size in [40,45,120,50,55,130,140,150,155,160]:
    for min_samples in [40,45,50,85,55,60,90,100,110,115,120]:
        for cluster_selection_method in ['eom','leaf']:
            for metric in ['euclidean']:
                # for each combination of parameters of hdbscan
                hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,min_samples=min_samples,
                                      cluster_selection_method=cluster_selection_method, metric=metric,
                                      gen_min_span_tree=True).fit(coordinates)
                # DBCV score
                score = hdb.relative_validity_
                # if we got a better DBCV, store it and the parameters
                if score > best_score:
                    best_score = score
                    best_parameters = {'min_cluster_size': min_cluster_size,
                                       ' min_samples': min_samples, 'cluster_selection_method': cluster_selection_method,
                                       'metric': metric}
print("Best DBCV score: {:.3f}".format(best_score))
print("Best parameters: {}".format(best_parameters))
Outputs:
Best DBCV score: 0.414
Best parameters: {'min_cluster_size': 150, ' min_samples': 90, 'cluster_selection_method': 'eom', 'metric': 'euclidean'}
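The nested loops of the naive search can also be written more compactly with itertools.product. The sketch below is one possible way to do that, not the book's implementation. The names naive_grid_search, fit_and_score and toy_score are hypothetical, and the toy scoring function only stands in for an HDBSCAN fit so the snippet runs on its own. It starts from negative infinity rather than 0, because DBCV can be negative and a start value of 0 would leave best_parameters undefined if no candidate scored above it.

```python
from itertools import product


def naive_grid_search(param_grid, fit_and_score):
    """Try every combination in param_grid and keep the best-scoring one.

    fit_and_score is a hypothetical callable: it receives the parameters as
    keyword arguments and returns a score where higher is better.
    """
    names = list(param_grid)
    best_score = float("-inf")  # DBCV can be negative, so do not start at 0
    best_parameters = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(**params)
        # Keep the first combination that reaches the highest score.
        if score > best_score:
            best_score, best_parameters = score, params
    return best_score, best_parameters


def toy_score(min_cluster_size, min_samples):
    """Illustration only: an arbitrary function that peaks at (20, 15)."""
    return -abs(min_cluster_size - 20) - abs(min_samples - 15)


toy_grid = {"min_cluster_size": [10, 20, 30], "min_samples": [5, 15, 25]}
best_score, best_parameters = naive_grid_search(toy_grid, toy_score)
print(best_score, best_parameters)
```

For the real search, fit_and_score would fit hdbscan.HDBSCAN(gen_min_span_tree=True, **params) on coordinates and return its relative_validity_.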