scikit-learn cluster-analysis hdbscan

Lower DBCV Scores for Cluster Analysis using Sklearn's GridSearchCV


I have a geographic dataset 'coordinates' in UTM coordinates. I am running HDBSCAN on it and would like sklearn's GridSearchCV to evaluate a range of parameters using DBCV. Tuning the HDBSCAN parameters by hand gave the following result, which is better than anything GridSearchCV selects:

    clusters = hdbscan.HDBSCAN(min_cluster_size=75, min_samples=60,
                               cluster_selection_method='eom', gen_min_span_tree=True,
                               prediction_data=True).fit(coordinates)

Obtained DBCV Score:  0.2580606238793024

GridSearchCV chooses model parameters that give a lower DBCV value, even though the manually chosen parameters are in the parameter dictionary. As an aside, while experimenting with RandomizedSearchCV I obtained a DBCV value of 0.28 using a different range of parameters, but I didn't write down which parameters produced it.

Update: When I run RandomizedSearchCV or GridSearchCV, the 'best' model chosen is always the first item in the parameter grid or the first random sample drawn. For example, in the code below, it always picks the first entries in min_samples and min_cluster_size. I suspect this is because every candidate hits an error. When I add error_score="raise", it raises a TypeError, which is probably because the scorer expects a y to compare against, but this is unsupervised clustering with no data labels.

TypeError: _BaseScorer.__call__() missing 1 required positional argument: 'y_true'

    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.model_selection import GridSearchCV
    import hdbscan
    from sklearn.metrics import make_scorer
    import logging
    logging.captureWarnings(True)  # to further silence deprecation warnings

    # ### GridSearch CV Model Tuning ###
    hdb = hdbscan.HDBSCAN(gen_min_span_tree=True).fit(coordinates)
    
    # # specify parameters to sample from
    grid = {'min_samples': [50,55,60,65,70,75,80,90,100,110],
                  'min_cluster_size':[40,45,50,55,60,65,75,80,85,90,95,100],  
                  'cluster_selection_method' : ['eom','leaf'],
                  'metric' : ['euclidean','manhattan'] 
                 }
    # validity_scorer = "hdbscan__hdbscan___HDBSCAN__validity_index"
    validity_scorer = make_scorer(hdbscan.validity.validity_index,greater_is_better=True)
    
    grid_search = GridSearchCV(hdb
                               ,param_grid=grid
                               ,scoring=validity_scorer)
    
    grid_search.fit(coordinates)
    
    
    print(f"Best Parameters {grid_search.best_params_}")
    print(f"DBCV score :{grid_search.best_estimator_.relative_validity_}")
Best Parameters {'cluster_selection_method': 'eom', 'metric': 'euclidean', 'min_cluster_size': 40, 'min_samples': 50}
DBCV score :0.22213170637127946
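
The TypeError is consistent with how make_scorer works: it wraps a metric of the form metric(y_true, y_pred), and the resulting scorer needs a y to be passed to fit. With no labels, GridSearchCV calls the scorer as scorer(estimator, X), so the call fails. When error_score is left at its default, every candidate presumably ends up with a NaN score, which would explain why the first grid entry is always reported as best. A minimal sketch of one possible workaround follows, assuming a plain callable scorer that reads relative_validity_ from the fitted model and a single "fold" that covers the whole dataset. The names dbcv_scorer and search_hdbscan are hypothetical, and the sketch has not been run against this dataset or a specific library version.

```python
def dbcv_scorer(estimator, X, y=None):
    """Score a fitted HDBSCAN model by its relative validity (approximate DBCV).

    GridSearchCV calls a scorer as scorer(estimator, X) when fit receives no y,
    so y must be optional. X is ignored here: the score describes the data the
    estimator was fitted on, not the data passed to the scorer.
    """
    return float(estimator.relative_validity_)


def search_hdbscan(coordinates, param_grid):
    """Grid-search HDBSCAN on the full dataset, with no train/test split."""
    # Imported inside the function so dbcv_scorer can be used on its own.
    import numpy as np
    import hdbscan
    from sklearn.model_selection import GridSearchCV

    # One split whose train and test indices are both "every row", so each
    # candidate is fitted and scored on the complete dataset.
    all_rows = np.arange(len(coordinates))
    search = GridSearchCV(
        hdbscan.HDBSCAN(gen_min_span_tree=True),  # needed for relative_validity_
        param_grid=param_grid,
        scoring=dbcv_scorer,
        cv=[(all_rows, all_rows)],
        error_score="raise",  # surface failures instead of silently scoring NaN
    )
    return search.fit(coordinates)
```

With this, search_hdbscan(coordinates, grid).best_params_ should correspond to the candidate with the highest relative_validity_. Note that relative_validity_ is the fast approximation already printed in the question, not the full hdbscan.validity.validity_index computation.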

Solution

  • A naive grid search, adapted from Mueller and Guido, Introduction to Machine Learning with Python:

    import hdbscan

    # DBCV can be negative, so start below any attainable score
    best_score = float("-inf")
    best_parameters = None

    for min_cluster_size in [40,45,120,50,55,130,140,150,155,160]:
        for min_samples in [40,45,50,85,55,60,90,100,110,115,120]:
            for cluster_selection_method in ['eom','leaf']:
                for metric in ['euclidean']:
                    # fit HDBSCAN for each combination of parameters
                    hdb = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, min_samples=min_samples,
                                          cluster_selection_method=cluster_selection_method, metric=metric,
                                          gen_min_span_tree=True).fit(coordinates)
                    # DBCV score (requires gen_min_span_tree=True)
                    score = hdb.relative_validity_
                    # if we got a better DBCV, store it and the parameters
                    if score > best_score:
                        best_score = score
                        best_parameters = {'min_cluster_size': min_cluster_size,
                                           'min_samples': min_samples,
                                           'cluster_selection_method': cluster_selection_method,
                                           'metric': metric}

    print("Best DBCV score: {:.3f}".format(best_score))
    print("Best parameters: {}".format(best_parameters))

    Outputs:

    Best DBCV score: 0.414
    Best parameters: {'min_cluster_size': 150, 'min_samples': 90, 'cluster_selection_method': 'eom', 'metric': 'euclidean'}
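
The nested loops above keep only the single best combination. If it is useful to see how close the runners-up are, the same exhaustive search can be sketched with itertools.product so that every score is kept and ranked. The helpers exhaustive_search and hdbscan_dbcv are hypothetical names introduced for illustration, not part of hdbscan or sklearn.

```python
from itertools import product


def exhaustive_search(fit_and_score, param_grid):
    """Evaluate every parameter combination and return results best-first.

    fit_and_score: callable taking keyword parameters and returning a score.
    param_grid: dict mapping parameter name -> list of candidate values.
    Returns a list of (score, params) tuples sorted by descending score.
    """
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((fit_and_score(**params), params))
    # Highest score first; ties keep their grid order because the sort is stable.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


def hdbscan_dbcv(coordinates):
    """Build a fit_and_score callable that fits HDBSCAN on `coordinates`."""
    import hdbscan  # imported here so exhaustive_search works without hdbscan

    def fit_and_score(**params):
        model = hdbscan.HDBSCAN(gen_min_span_tree=True, **params).fit(coordinates)
        return model.relative_validity_

    return fit_and_score
```

For example, results = exhaustive_search(hdbscan_dbcv(coordinates), {'min_cluster_size': [40, 150], 'min_samples': [50, 90]}) gives the best pair as results[0], and results[:5] shows the top five candidates for comparison.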