python hierarchical-clustering

How to get the maximum value that n_clusters (a parameter of hdbscan.flat.HDBSCAN_flat()) can take


Question 1

I got the warning "UserWarning: HDBSCAN can only compute 3 clusters. Setting n_clusters to 3..." when I passed n_clusters=4 to HDBSCAN_flat(). Can I get max_eom_clusters before I specify the parameter?

My Attempt

Actually, max_eom_clusters (called max_clusters there) is calculated in flat.py, lines 656-657:

654 # With method 'eom', max clusters are produced for epsilon=0,
655     #   as computed by
656 eom_base_clusters = condensed_tree._select_clusters()
657 max_clusters = len(eom_base_clusters)

However, I can't use function _select_clusters() in my class because it is protected.
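For reference, a leading underscore in Python is a naming convention rather than an access restriction, so the call shown in flat.py can be made from outside the class. The following is a minimal sketch, assuming a fitted hdbscan.HDBSCAN object exposes its condensed tree as the condensed_tree_ attribute; count_eom_clusters is a hypothetical helper name, and because _select_clusters() is private it may change between hdbscan versions.

```python
def count_eom_clusters(clusterer):
    """Return how many clusters the condensed tree selects.

    `clusterer` is assumed to be a fitted hdbscan.HDBSCAN instance;
    for this sketch, any object exposing `condensed_tree_` works.
    """
    # _select_clusters() is private by convention only: Python does not
    # block the call, but the method is not part of the public API.
    return len(clusterer.condensed_tree_._select_clusters())


# Assumed usage with hdbscan installed (not executed here):
#   from hdbscan import HDBSCAN
#   clusterer = HDBSCAN().fit(data)
#   max_eom_clusters = count_eom_clusters(clusterer)
```

The result reflects whatever selection method the fitted clusterer was configured with, so it should only be read as the EOM maximum when the default 'eom' method is in use.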

Question 2

I think HDBSCAN().fit(X) gives the best number of clusters, not the maximum number of clusters, so it would be wrong to use that best number as max_eom_clusters. Am I right?


Solution

  • I can extract the max_eom_clusters value from the UserWarning as below, but I don't think it's a really good solution.

    import numpy as np
    import matplotlib.pyplot as plt
    from hdbscan.flat import HDBSCAN_flat
    import re
    
    # simulate two-dimensional data
    np.random.seed(0)
    data = np.random.rand(100, 2)
    plt.scatter(x=data[:, 0], y=data[:, 1])
    plt.show()
    
    # get max_eom_clusters by catching the UserWarning, which the filter below turns into an error
    import warnings
    warnings.filterwarnings('error', category=UserWarning)
    try:
        huge_n_clusters = 999  # deliberately larger than any possible cluster count
        clusterer = HDBSCAN_flat(X=data, n_clusters=huge_n_clusters)
    except UserWarning as e:
        # the first number in the warning message is the maximum cluster count
        numbers = re.findall("[0-9]+", e.args[0])
        max_eom_clusters = int(numbers[0])  # int() instead of eval(): safer for parsing digits
        print(f"max_eom_clusters is {max_eom_clusters}") # max_eom_clusters is 3
    

    And this is a picture of the simulated data: [Figure: scatter plot of the simulated data]
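The warning-parsing workaround can also be written so that the warning filter is scoped to a single call instead of being changed for the whole process. This is only a sketch: parse_max_clusters, max_clusters_from_warning and fake_hdbscan_flat are hypothetical names, and fake_hdbscan_flat is a stand-in that emits the message quoted in the question so the example runs without hdbscan. The message wording is an assumption taken from that quote and may differ between versions.

```python
import re
import warnings


def parse_max_clusters(message):
    """Return the first integer found in a warning message, or None."""
    match = re.search(r"\d+", message)
    return int(match.group()) if match else None


def max_clusters_from_warning(build_clusterer):
    """Call build_clusterer() and return the limit reported in its
    UserWarning, or None if no matching warning was issued."""
    # catch_warnings restores the previous filters when the block exits.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning
        build_clusterer()
    for w in caught:
        text = str(w.message)
        if issubclass(w.category, UserWarning) and "can only compute" in text:
            return parse_max_clusters(text)
    return None


def fake_hdbscan_flat():
    # Hypothetical stand-in for HDBSCAN_flat(X=data, n_clusters=999).
    warnings.warn(
        "HDBSCAN can only compute 3 clusters. Setting n_clusters to 3...",
        UserWarning,
    )


print(max_clusters_from_warning(fake_hdbscan_flat))
```

With hdbscan installed, the stand-in would presumably be replaced by something like lambda: HDBSCAN_flat(X=data, n_clusters=999). Unlike the try/except version, the warning is recorded rather than raised, so the clusterer is still built.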