I got the warning "UserWarning: HDBSCAN can only compute 3 clusters. Setting n_clusters to 3..."
when I specified the parameter n_clusters=4 with HDBSCAN_flat(). Can I get max_eom_clusters
before I specify the parameter?
Actually, max_eom_clusters (named max_clusters there) is calculated in flat.py, lines 656-657:
654 # With method 'eom', max clusters are produced for epsilon=0,
655 # as computed by
656 eom_base_clusters = condensed_tree._select_clusters()
657 max_clusters = len(eom_base_clusters)
However, I can't use function _select_clusters() in my class because it is protected.
I think HDBSCAN().fit(X) gives the best cluster count, not the maximum cluster count, so it would not be right to use that best count as max_eom_clusters. Am I right?
I can extract the max_eom_clusters value from the UserWarning as shown below, but I don't think it's a really good solution.
import re
import warnings

import numpy as np
import matplotlib.pyplot as plt
from hdbscan.flat import HDBSCAN_flat

# Simulate two-dimensional data
np.random.seed(0)
data = np.random.rand(100, 2)
plt.scatter(x=data[:, 0], y=data[:, 1])
plt.show()

# Get max_eom_clusters by catching the UserWarning, which the filter below turns into an error
warnings.filterwarnings('error', category=UserWarning)
try:
    huge_n_clusters = 999
    clusterer = HDBSCAN_flat(X=data, n_clusters=huge_n_clusters)
except UserWarning as e:
    # The first number in the warning message is the maximum cluster count
    numbers = re.findall(r"[0-9]+", e.args[0])
    max_eom_clusters = int(numbers[0])
    print(f"max_eom_clusters is {max_eom_clusters}")  # max_eom_clusters is 3
And this is a picture of the simulated data: [image: scatter plot of the simulated data]
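For what it's worth, the warning-parsing workaround above could probably be written without switching the global warning filter to 'error', by recording warnings only inside a `warnings.catch_warnings` block. The following is just a minimal sketch under some assumptions: `max_clusters_from_warning` and `fake_flat_clustering` are hypothetical names I made up, the fake function is only a stand-in that reproduces the warning text quoted at the top of this post (it is not the hdbscan API), and the regular expression assumes the message wording stays as quoted.

```python
import re
import warnings


def max_clusters_from_warning(build):
    """Call build() and return the cluster count named in the first matching
    UserWarning it emits, or None if no such warning was emitted.
    (Hypothetical helper, not part of hdbscan.)"""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning raised inside this block; the previous
        # global warning filters are restored when the block exits.
        warnings.simplefilter("always")
        build()
    for w in caught:
        if issubclass(w.category, UserWarning):
            # Assumes the message wording quoted in this post.
            match = re.search(r"can only compute (\d+) clusters", str(w.message))
            if match:
                return int(match.group(1))
    return None


def fake_flat_clustering(n_clusters, available=3):
    """Hypothetical stand-in for HDBSCAN_flat: it only reproduces the warning."""
    if n_clusters > available:
        warnings.warn(
            f"HDBSCAN can only compute {available} clusters. "
            f"Setting n_clusters to {available}...",
            UserWarning,
        )
        n_clusters = available
    return n_clusters


max_eom_clusters = max_clusters_from_warning(lambda: fake_flat_clustering(999))
print(f"max_eom_clusters is {max_eom_clusters}")
```

With the real library, the callable would presumably be something like `lambda: HDBSCAN_flat(X=data, n_clusters=999)`. This is still parsing a message string, so it has the same fragility as my original workaround; it only avoids turning every UserWarning in the process into an exception.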