I have some string data that I manipulate and then cluster using HDBSCAN:
import hdbscan

textData = train['eudexHash'].apply(lambda x: str(x))
clusterer = hdbscan.HDBSCAN(min_cluster_size=5,
                            gen_min_span_tree=True,
                            prediction_data=True).fit(textData.values.reshape(-1,1))
Now, when I call approximate_predict on the clusterer with a new case, I get this result:
>>> hdbscan.approximate_predict(clusterer, testCase)
(array([113]), array([1.]))
Sweet, it looks like it's predicting new cases: it assigns the new string value to label 113. Now, how do I find which other members are in that label/bucket/cluster?
Cheers!
If you want to find out which of your training points belong to label 113, you can index the training data with a boolean mask built from clusterer.labels_:
textdata_with_label_113 = textData[clusterer.labels_ == 113]
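To avoid hard-coding 113, the same mask can be driven by whatever label approximate_predict returns. Here is a minimal sketch of that idea. The helper name members_of_predicted_cluster and the toy arrays are made up for illustration: train_labels stands in for clusterer.labels_, and predicted_labels stands in for the first array returned by approximate_predict. One thing to watch for: HDBSCAN uses -1 as the noise label, so a prediction of -1 means the point was not assigned to any cluster. The sketch returns an empty result in that case rather than lumping the point in with all the training noise.

```python
import numpy as np

def members_of_predicted_cluster(train_values, train_labels, predicted_labels):
    """For each predicted label, return the training values that share it.

    train_labels stands in for clusterer.labels_ and predicted_labels for the
    first array returned by hdbscan.approximate_predict (both assumptions of
    this sketch; the helper itself is hypothetical, not part of hdbscan).
    """
    train_values = np.asarray(train_values)
    train_labels = np.asarray(train_labels)
    members = []
    for label in np.asarray(predicted_labels):
        if label == -1:
            # -1 is the noise label: the point belongs to no cluster,
            # so report no members instead of all the training noise.
            members.append(train_values[:0])
        else:
            # Boolean mask: keep training rows whose label matches.
            members.append(train_values[train_labels == label])
    return members

# Toy stand-ins, invented purely for illustration.
toy_values = np.array(["a1", "a2", "b1", "a3", "n1"])
toy_labels = np.array([0, 0, 1, 0, -1])
result = members_of_predicted_cluster(toy_values, toy_labels, [0, -1])
```

With the objects from the question, this would be called as something like labels, strengths = hdbscan.approximate_predict(clusterer, testCase) followed by members_of_predicted_cluster(textData.values, clusterer.labels_, labels).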