
Retrieving members of a cluster with HDBSCAN


I have some string data that I preprocess and then cluster using HDBSCAN:

import hdbscan

# train is my existing DataFrame; cast the hash column to strings
textData = train['eudexHash'].apply(str)
clusterer = hdbscan.HDBSCAN(min_cluster_size=5,
                            gen_min_span_tree=True,
                            prediction_data=True).fit(textData.values.reshape(-1,1))

Now, when I pass the fitted clusterer and a new case to approximate_predict, I get this result:

>>> hdbscan.approximate_predict(clusterer, testCase)
(array([113]), array([1.]))

Sweet, it looks like it's predicting new cases: it assigns the new string value to label 113 with a membership strength of 1.0. Now, how do I find which other members are in that label/bucket/cluster?

Cheers!


Solution

  • If you want to find out which of your training data points belong to label 113, you can index the training data with a boolean mask built from clusterer.labels_:

    textdata_with_label_113 = textData[clusterer.labels_ == 113]
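
To tie this back to the prediction step, here is a minimal, dependency-free sketch of the same lookup. The helper members_of_label and the sample values are hypothetical: train_items stands in for textData, train_labels for clusterer.labels_, and predicted_labels for the first array returned by approximate_predict. It assumes the labels line up one-to-one with the training items, and it skips label -1, which HDBSCAN uses for noise points.

```python
def members_of_label(data, labels, label):
    """Return (position, item) pairs for every item whose cluster label equals `label`."""
    return [(i, item)
            for i, (item, lab) in enumerate(zip(data, labels))
            if lab == label]


# Hypothetical stand-ins for textData and clusterer.labels_.
train_items = ["A1", "B2", "A3", "C4", "A5"]
train_labels = [113, 7, 113, -1, 113]

# Hypothetical stand-in for the first array returned by approximate_predict.
predicted_labels = [113]

for predicted in predicted_labels:
    if predicted == -1:
        # -1 marks noise, so there is no cluster to look up.
        print("new point was classified as noise")
    else:
        print(predicted, members_of_label(train_items, train_labels, predicted))
```

With a pandas Series or NumPy array, the boolean mask shown in the answer does the same filtering in one step; the loop version is only meant to make the logic explicit and to return the positions of the matching rows as well.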