I'm currently trying to use HDBSCAN to cluster a bunch of movie data, in order to group similar content together and come up with 'topics' that describe those clusters. I'm interested in HDBSCAN because, unlike K-Means, it supports soft clustering, which seems more suitable for my goal.
After performing HDBSCAN, I was able to find which movies were placed in each cluster. What I now want is to find which terms/words represent each cluster.
I've done something similar with KMeans (code below):
model = KMeans(n_clusters=70)
model.fit(text)
clusters = model.predict(text)
model_labels = model.labels_
output = model.transform(text)

titles = []
for i in data['title']:
    titles.append(i)
genres = []
for i in data['genres']:
    genres.append(i)

films_kmeans = {'title': titles, 'info': dataset_list2, 'cluster': clusters, 'genre': genres}
frame_kmeans = pd.DataFrame(films_kmeans, index=[clusters])

print("Top terms per cluster:")
print()
# indices of the terms in each cluster center, sorted by weight (descending)
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
for i in range(70):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :5]:
        print(' %s' % tfidf_feature_names[ind])
    print()
    print("Cluster %d titles:" % i, end='')
    for title in frame_kmeans.loc[i]['title'].values.tolist():
        print(' %s,' % title, end='')
    print()  # add whitespace
    print()  # add whitespace
While this works fine for KMeans, I couldn't find a similar way to do it for HDBSCAN, since HDBSCAN doesn't have cluster centers. I've been looking at the documentation, but I'm pretty new to this and haven't been able to solve my issue.
Any ideas would be very much appreciated! Thank you for your time.
I ran into a similar problem and taking the lead from @ajmartin's advice, the code below worked for me.
Assuming you have a list, label, containing the original label for each point, and a fitted HDBSCAN object, clusterer = hdbscan.HDBSCAN(min_cluster_size=10).fit(X):
from operator import itemgetter
from collections import defaultdict

def get_top_terms(k):
    top_terms = defaultdict(list)
    # pair each point's cluster membership probability with its original label
    for c_lab, prob, text_lab in zip(clusterer.labels_, clusterer.probabilities_, label):
        top_terms[c_lab].append((prob, text_lab))
    for c_lab in top_terms:
        top_terms[c_lab].sort(reverse=True, key=itemgetter(0))  # sort each pair by probability, descending
    # -- print the top k terms per cluster --
    for c_lab in top_terms:
        print(c_lab, top_terms[c_lab][:k])
    return top_terms
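To see what the function does without fitting a model, here is a self-contained version of the same logic with made-up cluster labels, membership probabilities, and titles (all names hypothetical): each cluster's points are sorted by probability so the highest-confidence members come first, and -1 collects HDBSCAN's noise points.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical stand-ins for clusterer.labels_, clusterer.probabilities_, and label
cluster_labels = [0, 0, 1, 1, -1]            # -1 is HDBSCAN's noise label
probabilities = [0.9, 0.4, 0.7, 0.95, 0.0]   # membership strength per point
text_labels = ['Alien', 'Blade Runner', 'Clueless', 'Heathers', 'Eraserhead']

def get_top_terms(k):
    top_terms = defaultdict(list)
    for c_lab, prob, text_lab in zip(cluster_labels, probabilities, text_labels):
        top_terms[c_lab].append((prob, text_lab))
    for c_lab in top_terms:
        top_terms[c_lab].sort(reverse=True, key=itemgetter(0))
    for c_lab in top_terms:
        print(c_lab, top_terms[c_lab][:k])
    return top_terms

top = get_top_terms(1)
# cluster 0's strongest member is ('Alien', 0.9), cluster 1's is ('Heathers', 0.95)
```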
# -- for visualization (add this snippet before plt.scatter(..)) --
from collections import Counter

plt.figure(figsize=(16, 16))
plt.title('min_cluster_size=10')
plot_top = Counter()  # to get only distinct labels, replace with a set and add a check here [1]
top_terms = get_top_terms(10)

for i, lab, prob in zip(range(len(clusterer.labels_)), clusterer.labels_, clusterer.probabilities_):  # pointwise iteration
    if plot_top[lab] < 10:
        for el in top_terms[lab][:10]:
            if prob == el[0]:  # [1]
                plot_top[lab] += 1
                # x[i], y[i] are the projected points in 2D space
                plt.annotate(el[1], (x[i], y[i]), horizontalalignment='center',
                             verticalalignment='center', size=9.5)
                break
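As another option, not from the original post: even though HDBSCAN has no cluster_centers_, you can build a pseudo-centroid per cluster by averaging that cluster's tf-idf rows, then reuse the argsort trick from the question's KMeans code to rank terms. A minimal sketch with numpy and a tiny made-up tf-idf matrix (all array contents and names are hypothetical):

```python
import numpy as np

# Hypothetical stand-ins: rows of a tf-idf matrix, HDBSCAN labels (-1 = noise),
# and the vectorizer's feature names.
tfidf = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.1, 0.6, 0.3],
    [0.2, 0.2, 0.2, 0.2],  # noise point
])
labels = np.array([0, 0, 1, 1, -1])
feature_names = np.array(['action', 'drama', 'comedy', 'romance'])

def top_terms_per_cluster(tfidf, labels, feature_names, k=2):
    """Average each cluster's tf-idf rows into a pseudo-centroid and
    return the k highest-weighted terms; the noise label -1 is skipped."""
    out = {}
    for lab in sorted(set(labels.tolist()) - {-1}):
        centroid = tfidf[labels == lab].mean(axis=0)
        out[lab] = [str(t) for t in feature_names[centroid.argsort()[::-1][:k]]]
    return out

print(top_terms_per_cluster(tfidf, labels, feature_names))
# {0: ['action', 'drama'], 1: ['comedy', 'romance']}
```

With a real sparse tf-idf matrix from TfidfVectorizer you'd convert the per-cluster mean to a dense array first (e.g. np.asarray(mat[mask].mean(axis=0)).ravel()).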