python, cluster-analysis, gmm

How to obtain GMM cluster information from model prediction?


I built a GMM model and used this to run a prediction.

import numpy as np
from sklearn.mixture import GaussianMixture
import matplotlib.pyplot as plt

bead = df['Ce140Di']
dna = df['DNA_1']
X = np.column_stack((dna, bead))  # stack the two columns into a 2-D array

#plt.scatter(X[:,0], X[:,1], s=0.5, c='black')
#plt.show()

gmm = GaussianMixture(n_components=4, covariance_type='tied')
gmm.fit(X)
labels = gmm.predict(X)

and then generated a plot as follows...

df['predicted_cluster'] = labels
fig = plt.figure()
colors = {0: 'grey', 1: 'red', 2: 'orange', 3: 'purple'}
plt.scatter(df['DNA_1'], df['Ce140Di'], c=df['predicted_cluster'].apply(lambda x: colors[x]), s=0.5, alpha=0.5)
plt.show()

scatter plot colored by predictions

Whilst I have a predicted label for each row of my df, I don't actually know which cluster it corresponds to without consulting my colors dictionary. Is there a way to do this without having to look at the scatter plot each time? In other words, I want to know that 0 will always correspond to my grey cluster, or that 1 will always be the red cluster, but this changes on each run...

Colors aside, how do I know the position of each cluster? What does a label of 0 mean?

EDIT I believe the answer to my perhaps silly question is to use np.random.seed but I could be wrong...


Solution

  • Hello Hajar,

    I think the answer to your question will disappoint you. I assume each Gaussian in your GMM is initialised to some random mean and variance. If you set a random seed then you could be reasonably certain that the resultant clusters will always be the same.

    With that said, in multi-label scenarios without a random seed there are (to my knowledge) no clustering algorithms that guarantee which label is assigned to each cluster.
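    As a minimal sketch of the random-seed point (using synthetic data in place of your DNA/bead columns, since your df isn't available here), fixing scikit-learn's `random_state` parameter makes the initialisation, and therefore the fitted labels, reproducible across runs:

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical 2-D data: four well-separated blobs standing in for the real columns.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2))
                   for m in [(0, 0), (3, 0), (0, 3), (3, 3)]])

    # With the same random_state, two independent fits initialise identically,
    # so they converge to the same components and assign the same labels.
    gmm_a = GaussianMixture(n_components=4, covariance_type='tied', random_state=42).fit(X)
    gmm_b = GaussianMixture(n_components=4, covariance_type='tied', random_state=42).fit(X)

    assert (gmm_a.predict(X) == gmm_b.predict(X)).all()
    ```

    Note this only pins the labels for a fixed dataset; refitting on different data can still permute them.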

    Clustering algorithms assign labels arbitrarily. The only guarantee any clustering algorithm makes about a point assigned a certain label is that it is similar to other points with the same label by some metric.
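    Because the labels themselves are arbitrary, one common workaround is to relabel the components yourself after fitting, using something meaningful like the position of each component mean in `gmm.means_`. A sketch on synthetic data (the sort key, here the y-coordinate, is an assumption you would pick for your own data):

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical data: four blobs stacked along the y-axis.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2))
                   for m in [(0, 0), (0, 3), (0, 6), (0, 9)]])

    gmm = GaussianMixture(n_components=4, covariance_type='tied', random_state=0).fit(X)
    labels = gmm.predict(X)

    # Rank the components by the y-coordinate of their means, then remap:
    # canonical label 0 is always the lowest cluster, 3 the highest.
    order = np.argsort(gmm.means_[:, 1])   # component indices, lowest mean first
    remap = np.empty(4, dtype=int)
    remap[order] = np.arange(4)            # old label -> canonical label
    canonical = remap[labels]
    ```

    After this remapping, a fixed colour dictionary keyed on the canonical labels will always colour the same region the same way, no matter how the fit permuted the raw labels.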

    This makes measuring the accuracy of clustering algorithms quite challenging. Hence the existence of metrics like the Adjusted Mutual Information Score and the Adjusted Rand Index.
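    Both of those scores are permutation-invariant, which is exactly why they are used here: a clustering that differs from the reference only by renaming the labels still scores perfectly. A quick illustration with scikit-learn:

    ```python
    from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

    # Two labelings of the same six points, identical up to a relabelling
    # (0->2, 1->0, 2->1).
    a = [0, 0, 1, 1, 2, 2]
    b = [2, 2, 0, 0, 1, 1]

    # Permutation-invariant: both scores treat these as a perfect match.
    print(adjusted_rand_score(a, b))         # 1.0
    print(adjusted_mutual_info_score(a, b))  # 1.0
    ```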

    You could account for this with a sort of semi-supervised approach, in which you force a particular point to start with a "ground-truth" label and hope your algorithm centres a cluster on it, but even then there may be variance.
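    scikit-learn supports something close to this through `GaussianMixture`'s `means_init` parameter: you seed each component at a point you care about and hope EM keeps the components there. A sketch with two synthetic blobs (the init points are assumptions chosen to match the blob centres):

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Hypothetical data: two well-separated blobs.
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2))
                   for m in [(0, 0), (3, 3)]])

    # Seed component 0 near (0, 0) and component 1 near (3, 3); EM will usually
    # keep each component on the cluster it started in, so label 0 stays "the
    # cluster around the origin" across runs.
    init = np.array([[0.0, 0.0], [3.0, 3.0]])
    gmm = GaussianMixture(n_components=2, means_init=init, random_state=0).fit(X)
    labels = gmm.predict(X)
    ```

    As noted above this is not a hard guarantee: with overlapping clusters or poor init points, EM can still drift a component away from its seed.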

    Good luck, and I hope this helps.