I would like to cluster the classical iris dataset with a GMM. I got the dataset from:
https://gist.github.com/netj/8836201
and my program so far is the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture as mix
from sklearn.cross_validation import StratifiedKFold

def main():
    data = pd.read_csv("iris.csv", header=None)
    data = data.iloc[1:]
    data[4] = data[4].astype("category")
    data[4] = data[4].cat.codes
    target = np.array(data.pop(4))
    X = np.array(data).astype(float)
    kf = StratifiedKFold(target, n_folds=10, shuffle=True, random_state=1234)
    train_ind, test_ind = next(iter(kf))
    X_train = X[train_ind]
    y_train = target[train_ind]
    gmm_calc(X_train, "full", y_train)

def gmm_calc(X_train, cov, y_train):
    print X_train
    print y_train
    n_classes = len(np.unique(y_train))
    model = mix(n_components=n_classes, covariance_type="full")
    model.means_ = np.array([X_train[y_train == i].mean(axis=0)
                             for i in xrange(n_classes)])
    model.fit(X_train)
    y_predict = model.predict(X_train)
    print cov, " ", y_train
    print cov, " ", y_predict
    print (np.mean(y_predict == y_train)) * 100
The problem comes when I try to count how often y_predict matches y_train, because every time I run the program I get different results. For example:
First run:
full [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2
2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
0.0
Second run:
full [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
33.33333333333333
Third run:
full [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
98.51851851851852
So, as you can see, the results differ with every run. I found some example code on the Internet at:
https://scikit-learn.org/0.16/auto_examples/mixture/plot_gmm_classifier.html
but there they get an accuracy of approximately 82% on the train set with the full covariance type. What am I doing wrong in this case?
Thanks
Update: I found that the Internet example used the old GMM class instead of the new GaussianMixture. I also found that in the example the GMM parameters were initialized in a supervised way with: classifier.means_ = np.array([X_train[y_train == i].mean(axis=0) for i in xrange(n_classes)])
I have put the modified code above, but the results still change every time I run it, while with the old GMM library this does not happen.
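Side note: as far as I can tell, the new GaussianMixture re-initializes the means inside fit(), so assigning to model.means_ beforehand has no effect; the supervised initialization apparently has to go through the means_init constructor parameter instead. A minimal sketch, reusing X_train and y_train from the code above:

import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch: supervised initialization with the new API. means_init and
# random_state are actual GaussianMixture parameters; X_train and y_train
# are assumed to come from the code above.
n_classes = len(np.unique(y_train))
init_means = np.array([X_train[y_train == i].mean(axis=0)
                       for i in range(n_classes)])
model = GaussianMixture(n_components=n_classes, covariance_type="full",
                        means_init=init_means, random_state=1234)
model.fit(X_train)
y_predict = model.predict(X_train)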
1) The GMM classifier uses the Expectation-Maximization (EM) algorithm to fit a mixture of Gaussian models: the Gaussian components are randomly centered on data points, and the algorithm then moves them until it converges to a local optimum. Because of the random initialization, the results can differ on each run. You therefore have to use the random_state parameter of GMM (or try to set a higher number of initializations, n_init, and expect more similar results).
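For instance, something along these lines should give reproducible clusters (a minimal sketch with the new GaussianMixture; random_state and n_init are real parameters of that class, and X_train is taken from the question):

from sklearn.mixture import GaussianMixture

# Sketch: fix the RNG seed and keep the best of 20 EM restarts so the
# fitted mixture is the same every time the script runs.
model = GaussianMixture(n_components=3, covariance_type="full",
                        n_init=20, random_state=1234)
model.fit(X_train)
y_predict = model.predict(X_train)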
2) The issue with accuracy happens because GMM (same as kmeans) just fits n Gaussians and reports the Gaussian component "number" to which each point belongs; this numbering differs from run to run. You can see in your predictions that the clusters are the same, but their labels are permuted: (1,2,0) -> (1,0,2) -> (0,1,2), and the last combination happens to coincide with the proper classes, so you get the 98% score. If you plot them you can see that the Gaussians themselves tend to stay the same in this case (see the plotting code below).
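If you still want a plain accuracy number against the true labels, one option (not part of the original approach, just a sketch) is to first remap each predicted cluster label to the best-matching true class, e.g. with the Hungarian algorithm on the confusion matrix:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def remapped_accuracy(y_true, y_pred):
    # Align predicted cluster labels with true classes: find the label
    # permutation that maximizes the confusion-matrix diagonal, then
    # report the fraction of points on that diagonal.
    cm = confusion_matrix(y_true, y_pred)
    row_ind, col_ind = linear_sum_assignment(-cm)
    return cm[row_ind, col_ind].sum() / float(cm.sum())

# remapped_accuracy(y_train, y_predict) is now invariant to label swaps.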
You could use a number of clustering metrics that take this into account:
>>> from sklearn import metrics
>>> [round(i, 5) for i in (metrics.homogeneity_score(y_predict, y_train),
...                        metrics.completeness_score(y_predict, y_train),
...                        metrics.v_measure_score(y_predict, y_train),
...                        metrics.adjusted_rand_score(y_predict, y_train),
...                        metrics.adjusted_mutual_info_score(y_predict, y_train))]
[0.86443, 0.8575, 0.86095, 0.84893, 0.85506]
Code for plotting, adapted from https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html. Note that the code differs between scikit-learn versions; if you use an old one you need to replace the make_ellipses function:
model = mix(n_components=len(np.unique(y_train)), covariance_type="full",
            verbose=0, n_init=100)
X_train = X_train.astype(float)
model.fit(X_train)
y_predict = model.predict(X_train)

import matplotlib as mpl
import matplotlib.pyplot as plt

def make_ellipses(gmm, ax):
    # Draw one covariance ellipse per Gaussian component, using the first
    # two features only.
    for n, color in enumerate(['navy', 'turquoise', 'darkorange']):
        if gmm.covariance_type == 'full':
            covariances = gmm.covariances_[n][:2, :2]
        elif gmm.covariance_type == 'tied':
            covariances = gmm.covariances_[:2, :2]
        elif gmm.covariance_type == 'diag':
            covariances = np.diag(gmm.covariances_[n][:2])
        elif gmm.covariance_type == 'spherical':
            covariances = np.eye(gmm.means_.shape[1]) * gmm.covariances_[n]
        v, w = np.linalg.eigh(covariances)
        u = w[0] / np.linalg.norm(w[0])
        angle = np.arctan2(u[1], u[0])
        angle = 180 * angle / np.pi  # convert to degrees
        v = 2. * np.sqrt(2.) * np.sqrt(v)
        ell = mpl.patches.Ellipse(gmm.means_[n, :2], v[0], v[1],
                                  180 + angle, color=color)
        ell.set_clip_box(ax.bbox)
        ell.set_alpha(0.5)
        ax.add_artist(ell)

def plot(model, X, y, y_predict):
    # Scatter the points colored by true class, overlay the fitted Gaussians,
    # and annotate with the raw (label-sensitive) accuracy.
    h = plt.subplot(1, 1, 1)
    plt.subplots_adjust(bottom=.01, top=0.95, hspace=.15, wspace=.05,
                        left=.01, right=.99)
    make_ellipses(model, h)
    for n, color in enumerate(['navy', 'turquoise', 'darkorange']):
        plt.scatter(X[y == n][:, 0], X[y == n][:, 1], color=color, marker='x')
    plt.text(0.05, 0.9, 'Accuracy: %.1f' % ((np.mean(y_predict == y)) * 100),
             transform=h.transAxes)
    plt.show()

plot(model, X_train, y_train, y_predict)