Tags: python, scikit-learn, gmm

different results obtained with GMM


I would like to cluster the classical iris dataset using a GMM. I got the dataset from:

https://gist.github.com/netj/8836201

and my program so far is the following:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture as mix
from sklearn.cross_validation import StratifiedKFold

def main():
    data=pd.read_csv("iris.csv",header=None)

    data=data.iloc[1:]

    data[4]=data[4].astype("category")

    data[4]=data[4].cat.codes

    target=np.array(data.pop(4))
    X=np.array(data).astype(float)


    kf=StratifiedKFold(target,n_folds=10,shuffle=True,random_state=1234)

    train_ind,test_ind=next(iter(kf))
    X_train=X[train_ind]
    y_train=target[train_ind]

    gmm_calc(X_train,"full",y_train)

def gmm_calc(X_train,cov,y_train):
    print X_train
    print y_train
    n_classes = len(np.unique(y_train))
    model=mix(n_components=n_classes,covariance_type=cov)
    model.means_ = np.array([X_train[y_train == i].mean(axis=0) for i in xrange(n_classes)])
    model.fit(X_train)
    y_predict=model.predict(X_train)
    print cov," ",y_train
    print cov," ",y_predict
    print (np.mean(y_predict==y_train))*100

The problem comes when I try to count the number of coincidences y_predict == y_train, because every time I run the program I get different results. For example:

First run:

full   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full   [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2
 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
0.0

Second run:

full   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full   [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
33.33333333333333

Third run:

full   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full   [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
98.51851851851852

So, as you can see, the results differ with every run. I found some example code on the Internet at:

https://scikit-learn.org/0.16/auto_examples/mixture/plot_gmm_classifier.html

There, with the full covariance, they got an accuracy of approximately 82% on the training set. What am I doing wrong in this case?

Thanks

Update: I found that the example on the Internet used the old GMM class instead of the new GaussianMixture. I also found that in the example the GMM parameters were initialized in a supervised way with: classifier.means_ = np.array([X_train[y_train == i].mean(axis=0) for i in xrange(n_classes)])

I have put the modified code above, but the results still change every time I run it, whereas with the old GMM class this does not happen.


Solution

  • 1) The GMM classifier uses the expectation-maximization (EM) algorithm to fit a mixture of Gaussian models: the Gaussian components are randomly centered on the data points, and the algorithm then moves them until it converges to a local optimum. Because of this random initialization, the results can differ on each run. You therefore have to use the random_state parameter of GaussianMixture as well (or set a higher number of initializations with n_init and expect more similar results); a minimal sketch follows.
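
    For example, here is a minimal sketch of making the new GaussianMixture reproducible (the values passed to n_init and random_state are arbitrary; X_train is the training array from the question):

    from sklearn.mixture import GaussianMixture

    # random_state fixes the EM initialization, so the fitted components
    # (and therefore the predicted labels) are identical on every run;
    # n_init runs EM several times and keeps the best solution found.
    model = GaussianMixture(n_components=3, covariance_type="full",
                            n_init=10, random_state=1234)
    model.fit(X_train)
    y_predict = model.predict(X_train)

    # Note: fit() re-initializes the parameters, so assigning model.means_
    # before calling fit() has no effect; pass means_init to the constructor
    # if you want the supervised initialization used in the old GMM example.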

    2) The issue with the accuracy happens because GMM (just like k-means) only fits n Gaussians and reports the Gaussian component "number" to which each point belongs; this numbering differs in every run. You can see in your predictions that the clusters are the same, but their labels are swapped: (1,2,0) -> (1,0,2) -> (0,1,2); the last combination coincides with the proper classes, so you get a 98% score. If you plot them, you can see that the Gaussians themselves tend to stay the same across runs (the plotting code at the end of this answer shows this). You could use a number of clustering metrics that take the label permutation into account (a sketch that remaps the cluster labels back to the true classes follows the metrics example):

    >>> from sklearn import metrics
    >>> [round(i,5) for i in  (metrics.homogeneity_score(y_predict, y_train),
     metrics.completeness_score(y_predict, y_train),
     metrics.v_measure_score(y_predict,y_train),
     metrics.adjusted_rand_score(y_predict, y_train),
     metrics.adjusted_mutual_info_score(y_predict,  y_train))]
    [0.86443, 0.8575, 0.86095, 0.84893, 0.85506]
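
    If you specifically want an accuracy figure rather than a clustering metric, one option is to remap each predicted cluster label to its best-matching true class before comparing. Here is a minimal sketch (assuming SciPy is available; y_train and y_predict are the arrays from above):

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import confusion_matrix

    # Rows of the confusion matrix are true classes, columns are cluster labels;
    # the Hungarian algorithm picks the label permutation with the most matches.
    cm = confusion_matrix(y_train, y_predict)
    true_ind, cluster_ind = linear_sum_assignment(-cm)  # negate to maximize
    mapping = dict(zip(cluster_ind, true_ind))           # cluster label -> class
    y_mapped = np.array([mapping[label] for label in y_predict])
    print(np.mean(y_mapped == y_train) * 100)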
    

    Code for plotting, adapted from https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html; note that the code differs between scikit-learn versions, so if you use an older one you need to replace the make_ellipses function:

    model = mix(n_components=len(np.unique(y_train)), covariance_type="full", verbose=0, n_init=100)
    X_train = X_train.astype(float)
    model.fit(X_train)
    y_predict = model.predict(X_train)
    
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    
    def make_ellipses(gmm, ax):
        for n, color in enumerate(['navy', 'turquoise', 'darkorange']):
            if gmm.covariance_type == 'full':
                covariances = gmm.covariances_[n][:2, :2]
            elif gmm.covariance_type == 'tied':
                covariances = gmm.covariances_[:2, :2]
            elif gmm.covariance_type == 'diag':
                covariances = np.diag(gmm.covariances_[n][:2])
            elif gmm.covariance_type == 'spherical':
                covariances = np.eye(gmm.means_.shape[1]) * gmm.covariances_[n]
            v, w = np.linalg.eigh(covariances)
            u = w[0] / np.linalg.norm(w[0])
            angle = np.arctan2(u[1], u[0])
            angle = 180 * angle / np.pi  # convert to degrees
            v = 2. * np.sqrt(2.) * np.sqrt(v)
            ell = mpl.patches.Ellipse(gmm.means_[n, :2], v[0], v[1],
                                      angle=180 + angle, color=color)
            ell.set_clip_box(ax.bbox)
            ell.set_alpha(0.5)
            ax.add_artist(ell)
    
    
    def plot(model, X, y, y_predict):
    
        h = plt.subplot(1, 1, 1)
        plt.subplots_adjust(bottom=.01, top=0.95, hspace=.15, wspace=.05,
                        left=.01, right=.99)
        make_ellipses(model, h)
        for n, color in enumerate(['navy', 'turquoise', 'darkorange']):
            plt.scatter(X[y == n][:, 0], X[y == n][:, 1], color=color, marker='x')
        plt.text(0.05, 0.9, 'Accuracy: %.1f' % (np.mean(y_predict == y) * 100),
                 transform=h.transAxes)
    
        plt.show()
    plot(model, X_train, y_train, y_predict)