python scikit-learn cluster-analysis data-analysis gmm

How to get log-likelihood for each iteration in sklearn GMM?


I am trying to fit a GMM in sklearn, and I can see that the model converges at around iteration 3, but I can't seem to access the log-likelihood score computed at each iteration.

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4, tol=1e-8).fit(data)

Is there a way to access the log-likelihood score for each iteration?


Solution

  • If you just want to look at the log-likelihood scores, you can set verbose=2 to print the change in log-likelihood at each EM step, and verbose_interval=1 so that every iteration is printed:

    from sklearn.mixture import GaussianMixture
    gmm = GaussianMixture(n_components=3, tol=1e-8, verbose=2, verbose_interval=1)
    gmm.fit(data)
    
    Initialization 0
      Iteration 1    time lapse 0.00560s     ll change inf
      Iteration 2    time lapse 0.00134s     ll change 0.03655
      Iteration 3    time lapse 0.00119s     ll change 0.00867
      Iteration 4    time lapse 0.00118s     ll change 0.00619
      Iteration 5    time lapse 0.00116s     ll change 0.00612
      Iteration 6    time lapse 0.00125s     ll change 0.00647
      Iteration 7    time lapse 0.00128s     ll change 0.00700
      Iteration 8    time lapse 0.00127s     ll change 0.00727
      Iteration 9    time lapse 0.00126s     ll change 0.00673
      Iteration 10   time lapse 0.00117s     ll change 0.00604
      Iteration 11   time lapse 0.00109s     ll change 0.00530
      Iteration 12   time lapse 0.00125s     ll change 0.00431
      Iteration 13   time lapse 0.00121s     ll change 0.00366
      Iteration 14   time lapse 0.00123s     ll change 0.00404
      Iteration 15   time lapse 0.00130s     ll change 0.00361
      Iteration 16   time lapse 0.00118s     ll change 0.00157
      Iteration 17   time lapse 0.00124s     ll change 0.00048
      Iteration 18   time lapse 0.00126s     ll change 0.00015
      Iteration 19   time lapse 0.00115s     ll change 0.00005
      Iteration 20   time lapse 0.00116s     ll change 0.00001
      Iteration 21   time lapse 0.00124s     ll change 0.00000
      Iteration 22   time lapse 0.00122s     ll change 0.00000
      Iteration 23   time lapse 0.00142s     ll change 0.00000
      Iteration 24   time lapse 0.00126s     ll change 0.00000
      Iteration 25   time lapse 0.00124s     ll change 0.00000
      Iteration 26   time lapse 0.00122s     ll change 0.00000
      Iteration 27   time lapse 0.00120s     ll change 0.00000
    Initialization converged: True   time lapse 0.03765s     ll -1.20124
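
    Note that after fitting, sklearn exposes only the final value of this trace, via the gmm.lower_bound_ attribute (the log-likelihood lower bound of the best fit), which should match the ll reported on the converged line:

    print(gmm.lower_bound_)  # should print roughly -1.20124 for the run above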
    

    To actually capture these values, it depends on your setup: you can write the output to a log using the logging module, or, as shown below for a Jupyter notebook, capture stdout with the %%capture cell magic:

    %%capture cap --no-stderr
    gmm.fit(data)
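
    Outside a notebook, a minimal sketch using contextlib.redirect_stdout from the standard library captures the same text (the verbose messages are ordinary prints to stdout):

    import io
    from contextlib import redirect_stdout

    buf = io.StringIO()
    with redirect_stdout(buf):  # collect everything fit() prints
        gmm.fit(data)
    captured = buf.getvalue()   # the same text %%capture stores in cap.stdout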
    

    Then we parse the captured output into a DataFrame and back-calculate the log-likelihood: since each row only reports the change, we recover the running value by subtracting the reverse cumulative sum of the changes from the final log-likelihood on the converged line:

    import numpy as np
    import pandas as pd

    # column 1 is the iteration number, column 7 the "ll change" value
    res = pd.DataFrame([i.split() for i in cap.stdout.split("\n")]).iloc[:, [1, 7]]
    res.columns = ['iteration', 'change']
    res.change = res.change.astype('float64')
    # drop non-iteration lines and iteration 1, whose change is inf
    res = res[np.isfinite(res.change)]
    res['logLik'] = res['change'].values[-1]
    # subtract the reverse cumulative sum of the changes from the final ll
    res.loc[:len(res), ['logLik']] = -res.change[:-1][::-1].cumsum()[::-1] + res.change.values[-1]
    res
    
    
        iteration   change  logLik
    2   2   0.03655 -1.31546
    3   3   0.00867 -1.27891
    4   4   0.00619 -1.27024
    5   5   0.00612 -1.26405
    6   6   0.00647 -1.25793
    7   7   0.00700 -1.25146
    8   8   0.00727 -1.24446
    9   9   0.00673 -1.23719
    10  10  0.00604 -1.23046
    11  11  0.00530 -1.22442
    12  12  0.00431 -1.21912
    13  13  0.00366 -1.21481
    14  14  0.00404 -1.21115
    15  15  0.00361 -1.20711
    16  16  0.00157 -1.20350
    17  17  0.00048 -1.20193
    18  18  0.00015 -1.20145
    19  19  0.00005 -1.20130
    20  20  0.00001 -1.20125
    21  21  0.00000 -1.20124
    22  22  0.00000 -1.20124
    23  23  0.00000 -1.20124
    24  24  0.00000 -1.20124
    25  25  0.00000 -1.20124
    26  26  0.00000 -1.20124
    27  27  0.00000 -1.20124
    28  converged:  -1.20124    -1.20124
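
  • Alternatively, a minimal sketch that avoids parsing printed output: with warm_start=True and max_iter=1, each call to fit() runs a single EM iteration starting from the previous parameters, so you can record gmm.lower_bound_ after every step yourself:

    import warnings

    from sklearn.exceptions import ConvergenceWarning
    from sklearn.mixture import GaussianMixture

    gmm = GaussianMixture(n_components=3, max_iter=1, warm_start=True)
    log_liks = []
    with warnings.catch_warnings():
        # each single-iteration fit() raises a ConvergenceWarning; silence it
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        for _ in range(100):
            gmm.fit(data)                      # one EM step per call
            log_liks.append(gmm.lower_bound_)  # log-likelihood after this step
            # stop once the change drops below the same tol used above
            if len(log_liks) > 1 and abs(log_liks[-1] - log_liks[-2]) < 1e-8:
                break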