Tags: scikit-learn, statsmodels, log-likelihood

Different values of the log-loss in statsmodels and sklearn


The statsmodels and sklearn libraries produce different values for the log-loss. A toy example:

import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import log_loss

df = pd.DataFrame(
    columns=['y','x1','x2'],
    data=[
        [1,3,5],
        [1,-2,7],
        [0,-1,-5],
        [0,2,3],
        [0,3,5],
    ])

logit = sm.Logit(df.y, df.drop(columns=['y']))

res = logit.fit()

The result of res.llf is -1.386294361119906, while the result of -log_loss(df.y, res.fittedvalues) is -6.907755278982137. Shouldn't they be equal (up to a small difference due to different numerical implementations)? The statsmodels documentation says that .llf is the log-likelihood of the model, and as this question and this Kaggle post point out, log_loss is just the negative of the log-likelihood.
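
For reference, the two numbers can be reproduced directly (reusing df and res from the snippet above):

res.llf
--> -1.386294361119906
-log_loss(df.y, res.fittedvalues)
--> -6.907755278982137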

Package versions: scikit-learn==1.0.1, statsmodels==0.13.5


Solution

  • As you can see, res.fittedvalues returns some negative values: it holds the linear predictor, not probabilities. If you want the predicted probabilities, use res.predict() instead, which returns values between 0 and 1.
    You can calculate the log-loss in the following ways:
    1. Using sklearn log_loss:

    log_loss(df.y, res.predict())
    --> 0.27725887222398127
    

    2. Using statsmodels:

    res.mle_retvals['fopt']
    --> 0.27725887222398116
    # or
    res.llf / res.nobs
    --> -0.27725887222398116
    

    The very small difference between the two values is due to floating-point rounding.
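
    The reason the numbers line up: sklearn's log_loss returns the mean negative log-likelihood per observation (with its default normalize=True), while llf is the log-likelihood summed over all observations, hence the sign flip and the division by res.nobs. A quick consistency check (a sketch reusing df and res from the question):

    import numpy as np

    # llf is the summed log-likelihood; log_loss is the per-observation
    # mean of the negative log-likelihood, so they differ by a factor of -nobs.
    np.testing.assert_allclose(res.llf, -res.nobs * log_loss(df.y, res.predict()))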

    Note: In order to get the predicted probabilities from res.fittedvalues, you need to apply the expit function (the inverse of the logit function):

    from scipy.special import expit
    
    expit(res.fittedvalues)
    

    This returns the same predictions as res.predict().
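
    As a quick sanity check (assuming df and res from the question are still in scope; the last digits may differ due to rounding):

    import numpy as np
    from scipy.special import expit

    p = expit(res.fittedvalues)           # predicted probabilities
    np.testing.assert_allclose(p, res.predict())

    # mean negative log-likelihood computed by hand; matches log_loss and fopt
    -np.mean(df.y * np.log(p) + (1 - df.y) * np.log(1 - p))
    --> 0.2772588722239812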