The libraries statsmodels and sklearn produce different values of the log-loss function. A toy example:
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import log_loss
df = pd.DataFrame(
    columns=['y', 'x1', 'x2'],
    data=[
        [1, 3, 5],
        [1, -2, 7],
        [0, -1, -5],
        [0, 2, 3],
        [0, 3, 5],
    ])
logit = sm.Logit(df.y,df.drop(columns=['y']))
res = logit.fit()
The result of res.llf is -1.386294361119906, while the result of -log_loss(df.y, res.fittedvalues) is -6.907755278982137. Shouldn't they be equal (up to a small difference due to different numerical implementations)? The statsmodels documentation says that .llf is the log likelihood of the model, and as this question and this Kaggle post point out, log_loss is just the negative of the log likelihood.
Package versions: scikit-learn==1.0.1, statsmodels==0.13.5
As you can see, res.fittedvalues returns some negative values: for statsmodels' Logit it is the linear predictor, not a probability. If you want predicted probabilities, you should use res.predict() instead (values between 0 and 1).
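A quick check (reusing df and res from the question): res.fittedvalues is the linear predictor X @ params, so it is unbounded rather than a probability:

import numpy as np
X = df.drop(columns=['y']).to_numpy()
print(res.fittedvalues)           # linear predictor, can be negative
print(X @ res.params.to_numpy())  # the same values computed by hand
print(res.predict())              # probabilities between 0 and 1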
You can calculate the log-loss in the following ways (note that sklearn's log_loss averages over the observations by default, so it corresponds to the negative log likelihood divided by the number of observations):
1. Using sklearn log_loss:
log_loss(df.y, res.predict())
--> 0.27725887222398127
2. Using statsmodels:
res.mle_retvals['fopt']
--> 0.27725887222398116
# or
-res.llf / res.nobs
--> 0.27725887222398116
The very small difference is due to floating-point rounding; a manual computation (sketched below) reproduces the same number.
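For example, computing the mean negative log-likelihood directly from the predicted probabilities (a minimal numpy sketch, using df and res from above):

import numpy as np
p = res.predict()      # predicted probabilities
y = df.y.to_numpy()
print(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))  # ~0.2772588722239812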
Note: In order to get the predicted probabilities from res.fittedvalues, you need to apply the expit function (the inverse of logit):
from scipy.special import expit
expit(res.fittedvalues)
This returns the same predictions as res.predict().
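For instance, a quick check that the two agree (assuming df and res as above):

import numpy as np
from scipy.special import expit
print(np.allclose(expit(res.fittedvalues), res.predict()))  # True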