logistic-regression dask dask-ml

Different results from scikit-learn and dask-ml LogisticRegression


When I run LogisticRegression on the same data, I expect scikit-learn and dask-ml to give the same results, but they differ substantially.

Versions: scikit-learn=0.21.2
dask-ml=1.0.0

First with dask-ml LogisticRegression:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn import metrics
from dask_ml.linear_model import LogisticRegression

# Load the 10-class digits dataset and hold out 25% for testing.
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)

# Fit the dask-ml estimator, then report accuracy and the confusion matrix.
lr = LogisticRegression(solver_kwargs={"normalize": False})
lr.fit(x_train, y_train)
score = lr.score(x_test, y_test)
print(score)
predictions = lr.predict(x_test)
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)

And now with sklearn LogisticRegression:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# Same data and split as above.
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=0)

# Fit the scikit-learn estimator, then report accuracy and the confusion matrix.
lr = LogisticRegression()
lr.fit(x_train, y_train)
score = lr.score(x_test, y_test)
print(score)
predictions = lr.predict(x_test)
cm = metrics.confusion_matrix(y_test, predictions)
print(cm)

Score and confusion matrix for scikit-learn:

0.9533333333333334
[[37  0  0  0  0  0  0  0  0  0]
 [ 0 39  0  0  0  0  2  0  2  0]
 [ 0  0 41  3  0  0  0  0  0  0]
 [ 0  0  1 43  0  0  0  0  0  1]
 [ 0  0  0  0 38  0  0  0  0  0]
 [ 0  1  0  0  0 47  0  0  0  0]
 [ 0  0  0  0  0  0 52  0  0  0]
 [ 0  1  0  1  1  0  0 45  0  0]
 [ 0  3  1  0  0  0  0  0 43  1]
 [ 0  0  0  1  0  1  0  0  1 44]]

Score and confusion matrix for dask-ml:

0.09555555555555556
[[ 0 37  0  0  0  0  0  0  0  0]
 [ 0 43  0  0  0  0  0  0  0  0]
 [ 0 44  0  0  0  0  0  0  0  0]
 [ 0 45  0  0  0  0  0  0  0  0]
 [ 0 38  0  0  0  0  0  0  0  0]
 [ 0 48  0  0  0  0  0  0  0  0]
 [ 0 52  0  0  0  0  0  0  0  0]
 [ 0 48  0  0  0  0  0  0  0  0]
 [ 0 48  0  0  0  0  0  0  0  0]
 [ 0 47  0  0  0  0  0  0  0  0]]

Solution

  • As of dask_ml==1.0.0, dask-ml's LogisticRegression doesn't support multiple classes. Using a slightly modified version of your original example, if you print the predictions from the fitted dask-ml LogisticRegression classifier, you'll see that it returns a boolean array filled with True.

    from sklearn.datasets import load_digits
    from dask_ml.linear_model import LogisticRegression
    
    X, y = load_digits(return_X_y=True)
    lr = LogisticRegression(solver_kwargs={"normalize": False})
    lr.fit(X, y)
    predictions = lr.predict(X)
    print('predictions = {}'.format(predictions))
    

    This outputs:

    predictions = [ True  True  True ...  True  True  True]
    

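    This is why the dask-ml and scikit-learn confusion matrices differ from one another: every dask-ml prediction is True, which compares equal to class 1, so all test samples land in the second column of the matrix.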
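As a quick check of the arithmetic behind the dask-ml score, here is a minimal sketch in plain Python. It assumes the per-class test counts are the row sums of the confusion matrices in the question; the variable names are illustrative only. Because Python treats True as equal to 1, an all-True prediction array is only correct for samples whose true label is 1.

```python
# Per-class test-set counts for digits 0 through 9, read off as the
# row sums of the confusion matrices shown in the question.
class_counts = [37, 43, 44, 45, 38, 48, 52, 48, 48, 47]

# Rebuild the true labels and an all-True prediction list like the one
# the dask-ml classifier returned.
y_true = [label for label, n in enumerate(class_counts) for _ in range(n)]
y_pred = [True] * len(y_true)

# True == 1 in Python, so a prediction only counts as correct
# when the true label is 1.
correct = sum(pred == true for pred, true in zip(y_pred, y_true))
accuracy = correct / len(y_true)

print(correct, len(y_true), accuracy)
```

Only the 43 samples of class 1 out of 450 are counted as correct, which gives roughly 0.0956 and matches the dask-ml score reported in the question.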

    There's a related open issue for this on GitHub at https://github.com/dask/dask-ml/issues/386
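Until multiclass support is available, one common way to get multiclass predictions out of a binary-only classifier is a one-vs-rest reduction: fit one binary model per class and pick the class whose model is most confident. The sketch below illustrates the idea with scikit-learn's LogisticRegression as a stand-in binary estimator. Whether dask-ml's estimator can be dropped into the same loop unchanged is an assumption that has not been verified here, and `max_iter=5000` is just a generous illustrative setting.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression  # stand-in binary estimator
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

classes = np.unique(y_train)
decision_columns = []
for k in classes:
    # One binary problem per digit: "is this sample class k or not?"
    binary_clf = LogisticRegression(max_iter=5000)
    binary_clf.fit(x_train, y_train == k)
    decision_columns.append(binary_clf.decision_function(x_test))

# For each sample, pick the class whose binary model gives the largest score.
decision_matrix = np.column_stack(decision_columns)
predictions = classes[np.argmax(decision_matrix, axis=1)]
accuracy = float(np.mean(predictions == y_test))
print(accuracy)
```

Each binary model only ever sees a two-class target, which is the case a binary-only logistic regression handles; the multiclass decision is made afterwards by comparing the per-class scores.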