python classification lightgbm

Why does LightGBM with 'objective': 'binary' not return binary values 0 and 1 when calling predict()?


I created a binary classification model with LightGBM:

import lightgbm as lgb

# Dataset
y_train = data_train['Label']
X_train = data_train.drop(['Label'], axis=1)

y_test = data_test['Label']
X_test = data_test.drop(['Label'], axis=1)

train_data = lgb.Dataset(data=X_train, label=y_train)
test_data = lgb.Dataset(data=X_test, label=y_test)

#Setting default parameters
params_wo_constraints = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': {'binary_logloss', 'auc'},
    'num_leaves': 32,
    'max_depth': 5,
    'min_data_in_leaf': 100,
    'seed': 42,
    'bagging_seed': 42,
    'feature_fraction_seed': 42,
    'drop_seed': 42,
    'data_random_seed': 42
    }

#Model training
evals_result = {}
model_wo_constraints = lgb.train(
    params=params_wo_constraints,
    train_set=train_data,
    )
#Prediction
train_preds_wo_constraints = model_wo_constraints.predict(X_train)
test_preds_wo_constraints = model_wo_constraints.predict(X_test)

But the values in train_preds_wo_constraints are not 0 and 1:

>>> array([7.02862608e-02, 7.02498237e-01, 4.85224849e-01, ...,
       4.00079287e-04, 1.76385121e-01, 2.09733409e-01])

I have tried the sklearn API, and it returns 0 and 1 as expected:

model = lgb.LGBMClassifier(learning_rate=0.09,max_depth=5,random_state=42)
model.fit(X_train,y_train,eval_set=[(X_test,y_test),(X_train,y_train)],
          verbose=20,eval_metric='logloss')

preds_wo_constraints = model.predict(X_train)
preds_wo_constraints

>>> array([0, 1, 1, ..., 0, 0, 0])

Could anyone help explain why this happens and how to solve it?


Solution

  • train() in the LightGBM Python package produces a lightgbm.Booster object.

    For binary classification, lightgbm.Booster.predict() by default returns the predicted probability that the target is equal to 1.

    Consider the following minimal, reproducible example using lightgbm==3.3.2 and Python 3.8.12:

    import lightgbm as lgb
    from sklearn.datasets import make_blobs
    
    X, y = make_blobs(
        n_samples=1000,
        n_features=5,
        centers=2,
        random_state=708
    )
    params = {
        "objective": "binary",
        "min_data_in_leaf": 5,
        "min_data_in_bin": 5,
        "seed": 708
    }
    bst = lgb.train(
        params=params,
        train_set=lgb.Dataset(data=X, label=y),
        num_boost_round=5
    )
    
    preds = bst.predict(X)
    preds[:10]
    
    array([0.29794759, 0.70205241, 0.70205241, 0.70205241, 0.29794759,
           0.29794759, 0.29794759, 0.29794759, 0.70205241, 0.29794759])
    

    Those are the predicted probabilities that the value of the target is 1.

    In the scikit-learn interface from the lightgbm Python package, training produces an instance of lightgbm.LGBMClassifier.

    For binary classification, lightgbm.LGBMClassifier.predict() returns the predicted class.

    clf = lgb.LGBMClassifier(**params)
    clf.fit(X, y)
    preds_sklearn = clf.predict(X)
    preds_sklearn[:10]
    
    array([0, 1, 1, 1, 0, 0, 0, 0, 1, 0])
    

    explain why

    scikit-learn requires that classifiers produce predicted classes from their predict() methods.

    scikit-learn has very strict standards for custom estimators that are expected to be compatible with scikit-learn's features. These are described in "Developing scikit-learn estimators". The "Glossary of Common Terms and API Elements" linked from that guide says that the predict() method for scikit-learn estimators must produce predictions "in the same target space used in fitting", which for classification means "one of the values in the classifier’s classes_ attribute" (docs link).

    lightgbm.train() is a lower-level interface whose goal is to provide performant, flexible control over LightGBM. It produces a Booster and Booster.predict() produces probabilities to allow users' code to choose what it wants to do with those probabilities (e.g. convert them to classes with a custom threshold, use them as sample weights for some post-processing code).

    how to solve this problem?

    To convert predicted binary classification probabilities to predicted classes, compare those probabilities to a threshold.

    pred_class = (preds > 0.5).astype("int")
    pred_class[:10]
    
    array([0, 1, 1, 1, 0, 0, 0, 0, 1, 0])
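The threshold does not have to be 0.5. The sketch below wraps the comparison in a small helper, to_class, which is a hypothetical name introduced here for illustration and not part of LightGBM. It reuses the ten probabilities printed earlier, so it runs without training a model.

```python
import numpy as np


def to_class(probs, threshold=0.5):
    """Hypothetical helper: map probabilities of class 1 to 0/1 labels."""
    return (np.asarray(probs) > threshold).astype("int")


# The first ten predicted probabilities from the example above.
preds = np.array([
    0.29794759, 0.70205241, 0.70205241, 0.70205241, 0.29794759,
    0.29794759, 0.29794759, 0.29794759, 0.70205241, 0.29794759,
])

default_classes = to_class(preds)                # threshold of 0.5
strict_classes = to_class(preds, threshold=0.8)  # require more confidence for class 1

print(default_classes)  # [0 1 1 1 0 0 0 0 1 0]
print(strict_classes)   # [0 0 0 0 0 0 0 0 0 0]
```

A higher threshold trades recall on class 1 for precision, which is the kind of choice that returning probabilities from Booster.predict() leaves to the caller.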