I want to optimize the parameters of my multiclass classification LGBM model with RandomizedSearchCV, using a custom scoring function. This scoring function needs additional data that must not be used for training, but which is needed for calculating the score.
I have my features_train dataframe, which contains all the features to be used for training plus the additional data needed for calculating the score, and my target_train series.
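For concreteness, suppose the extra columns hold one score per class and are listed in a variable like the following (column names are hypothetical):

scoring_info_cols = ["score_class_0", "score_class_1", "score_class_2"]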
I define

import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    lgb.LGBMClassifier(),
    param_distributions=param_dist,
    cv=5,
    scoring=get_scoring_function(scoring_info_cols),
    n_iter=100,
    random_state=41,
    n_jobs=30,
    verbose=0,
)
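Here param_dist is an ordinary LightGBM search space; its exact contents don't matter for this question, but think of something like:

param_dist = {
    "num_leaves": [31, 63, 127],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
}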
get_scoring_function is defined as:
from typing import List

import lightgbm as lgb
from sklearn.metrics import make_scorer

def get_scoring_function(scoring_info_cols):
    def lgbm_scorer(clf, X, y):
        # Split off the scoring-info columns so they are not fed to the model.
        scoring_info: List[List[float]] = X[scoring_info_cols].values.tolist()
        X = X.drop(columns=scoring_info_cols)
        custom_metric = get_custom_metric(scoring_info=scoring_info)
        dataset = lgb.Dataset(X, label=y)
        preds = clf.predict_proba(X)  # (n_samples, n_classes), as the metric's argmax expects
        return custom_metric(preds, dataset)[1]
    return make_scorer(lgbm_scorer, greater_is_better=True)
Where get_custom_metric is defined as:
from typing import Callable, List, Tuple

import lightgbm as lgb
import numpy as np

def get_custom_metric(scoring_info: List[List[float]]) -> Callable:
    def my_metric(y_pred: np.ndarray, y_true: lgb.Dataset) -> Tuple[str, float, bool]:
        y_labels: np.ndarray = y_true.get_label()
        y_pred_classes = np.argmax(y_pred, axis=1)
        # Map the fold's rows back to the rows of the full scoring_info list.
        fold_indices = y_true.get_data().index
        these_scoring: List[List[float]] = [scoring_info[i] for i in fold_indices]
        all_scores: List[float] = [
            these_scoring[i][y_pred_classes[i]] - these_scoring[i][int(y_labels[i])]
            for i in range(len(y_labels))
        ]
        return "MyMetric", sum(all_scores), True
    return my_metric
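(Each row thus contributes the scoring-info value of the predicted class minus that of the true class, so a correct prediction contributes zero to the total.)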
When I run random_search.fit(features_train, target_train), I get the error:
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/queues.py", line 159, in _feed
obj_ = dumps(obj, reducers=reducers)
File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 215, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 208, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
return Pickler.dump(self, obj)
ValueError: ctypes objects containing pointers cannot be pickled
"""
This error is caused by the fact that lgbm_scorer is not picklable, and this is probably because lgbm_scorer is a complex nested function. Any idea how to fix this? I could simplify things by passing my additional scoring_info to my_metric directly, without defining the outer function get_custom_metric. Any idea how to do that, WITHOUT using the additional scoring_info as features for the model?
I'm not sure about the pickling error and whether it actually has to do with the custom metric functions. But I think passing the scoring_info columns into the scorer but not into the model itself is straightforward:
from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Drop the scoring-info columns before the data reaches the classifier.
dropper = ColumnTransformer(
    [("drop", "drop", scoring_info_cols)],
    remainder="passthrough",
)

model = Pipeline([
    ("drop_scoring_info", dropper),
    ("lgbm", LGBMClassifier()),
])

random_search = RandomizedSearchCV(
    model,
    ...,
)
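One caveat: with the classifier wrapped in a Pipeline step named "lgbm", the keys in param_distributions need that step's prefix, e.g.:

param_dist = {
    "lgbm__num_leaves": [31, 63, 127],
    "lgbm__learning_rate": [0.01, 0.05, 0.1],
}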
You will probably then want to not use the convenience function make_scorer, because that turns a metric with signature (y_test, y_pred) into a scorer with signature (estimator, X_test, y_test). Since you want to have access to the entire X_test, just define such a scorer directly.
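For example, here is a minimal sketch of such a scorer, assuming (as in the question) that scoring_info_cols is ordered so that column j holds the per-row score for class j, and that the target classes are the integers 0..n_classes-1:

import numpy as np

def lgbm_scorer(estimator, X_test, y_test):
    # The pipeline drops the scoring-info columns itself, so X_test can be
    # passed through unchanged; we only read those columns here.
    scoring_info = X_test[scoring_info_cols].to_numpy()  # scoring_info_cols as defined above
    y_pred = estimator.predict(X_test).astype(int)       # predicted class labels
    y_true = np.asarray(y_test, dtype=int)
    rows = np.arange(len(y_true))
    # Same quantity as the original custom metric: score of the predicted
    # class minus score of the true class, summed over the fold.
    return float(np.sum(scoring_info[rows, y_pred] - scoring_info[rows, y_true]))

Pass it as scoring=lgbm_scorer. Since this is a plain top-level function with no closure state, it should also be easier for joblib to pickle than the nested functions in the question.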