I want to optimize the parameters of my multiclass classification LGBM model with RandomizedSearchCV, using a custom scoring function. This scoring function needs additional data that must not be used for training, but which is needed for calculating the score.
I have my features_train dataframe, which contains all the features to be used for training plus the additional data needed for calculating the score, and my target_train series.
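For concreteness, suppose the extra columns hold one score per class and are listed in a variable like the following (column names are hypothetical):

scoring_info_cols = ["score_class_0", "score_class_1", "score_class_2"]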
I define

import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

random_search = RandomizedSearchCV(
    lgb.LGBMClassifier(),
    param_distributions=param_dist,
    cv=5,
    scoring=get_scoring_function(scoring_info_cols),
    n_iter=100,
    random_state=41,
    n_jobs=30,
    verbose=0,
)
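Here param_dist is an ordinary LightGBM search space; its exact contents don't matter for this question, but think of something like:

param_dist = {
    "num_leaves": [31, 63, 127],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
}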
get_scoring_function is defined as:
from typing import List

import lightgbm as lgb
from sklearn.metrics import make_scorer

def get_scoring_function(scoring_info_cols):
    def lgbm_scorer(clf, X, y):
        # Split off the scoring-info columns so they are not fed to the model.
        scoring_info: List[List[float]] = X[scoring_info_cols].values.tolist()
        X = X.drop(columns=scoring_info_cols)
        custom_metric = get_custom_metric(scoring_info=scoring_info)
        dataset = lgb.Dataset(X, label=y)
        preds = clf.predict_proba(X)  # (n_samples, n_classes), as the metric's argmax expects
        return custom_metric(preds, dataset)[1]
    return make_scorer(lgbm_scorer, greater_is_better=True)
Where get_custom_metric is defined as:
from typing import Callable, List, Tuple

import lightgbm as lgb
import numpy as np

def get_custom_metric(scoring_info: List[List[float]]) -> Callable:
    def my_metric(y_pred: np.ndarray, y_true: lgb.Dataset) -> Tuple[str, float, bool]:
        y_labels: np.ndarray = y_true.get_label()
        y_pred_classes = np.argmax(y_pred, axis=1)
        # Map the fold's rows back to the rows of the full scoring_info list.
        fold_indices = y_true.get_data().index
        these_scoring: List[List[float]] = [scoring_info[i] for i in fold_indices]
        all_scores: List[float] = [
            these_scoring[i][y_pred_classes[i]] - these_scoring[i][int(y_labels[i])]
            for i in range(len(y_labels))
        ]
        return "MyMetric", sum(all_scores), True
    return my_metric
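(Each row thus contributes the scoring-info value of the predicted class minus that of the true class, so a correct prediction contributes zero to the total.)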
When I run random_search.fit(features_train, target_train), I get the error:
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/queues.py", line 159, in _feed
obj_ = dumps(obj, reducers=reducers)
File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 215, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 208, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "~/anaconda3/lib/python3.9/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
return Pickler.dump(self, obj)
ValueError: ctypes objects containing pointers cannot be pickled
"""
This error is caused by the fact that lgbm_scorer is not picklable, and this is probably because lgbm_scorer is a complex nested function. Any idea how to fix this? I could simplify things by passing my additional scoring_info to my_metric directly, without defining the outer function get_custom_metric. Any idea how to do that, WITHOUT using the additional scoring_info as features for the model?
I'm not sure about the pickling error and whether it actually has to do with the custom metric functions. But I think passing the scoring_info columns into the scorer but not into the model itself is straightforward:
from lightgbm import LGBMClassifier
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Drop the scoring-info columns before the data reaches the classifier.
dropper = ColumnTransformer(
    [("drop", "drop", scoring_info_cols)],
    remainder="passthrough",
)

model = Pipeline([
    ("drop_scoring_info", dropper),
    ("lgbm", LGBMClassifier()),
])

random_search = RandomizedSearchCV(
    model,
    ...,
)
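One caveat: with the classifier wrapped in a Pipeline step named "lgbm", the keys in param_distributions need that step's prefix, e.g.:

param_dist = {
    "lgbm__num_leaves": [31, 63, 127],
    "lgbm__learning_rate": [0.01, 0.05, 0.1],
}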
You will probably then want to not use the convenience function make_scorer, because that turns a metric with signature (y_test, y_pred) into a scorer with signature (estimator, X_test, y_test). Since you want to have access to the entire X_test, just define such a scorer directly.
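For example, here is a minimal sketch of such a scorer, assuming (as in the question) that scoring_info_cols is ordered so that column j holds the per-row score for class j, and that the target classes are the integers 0..n_classes-1:

import numpy as np

def lgbm_scorer(estimator, X_test, y_test):
    # The pipeline drops the scoring-info columns itself, so X_test can be
    # passed through unchanged; we only read those columns here.
    scoring_info = X_test[scoring_info_cols].to_numpy()  # scoring_info_cols as defined above
    y_pred = estimator.predict(X_test).astype(int)       # predicted class labels
    y_true = np.asarray(y_test, dtype=int)
    rows = np.arange(len(y_true))
    # Same quantity as the original custom metric: score of the predicted
    # class minus score of the true class, summed over the fold.
    return float(np.sum(scoring_info[rows, y_pred] - scoring_info[rows, y_true]))

Pass it as scoring=lgbm_scorer. Since this is a plain top-level function with no closure state, it should also be easier for joblib to pickle than the nested functions in the question.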