I am trying to train a RandomForestClassifier with a custom scorer whose output needs to depend on one of the features.
The X dataset contains 18 features:
The y is the usual array of 0s and 1s:
The RandomForestClassifier with custom scorer is used within a GridSearchCV instance: GridSearchCV(classifier, param_grid=[...], scoring=custom_scorer).
The custom scorer is defined via scikit-learn's make_scorer function: custom_scorer = make_scorer(custom_scorer_function, greater_is_better=True).
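For reference, the setup described above can be sketched as follows. The synthetic data from make_classification, the small param_grid, and the placeholder score (plain fraction of correct predictions) are assumptions for illustration only; they are not the real data or the real scoring rule.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV


def custom_scorer_function(y_true, y_pred):
    # Placeholder score (assumption): fraction of correct predictions.
    return float((y_true == y_pred).mean())


custom_scorer = make_scorer(custom_scorer_function, greater_is_better=True)

# Synthetic stand-in for the real data: 18 features, binary labels.
X, y = make_classification(n_samples=120, n_features=18, random_state=0)

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
grid = GridSearchCV(
    classifier,
    param_grid=[{"max_depth": [2, 4]}],  # hypothetical grid
    scoring=custom_scorer,
    cv=3,
)
grid.fit(X, y)
best_score = grid.best_score_
```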
This framework is very straightforward if the custom_scorer_function depends only on y_true and y_pred. However, in my case I need to define a scorer that makes use of one of the 18 features contained in the X dataset, i.e. depending on the values of y_pred and y_true, the custom score will be a combination of them and the feature.
My question is: how can I pass the feature into the custom_scorer_function, given that its standard signature accepts y_true and y_pred?
I am aware it accepts extra **kwargs, but passing the entire feature array in this way doesn't solve the problem, as this function is invoked for each pair of y_true and y_pred values (I would need to extract the individual feature values corresponding to them to make this work, which I am not sure can be done).
I have tried to augment the y_true array by packing that feature into it and unpacking it within the custom_scorer_function (the 1st column holds the actual labels, the 2nd column holds the feature values I need to calculate the custom scores):
However, doing so violates the classifier's requirement of a 1D label array and triggers the following error:
ValueError: Unknown label type: 'continuous-multioutput'
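The label type named in this error can be reproduced in isolation with scikit-learn's type_of_target helper. The sketch below uses a small made-up array (not the real data) with labels in the first column and a non-integer feature in the second; it only illustrates why the classifier refuses such a y.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Made-up values for illustration: column 0 = labels, column 1 = feature.
labels = np.array([0, 1, 0, 1])
feature = np.array([0.37, 1.52, 2.25, 0.91])
y_packed = np.column_stack([labels, feature])

# A plain 1D array of 0s and 1s is recognised as a binary target.
plain_type = type_of_target(labels)

# The packed 2D float array is classified as a multi-column continuous
# target, which classifiers do not accept as labels.
packed_type = type_of_target(y_packed)
```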
Any help is much appreciated.
Thank you.
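You can do something like this (note that you have given no real code, so this is barebones):
from sklearn.metrics import make_scorer

X = [...]  # feature matrix, assumed to be a NumPy array
y = [...]

def custom_scorer_function(y, y_pred, **kwargs):
    # X comes from the enclosing scope. This is the full column, so under
    # cross-validation it must still be restricted to the rows of the fold.
    a_feature = X[:, 1]
    # now have y, y_pred and the feature you want

custom_scorer = make_scorer(custom_scorer_function, greater_is_better=True)
...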
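An alternative that avoids reading a global X is to skip make_scorer and pass a callable with the signature scorer(estimator, X, y) as the scoring argument, which scikit-learn documents as a supported form. GridSearchCV then hands the scorer the X and y of each validation fold, so the feature rows line up with the predictions. The following is only a sketch under assumptions: FEATURE_COL, the synthetic data, and the weighting rule (accuracy weighted by the absolute value of the feature) are hypothetical stand-ins for the real scoring logic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

FEATURE_COL = 1  # hypothetical index of the feature used by the score


def feature_weighted_scorer(estimator, X, y):
    # X and y are the rows of the current validation fold only.
    X = np.asarray(X)
    y = np.asarray(y)
    y_pred = estimator.predict(X)
    # Hypothetical rule: weight each correct prediction by |feature|.
    weights = np.abs(X[:, FEATURE_COL])
    total = weights.sum()
    if total == 0:
        return 0.0
    return float((weights * (y_pred == y)).sum() / total)


# Synthetic stand-in for the real data: 18 features, binary labels.
X, y = make_classification(n_samples=200, n_features=18, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4]},  # hypothetical grid
    scoring=feature_weighted_scorer,
    cv=3,
)
search.fit(X, y)
best_score = search.best_score_
```

Because the scorer returns a single float where higher is better, no greater_is_better flag is needed; a loss would have to be negated inside the function.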