I am trying to run GridSearchCV with the LogisticRegression estimator and record the model's accuracy, precision, recall, and F1 metrics.
However, I get the following error on the precision metric:
Precision is ill-defined and being set to 0.0 due to no predicted samples.
Use `zero_division` parameter to control this behavior
I understand why I am getting the error: there are no predictions with output value equal to 1 in the K-fold split. However, I don't understand how I can specifically set zero_division to 1 in GridSearchCV (the logistic_reg variable).
Original code
logistic_reg = GridSearchCV(
    estimator=LogisticRegression(penalty="l1", random_state=42, max_iter=10000),
    param_grid={
        "C": [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1, 5, 10, 20],
        "solver": ["liblinear", "saga"],
    },
    scoring=["accuracy", "precision", "recall", "f1"],
    cv=StratifiedKFold(n_splits=10),
    refit="accuracy",
)
logistic_reg_X_train = self.X_train.copy()
logistic_reg_X_train.drop(self.columns_removed, axis=1, inplace=True)
logistic_reg.fit(logistic_reg_X_train, self.y_train)
logistic_reg_results = pd.DataFrame(logistic_reg.cv_results_)
I tried changing "precision" to precision_score(zero_division=1), but this gives me another error (missing 2 required positional arguments: 'y_true' and 'y_pred'). Again, I understand this, but the two missing arguments are not defined before the fit method is applied.
How can I specify the zero_division parameter for the precision score metric?
Edit
What I don't understand is that I stratified the y data in my train_test_split call and used StratifiedKFold in the GridSearchCV. My understanding is that the train/test data will have the same proportion of y values, and that the same should happen during cross-validation. This means that in the GridSearchCV splits, the data should have y values of both 0 and 1, and thus precision cannot equal 0 (the model will be able to calculate TP and FP, as the test fold contains samples where y is equal to 1). I'm not sure where to go from here.
From reading further into this issue, my understanding is that the error is occurring because not all the labels in my y_test are appearing in my y_pred. This is not the case for my data.
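The warning depends on the predictions rather than on the true labels: precision is TP / (TP + FP), so it is undefined whenever a fitted model predicts no positives at all, even if the test fold contains both classes. A minimal sketch with made-up labels (the arrays are illustrative, not taken from the data in the question):

```python
from sklearn.metrics import precision_score

# Hypothetical fold: the true labels contain both classes,
# but the model predicted class 0 for every sample.
y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0]

# TP + FP == 0, so precision is undefined; zero_division sets the value reported instead.
precision_as_zero = precision_score(y_true, y_pred, zero_division=0)
precision_as_one = precision_score(y_true, y_pred, zero_division=1)

print(precision_as_zero, precision_as_one)
```

So stratification guarantees that both classes are in y_true, but it says nothing about what ends up in y_pred.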
I used the comment from G.Anderson to remove the warning (but it doesn't answer my question):
Created a new custom scorer object (precision_scorer)
Created a custom_scoring dictionary
Updated GridSearchCV scoring and refit parameters
from sklearn.metrics import precision_score, make_scorer
precision_scorer = make_scorer(precision_score, zero_division=0)
custom_scoring = {"accuracy": "accuracy", "precision": precision_scorer, "recall": "recall", "f1": "f1"}
logistic_reg = GridSearchCV(
    estimator=LogisticRegression(penalty="l1", random_state=42, max_iter=10000),
    param_grid={
        "C": [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20],
        "solver": ["liblinear", "saga"],
    },
    scoring=custom_scoring,
    cv=StratifiedKFold(n_splits=10),
    refit="accuracy",
)
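To see which candidates hit the zero-precision case, the per-candidate precision can be read from cv_results_. The following is only a sketch: make_classification stands in for the real training set and the grid is shortened, both assumptions made for illustration. With an L1 penalty, a very small C can shrink every coefficient to zero, so such a candidate may predict a single class on every fold, while larger values of C do not.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the real training data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

custom_scoring = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, zero_division=0),
}

search = GridSearchCV(
    estimator=LogisticRegression(penalty="l1", solver="liblinear", random_state=42),
    param_grid={"C": [1e-4, 1e-2, 1]},  # shortened grid, smallest C first
    scoring=custom_scoring,
    cv=StratifiedKFold(n_splits=5),
    refit="accuracy",
)
search.fit(X, y)

# One row per candidate; compare precision across the values of C.
results = pd.DataFrame(search.cv_results_)
print(results[["param_C", "mean_test_precision", "mean_test_accuracy"]])
```

If the smallest values of C show a precision of 0 here, that would suggest the warning comes from those heavily regularized candidates rather than from the folds themselves.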
Edit - Answer to Question Above
I used GridSearchCV to find the best hyperparameters for the model. To view the model metrics for each split, I created a StratifiedKFold splitter, fit a LogisticRegression with the best hyperparameters on each split, and ran the cross-validation on its own. This gave me no precision warning messages. I have no idea why GridSearchCV is giving me a warning, but at least this way works!
Note: I get the same results from the method below and GridSearchCV in the question above.
skf = StratifiedKFold(n_splits=10)
logistic_reg_class_skf = LogisticRegression(penalty="l1", max_iter=10000, random_state=42, C=5, solver="liblinear")
logistic_reg_class_score = []
for train, test in skf.split(logistic_reg_class_X_train, self.y_train):
    logistic_reg_class_skf_X_train = logistic_reg_class_X_train.iloc[train]
    logistic_reg_class_skf_X_test = logistic_reg_class_X_train.iloc[test]
    logistic_reg_class_skf_y_train = self.y_train.iloc[train]
    logistic_reg_class_skf_y_test = self.y_train.iloc[test]
    logistic_reg_class_skf.fit(logistic_reg_class_skf_X_train, logistic_reg_class_skf_y_train)
    logistic_reg_skf_y_pred = logistic_reg_class_skf.predict(logistic_reg_class_skf_X_test)
    skf_accuracy_score = metrics.accuracy_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    skf_precision_score = metrics.precision_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    skf_recall_score = metrics.recall_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    skf_f1_score = metrics.f1_score(logistic_reg_class_skf_y_test, logistic_reg_skf_y_pred)
    logistic_reg_class_score.append([skf_accuracy_score, skf_precision_score, skf_recall_score, skf_f1_score])
classification_results = pd.DataFrame({"Algorithm": ["Logistic Reg Train"], "Accuracy": [0.0], "Precision": [0.0],
                                       "Recall": [0.0], "F1 Score": [0.0]})
# One row per fold, in the order accuracy, precision, recall, F1
for i in range(0, 10):
    classification_results.loc[i] = ["Logistic Reg Train", logistic_reg_class_score[i][0], logistic_reg_class_score[i][1],
                                     logistic_reg_class_score[i][2], logistic_reg_class_score[i][3]]
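The manual loop above could also be written with cross_validate, which accepts a scoring dictionary and returns one score per split. This is a sketch under assumptions: synthetic data replaces logistic_reg_class_X_train and self.y_train, and the hyperparameters C=5 and solver="liblinear" are taken from the loop.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real training data (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

model = LogisticRegression(penalty="l1", max_iter=10000, random_state=42, C=5, solver="liblinear")
scoring = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, zero_division=0),
    "recall": "recall",
    "f1": "f1",
}

# Each "test_<name>" entry is an array with one score per split.
cv_scores = cross_validate(model, X, y, cv=StratifiedKFold(n_splits=10), scoring=scoring)

per_split = pd.DataFrame({
    "Algorithm": "Logistic Reg Train",
    "Accuracy": cv_scores["test_accuracy"],
    "Precision": cv_scores["test_precision"],
    "Recall": cv_scores["test_recall"],
    "F1 Score": cv_scores["test_f1"],
})
print(per_split)
```

This keeps the per-split table while evaluating a single hyperparameter setting, so only that setting can trigger the precision warning.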