Tags: python, machine-learning, scikit-learn, decision-tree, gridsearchcv

GridSearchCV Machine learning


I use GridSearchCV to find the best hyperparameters for this decision tree (and k-fold cross-validation to evaluate the model's performance). Please look at the line that prints "The best results:" in the code and in the output.

Why doesn't it give me any information about the criterion (e.g. whether to use entropy or Gini)?

When I ran a test with some other code I wrote, it did report a criterion, but the recommendation did not seem correct. According to GridSearchCV, entropy was better for this model, while in my manual test Gini gave better accuracy and recall. (Entropy was better for precision, but the results should be based on accuracy, as specified in the code.) Likewise, for the maximum depth it recommended 7, while in practice 9 or more gave better results.

import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
from matplotlib import pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
column_names = ['file_path', '50', '100', '250', '500', '1000', 'r50', 'r100', 'r250', 'r500', 'r1000', 'rfile', 'class2']
df = pd.read_csv("C:/Folder/deftxt - copy.csv", sep = ';', header = 0, names = column_names)
    
x = df.drop(['class2', 'file_path'], axis=1)
df['class2'] = df['class2'].astype(int)
y = df['class2'].values
    
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, shuffle = True, random_state = 100)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
    
model = DecisionTreeClassifier(random_state=100)
model.fit(x_train, y_train)
model.get_params()
    
k_fold_acc = cross_val_score(model, x_train, y_train, cv=10)
k_fold_mean = k_fold_acc.mean()
for i in k_fold_acc:
    print(i)
print("accuracy K Fold CV:" + str(k_fold_mean))
    
param_dist={
    "criterion":["gini", "entropy"],
    "max_depth":[1,2,3,4,5,6,7, None],
    "min_samples_split":[2,3,4,5],
}
grid = GridSearchCV(model, param_grid=param_dist, cv=10, n_jobs=-1, scoring='accuracy', verbose=1)
grid.fit(x_train, y_train)
    
print("The best results:" + str(grid.best_estimator_))
    
fn = ['50', '100', '250', '500', '1000', '-50', '-100', '-250', '-500', '-1000', 'total']
cn = ['ClassA', 'ClassB']
    
grid_predictions = grid.predict(x_test)
print(classification_report(y_test, grid_predictions))

Output:

(1369, 11) (587, 11) (1369,) (587,)
0.9927007299270073
0.9927007299270073
0.9781021897810219
0.9927007299270073
0.9927007299270073
0.9854014598540146
0.9854014598540146
0.9927007299270073
0.9781021897810219
0.9779411764705882
accuracy K Fold CV:0.9868452125375698
Fitting 10 folds for each of 64 candidates, totalling 640 fits
The best results:DecisionTreeClassifier(max_depth=7, random_state=100)
                precision    recall  f1-score   support
    
            0       0.98      0.97      0.97       174
            1       0.99      0.99      0.99       413
    
    accuracy                           0.98       587
    macro avg       0.98      0.98      0.98       587
weighted avg       0.98      0.98      0.98       587
    
    
Process finished with exit code 0

Solution

  • Why doesn't it give me any information about the criterion (e.g. whether to use entropy or Gini)?

    When you convert a scikit-learn model to a string, it only shows the parameters that differ from their defaults.

    Example:

    from sklearn.tree import DecisionTreeClassifier
    print(str(DecisionTreeClassifier(max_depth=7, random_state=100, criterion="entropy")))
    print(str(DecisionTreeClassifier(max_depth=7, random_state=100, criterion="gini")))
    

    This prints:

    DecisionTreeClassifier(criterion='entropy', max_depth=7, random_state=100)
    DecisionTreeClassifier(max_depth=7, random_state=100)
    

    The parameter criterion="gini" is not printed, because it is the default.

    To see all parameters, call get_params() on the estimator and print the result:

    print(str(DecisionTreeClassifier(max_depth=7, random_state=100, criterion="gini").get_params()))
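For a fitted GridSearchCV, the selected values can also be read directly instead of being inferred from the estimator's string form. The following is a minimal sketch, not the asker's exact setup: the synthetic data from make_classification and the names X, y and param_grid are assumptions standing in for the CSV in the question, and cv=5 is used only to keep the run short. It reads best_params_ and best_score_, and uses config_context(print_changed_only=False) to print the estimator with every parameter.

```python
from sklearn import config_context
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the question loads its own CSV).
X, y = make_classification(n_samples=300, n_features=11, random_state=100)

# Same grid as in the question.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 2, 3, 4, 5, 6, 7, None],
    "min_samples_split": [2, 3, 4, 5],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=100),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)

# best_params_ contains every searched parameter,
# including values that happen to equal the default.
print("Best parameters:", grid.best_params_)
print("Best mean CV accuracy:", grid.best_score_)

# The refitted estimator reports the criterion through get_params() ...
print("Criterion in use:", grid.best_estimator_.get_params()["criterion"])

# ... and its string form lists all parameters when print_changed_only is off.
with config_context(print_changed_only=False):
    full_repr = repr(grid.best_estimator_)
print(full_repr)
```

Note that the search only evaluates the values listed in the grid, so with the grid shown in the question a depth of 9 is never a candidate. Also, best_score_ is the mean cross-validated score on the training data, which can rank settings differently from a single manual run on one test split.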