python machine-learning classification decision-tree

How does persisting the model increase accuracy?


import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

whitewine_data = pd.read_csv('winequality-white.csv', delimiter=';')

# Input features used to predict wine quality
variables = ['alcohol_cat', 'alcohol', 'sulphates', 'density',
             'total sulfur dioxide', 'citric acid', 'volatile acidity',
             'chlorides']

X = whitewine_data[variables]
y = whitewine_data['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Evaluate on the held-out 20% test split
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')

# Predict a single manually entered sample
predictions = model.predict([[0.27, 0.36, 0.045, 170, 1.001,
                              0.45, 8.9, 0]])
print(f'Predicted Output: {predictions}')
print(f'Accuracy: {accuracy * 100}%')
print(f'F1 Score: {f1 * 100}%')

This initial model resulted in an accuracy score of 57%.
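
(Note that train_test_split draws a different random split on each run, so this figure can move a few points from run to run. If it helps, the split and the tree can be pinned with random_state, as in the short sketch below; 42 is just an arbitrary seed and X, y are the same as in the script above.)

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reproducible variant of the split and model above (seed value is arbitrary)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)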

==============================================================

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

whitewine_data = pd.read_csv('winequality-white.csv', delimiter=';')

# Variables to be dropped from the data set - NOT THE INPUT VARIABLES
variables = ['fixed acidity', 'residual sugar', 'free sulfur dioxide',
             'pH', 'quality', 'isSweet']

X = whitewine_data.drop(variables, axis=1)
y = whitewine_data['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Persist the trained model to disk
joblib.dump(model, 'WhiteWine_Quality_Predictor.joblib')

Creating the Saved model
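
As an aside, the same persistence step can also be done with Python's standard-library pickle module; joblib is generally preferred for scikit-learn models because it handles large NumPy arrays efficiently. A minimal sketch, assuming model is the fitted classifier from the script above and the .pkl file name is made up for illustration:

import pickle

# Save the fitted model with pickle instead of joblib
with open('WhiteWine_Quality_Predictor.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load it back later
with open('WhiteWine_Quality_Predictor.pkl', 'rb') as f:
    model = pickle.load(f)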

==============================================================

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

whitewine_data = pd.read_csv('winequality-white.csv', delimiter=';')

variables = ['volatile acidity', 'citric acid', 'chlorides',
             'total sulfur dioxide', 'density', 'sulphates', 'alcohol',
             'alcohol_cat']

X_test = whitewine_data[variables]
y_test = whitewine_data['quality']

# Load the saved model and score it
model = joblib.load('WhiteWine_Quality_Predictor.joblib')

y_pred = model.predict(X_test)

f1 = f1_score(y_test, y_pred, average='weighted')
accuracy = accuracy_score(y_test, y_pred)
predictions = model.predict([[0.27, 0.36, 0.045, 170, 1.001,
                              0.45, 10.9, 3]])

print(f'F1 Score: {f1 * 100}%')
print(f'Model Accuracy: {accuracy * 100}%')
print(f'Predicted Output: {predictions}')

Calling the saved model now resulted in 92% accuracy.

Question: How does calling a saved model result in the increase in accuracy that I saw?


Solution

  • That's quite a common mistake when first starting out with ML algorithms.

    In your second script, you are training the algorithm on the winequality-white.csv dataset, and then you are saving it. That's totally fine.

    The problem is that in your third script, you are using the algorithm on the exact same dataset you used for training. You are basically predicting observations that the model was trained on, so it's no surprise that the algorithm predicts them with close to 100% accuracy.

    The approach of storing the model is correct, but you then have to make predictions and measure accuracy on data the model has never seen, not on the same data you used for training; see the sketch below.
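
    A minimal sketch of that corrected workflow, assuming the same CSV and engineered columns as in the question; the hold-out file name WhiteWine_holdout.joblib is made up for illustration:

    import joblib
    import pandas as pd
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    whitewine_data = pd.read_csv('winequality-white.csv', delimiter=';')

    drop_cols = ['fixed acidity', 'residual sugar', 'free sulfur dioxide',
                 'pH', 'quality', 'isSweet']
    X = whitewine_data.drop(drop_cols, axis=1)
    y = whitewine_data['quality']

    # Split once, train on the training part only, and set the test part
    # aside so it stays genuinely unseen.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    # Save the model and the untouched hold-out split together
    joblib.dump(model, 'WhiteWine_Quality_Predictor.joblib')
    joblib.dump((X_test, y_test), 'WhiteWine_holdout.joblib')

    # --- later, in a separate script ---
    model = joblib.load('WhiteWine_Quality_Predictor.joblib')
    X_test, y_test = joblib.load('WhiteWine_holdout.joblib')

    y_pred = model.predict(X_test)
    print(f'Accuracy: {accuracy_score(y_test, y_pred) * 100:.1f}%')
    print(f'F1 Score: {f1_score(y_test, y_pred, average="weighted") * 100:.1f}%')

    Because the reloaded model is now scored only on rows it never saw during training, the reported accuracy should land back around the level of the first script rather than the inflated 92%.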