python machine-learning scikit-learn naive-bayes k-fold

Gaussian Naive Bayes gives weird results


This is a basic implementation of Gaussian Naive Bayes using sklearn. Can anyone tell me what I'm doing wrong here? My k-fold CV results look a bit weird:

import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score, precision_score, classification_report

# Load the data; only 'fh' is used as a feature, 'class2' is the label.
column_names = ['AS', 'fh', 'class2']
df = pd.read_csv("C:/Users/Jans/Music/docx/222/test.csv", sep=';', header=0, names=column_names)

x = df.drop(['AS', 'class2'], axis=1)
df['class2'] = df['class2'].astype(int)
y = df['class2'].values

# 80/20 split, without shuffling.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=False, random_state=None)

model = GaussianNB()
model.fit(x_train, y_train)

# 10-fold cross-validation on the training set.
k_fold_acc = cross_val_score(model, x_train, y_train, cv=10)
k_fold_mean = k_fold_acc.mean()
for i in k_fold_acc:
    print(i)
print("accuracy K Fold CV:" + str(k_fold_mean))

grid_predictions = model.predict(x_test)

My 10-fold CV results (the first fold in particular looks very strange):

0.36714285714285716
0.8271428571428572
0.9785714285714285
0.9357142857142857
0.9628571428571429
0.9957081545064378
1.0
1.0
0.994277539341917
0.9842632331902719
accuracy K Fold CV:0.90456774984672

Also, when I increase the test set size from, say, 0.2 to 0.6, I get the results below, which is also a bit strange.

Am I doing something wrong? And if so, what?

1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
accuracy K Fold CV:1.0

Solution

  • Regarding the second problem: increasing the test set size to 0.6 shrinks the training set, which makes it easier for the model to memorise the training data (overfitting). I think that's what you're seeing: the model has overfit and reaches perfect accuracy. To reduce its tendency to overfit, add more training data or regularise the model, for example by setting the priors= parameter of GaussianNB - see the first sketch below.

    I'm not sure about the first problem - it might just be 'sampling noise', where the first fold happened to be a lot harder. With a small dataset there is more sampling-related variability across the folds: in 10-fold CV each validation fold is only 10% of the training data, and if the training set is small to begin with, 10% of that is smaller still. To get repeatable (and shuffled) splits, pass an explicit KFold(n_splits=10, shuffle=True, random_state=0) as the cv= argument; that will let you dig deeper into fold 0 if needed - see the second sketch below.
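
A minimal sketch of the regularisation idea, on synthetic stand-in data (the shapes and the 50/50 priors are assumptions for illustration): fixing priors= stops GaussianNB from estimating the class priors from a small training set, and raising var_smoothing (default 1e-9) applies stronger smoothing to the per-class variances.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a single-feature dataset like the 'fh' column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Option 1: fix the class priors instead of estimating them from the data.
model_priors = GaussianNB(priors=[0.5, 0.5])

# Option 2: increase var_smoothing for stronger variance smoothing.
model_smooth = GaussianNB(var_smoothing=1e-2)

for name, m in [("fixed priors", model_priors), ("var_smoothing", model_smooth)]:
    scores = cross_val_score(m, X, y, cv=10)
    print(name, round(scores.mean(), 3))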
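
And a minimal sketch of how you could dig into the odd first fold, again on stand-in arrays (swap in your own x_train / y_train): an explicit, shuffled KFold makes the splits reproducible, and printing the class counts of each validation fold shows whether fold 0 happens to be unbalanced.

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import KFold, cross_val_score

# Stand-ins for the question's x_train / y_train (shapes assumed for illustration).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(700, 1))
y_train = (x_train[:, 0] > 0).astype(int)

cv = KFold(n_splits=10, shuffle=True, random_state=0)  # repeatable, shuffled splits
scores = cross_val_score(GaussianNB(), x_train, y_train, cv=cv)

# Per-fold accuracy next to the class balance of each validation fold.
for i, (_, val_idx) in enumerate(cv.split(x_train, y_train)):
    print(f"fold {i}: acc={scores[i]:.3f}, class counts={np.bincount(y_train[val_idx])}")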