python tensorflow machine-learning scikit-learn sql-injection

Identical metric results for different machine learning algorithms


When training different machine learning algorithms on the same dataset, I get identical values for metrics such as accuracy and F1 score.

I trained my chosen algorithms on a dataset that I found on the Kaggle website: source.

Here are code snippets from the Jupyter notebook, along with the results of their execution:

Imported libraries

IN:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
import joblib
import tensorflow as tf
import numpy as np
from tensorflow.keras import models, layers
import warnings

warnings.filterwarnings('ignore')

Loading dataset

IN:

df = pd.read_csv("payload_mini.csv",encoding='utf-16')
df.head(10)

Load, process and split the data for further training of the classification model

IN:

df = pd.read_csv("payload_mini.csv",encoding='utf-16')

df = df[(df['attack_type'] == 'sqli') | (df['attack_type'] == 'norm')]

X = df['payload']
y = df['label']

vectorizer = CountVectorizer(min_df = 2, max_df = 0.8, stop_words = stopwords.words('english'))
X = vectorizer.fit_transform(X.values.astype('U')).toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

OUT:

(8040, 1585)
(8040,)
(2011, 1585)
(2011,)

Naive Bayes Classifier

IN:

nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
y_pred = nb_clf.predict(X_test)
print(f"Accuracy of Naive Bayes on test set : {accuracy_score(y_pred, y_test)}")
print(f"F1 Score of Naive Bayes on test set : {f1_score(y_pred, y_test, pos_label='anom')}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

OUT:

Accuracy of Naive Bayes on test set : 0.9806066633515664
F1 Score of Naive Bayes on test set : 0.9735234215885948

Classification Report:
              precision    recall  f1-score   support

        anom       0.97      0.98      0.97       732
        norm       0.99      0.98      0.98      1279

    accuracy                           0.98      2011
   macro avg       0.98      0.98      0.98      2011
weighted avg       0.98      0.98      0.98      2011

Random forest algorithm:

IN:

rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)
print(f"Accuracy of Random Forest on test set : {accuracy_score(y_pred, y_test)}")
print(f"F1 Score of Random Forest on test set : {f1_score(y_pred, y_test, pos_label='anom')}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))

OUT:

Accuracy of Random Forest on test set : 0.9806066633515664
F1 Score of Random Forest on test set : 0.9735234215885948

Classification Report:
              precision    recall  f1-score   support

        anom       1.00      0.96      0.98       732
        norm       0.98      1.00      0.99      1279

    accuracy                           0.99      2011
   macro avg       0.99      0.98      0.99      2011
weighted avg       0.99      0.99      0.99      2011

Support vector machine

IN:

svm_clf = SVC(gamma = 'auto')
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
print(f"Accuracy of SVM on test set : {accuracy_score(y_pred, y_test)}")
print(f"F1 Score of SVM on test set: {f1_score(y_pred, y_test, pos_label='anom')}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

OUT:

Accuracy of SVM on test set : 0.9189457981103928
F1 Score of SVM on test set: 0.8658436213991769

Classification Report:
              precision    recall  f1-score   support

        anom       1.00      0.76      0.87       689
        norm       0.89      1.00      0.94      1322

    accuracy                           0.92      2011
   macro avg       0.95      0.88      0.90      2011
weighted avg       0.93      0.92      0.92      2011

As you can see, the random forest and the naive Bayes classifier report exactly the same accuracy and F1 score, even though they are different algorithms and their classification reports differ. I hope you can help me find a possible bug in the code or improve it in some way.


Solution

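  • In your Random Forest code, you store the predictions in y_pred_rf but compute accuracy_score and f1_score on y_pred, which still holds the Naive Bayes predictions. That is why those two numbers are identical for both models, while the classification report (which does use y_pred_rf) differs. Pass y_pred_rf to both metric calls instead.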
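
One way to avoid this kind of stale-variable mistake is to fit, predict and score inside a single helper, so each model's predictions live in a local variable. The following is only a minimal sketch: the `evaluate` helper is a hypothetical name, the data is a synthetic stand-in generated with `make_classification` (not your payload_mini.csv), and the labels are 0/1 rather than 'anom'/'norm'. It also passes the arguments in the documented `(y_true, y_pred)` order and fixes `random_state`, which is my own assumption so that both models are scored on the same split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: replaces the vectorized payloads).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def evaluate(name, clf):
    """Fit clf, predict on the test set, and score those same predictions."""
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)  # local variable, cannot leak between models
    acc = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
    f1 = f1_score(y_test, y_pred, pos_label=1)
    print(f"{name}: accuracy={acc:.4f}, F1={f1:.4f}")
    return y_pred, acc, f1


nb_pred, nb_acc, nb_f1 = evaluate("Naive Bayes", GaussianNB())
rf_pred, rf_acc, rf_f1 = evaluate(
    "Random Forest", RandomForestClassifier(random_state=0)
)
```

In your notebook the smallest fix is simply to replace y_pred with y_pred_rf in the two Random Forest print statements; with your string labels you would keep pos_label='anom'.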