I'm trying to test this implementation of a voting ensemble classifier.
My data set consists of 650 triplets (G1, G2, G3), where G1 and G2 are integers in [1, 20] and G3 is either 1 or 0, derived from G1 and G2.
From what I've read, cross_val_score splits the input data into training and test sets by itself, but I'm doing the X, y initialization wrong. If I initialize X with the whole data set, the accuracy is 100%, which seems off.
I've tried putting only the G3 value in y, but I got the same result.
Normally I split the data into training and testing sets myself, which makes things easier.
I don't have much experience with Python or machine learning, but I decided to give it a try.
Could you please explain what the X and y initialization should look like for this to work properly?
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
input_file = "student_data_grades_only.csv"
data = pd.read_csv(input_file, header=0)
X, y = data, data['G3']
print(X,y)
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()
eclf = VotingClassifier(
estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
voting='hard')
for clf, label in zip([clf1, clf2, clf3, eclf],
                      ['Logistic Regression', 'Random Forest', 'Naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, scoring='accuracy', cv=2)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
You should remove the G3 column from your X variable, since G3 is what you're trying to predict. Leaving it in means the target leaks into the features, so every model scores a perfect accuracy. Note that DataFrame.drop drops rows by default, so you need to specify that you're dropping a column:
X, y = data.drop(columns='G3'), data['G3']
With that change the cross-validated accuracy should reflect what the models actually learn from G1 and G2.
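To illustrate the difference, here is a small self-contained sketch. Since I don't have your student_data_grades_only.csv, it builds a synthetic frame with the shape you describe (650 rows, G1 and G2 in [1, 20], G3 derived from them via an assumed rule, G1 + G2 > 20); the point is only to compare scores with and without the leaked target column:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Synthetic stand-in for the CSV: G1, G2 in [1, 20], G3 derived from both
g1 = rng.integers(1, 21, size=650)
g2 = rng.integers(1, 21, size=650)
data = pd.DataFrame({'G1': g1, 'G2': g2,
                     'G3': (g1 + g2 > 20).astype(int)})  # assumed rule

clf = RandomForestClassifier(n_estimators=50, random_state=1)

# Leaky: G3 is still among the features, so the model can just read it off
leaky = cross_val_score(clf, data, data['G3'], scoring='accuracy', cv=2)

# Correct: features and target kept separate
X, y = data.drop(columns='G3'), data['G3']
honest = cross_val_score(clf, X, y, scoring='accuracy', cv=2)

print("with leak:   %.3f" % leaky.mean())
print("without leak: %.3f" % honest.mean())
```

The leaky score comes out at (or very near) 100%, while the honest score depends on how learnable G3 is from G1 and G2 alone, which is what you actually want to measure.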