Tags: python, machine-learning, classification, adaboost

AdaBoost voting: data and target form in Python


I'm trying to test this implementation of a voting AdaBoost classifier.

My data set consists of 650 triples (G1, G2, G3), where G1 and G2 are integers in [1, 20] and G3 is either 1 or 0, determined by G1 and G2.

From what I've read, cross_val_score splits the input data into training and test groups by itself, but I'm doing the X, y initialization wrong. If I initialize X with the whole data set, the accuracy is 100%, which seems a bit off.

I've tried putting only the G3 value in y, but I got the same result.

Normally I split the data into training and testing sets myself, and that makes things easier.

I don't have much experience with Python or machine learning, but I decided to give it a try.

Could you please explain what X and y initialization should look like for this to work properly?

import os
import subprocess
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

input_file = "student_data_grades_only.csv"

data = pd.read_csv(input_file, header = 0)

X, y = data, data['G3']
print(X,y)

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()

eclf = VotingClassifier(
    estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='hard')

for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, scoring='accuracy', cv=2)
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Solution

  • You should remove the G3 column from your X variable, as this is what you're trying to predict. Keeping it in X leaks the answer to the classifiers, which is why you see 100% accuracy.

    X, y = data.drop('G3', axis=1), data['G3']
    

    Note that drop needs axis=1 (or columns='G3') to drop a column rather than a row. With that change the leakage is gone and your cross-validation scores should become realistic.
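
    Since I don't have your student_data_grades_only.csv, here is a runnable sketch using a synthetic data set of the shape you describe (650 rows, G1 and G2 in [1, 20], G3 derived from them) to show the corrected X, y initialization end to end:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier

    # Hypothetical stand-in for the CSV: 650 triples with G1, G2 in [1, 20]
    # and G3 a 0/1 label derived from them (the exact rule is an assumption).
    rng = np.random.RandomState(1)
    data = pd.DataFrame({'G1': rng.randint(1, 21, 650),
                         'G2': rng.randint(1, 21, 650)})
    data['G3'] = (data['G1'] + data['G2'] > 20).astype(int)

    # Features must exclude the target column; otherwise the
    # classifiers see the answer and score ~100%.
    X = data.drop('G3', axis=1)
    y = data['G3']

    eclf = VotingClassifier(
        estimators=[('lr', LogisticRegression(random_state=1)),
                    ('rf', RandomForestClassifier(n_estimators=50, random_state=1)),
                    ('gnb', GaussianNB())],
        voting='hard')

    # cross_val_score does the train/test splitting internally (cv=2 folds).
    scores = cross_val_score(eclf, X, y, scoring='accuracy', cv=2)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
    ```

    With X reduced to G1 and G2, the accuracy reflects genuine generalization instead of leakage.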