I initially trained my model using an XGBoost classifier and everything worked fine. Now I am trying to train the model on the same data set, again with an XGBoost classifier, but I am running into this error: OSError: exception: access violation reading 0x0000000000000008.
This time around, I am using sklearn's bootstrapping method to randomly sample from the dataset. I first split the data set into a train set and a test set. Then I randomly sampled from the train and test sets to create 50 samples each for training and testing respectively.
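A minimal sketch of that sampling step, assuming `sklearn.utils.resample` is the bootstrapping method meant here. The toy DataFrame, its column values, and the split ratio are hypothetical stand-ins for the real data set:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in for the real data set (hypothetical values).
data = pd.DataFrame({
    "age": list(range(20, 40)),
    "depression": [0, 1] * 10,
})

# Split once, then bootstrap each side separately.
train_set, test_set = train_test_split(data, test_size=0.25, random_state=0)

n_samples = 50  # number of bootstrap samples per side

# Each sample is drawn with replacement and has the same size as its source.
train_samples = [
    resample(train_set, replace=True, n_samples=len(train_set), random_state=i)
    for i in range(n_samples)
]
test_samples = [
    resample(test_set, replace=True, n_samples=len(test_set), random_state=i)
    for i in range(n_samples)
]
```

Each element of `train_samples` and `test_samples` could then be written out with `to_csv` to produce files like the ones read in the loop below.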
The error is raised at the .fit() line.
How can I fix this error?
When I run the model outside the for loop, everything works fine, but when I run it on the bootstrap samples inside the loop, I get the error again.
# Read each file and do analysis
for i in range(50):
    # Read train and test data
    train_data = pd.read_csv(train_path + "\\" + "train" + str(i) + ".csv")
    test_data = pd.read_csv(test_path + "\\" + "test" + str(i) + ".csv")
    # Convert gender to binary
    train_data['gender'] = train_data['gender'].map({1:1, 2:0})
    test_data['gender'] = test_data['gender'].map({1:1, 2:0})
    # Apply StandardScaler to numerical columns
    sc = StandardScaler()
    train_data[['age', 'RXDCOUNT', 'income', 'RXDDAYS', 'ALQ130', 'OCD270', 'BMXBMI', 'BMXHT', 'BMXWT']] = sc.fit_transform(train_data[['age', 'RXDCOUNT', 'income', 'RXDDAYS', 'ALQ130', 'OCD270', 'BMXBMI', 'BMXHT', 'BMXWT']])
    test_data[['age', 'RXDCOUNT', 'income', 'RXDDAYS', 'ALQ130', 'OCD270', 'BMXBMI', 'BMXHT', 'BMXWT']] = sc.fit_transform(test_data[['age', 'RXDCOUNT', 'income', 'RXDDAYS', 'ALQ130', 'OCD270', 'BMXBMI', 'BMXHT', 'BMXWT']])
    # Create X_train, X_test, y_train, y_test
    y_train = train_data["depression"]
    y_test = test_data["depression"]
    X_train = train_data.drop("depression", axis=1, inplace=True)
    X_test = test_data.drop("depression", axis=1, inplace=True)
    #print(y_train)
    # Create model
    model = XGBClassifier(use_label_encoder=False)
    # Fit model with train data
    _ = model.fit(X_train, y_train)
    # Predict on test set
    y_pred = model.predict(X_test)
    # Get accuracy of model
    acc = model.score(X_test, y_test)
    # Get balanced accuracy
    balAcc = balanced_accuracy_score(y_test, y_pred)
    # Get ROC AUC
    roc_auc = roc_auc_score(y_true=y_test, y_score=model.predict_proba(X_test)[:,1])
    # Add y_pred to test set
    predict_dataframe = prediction_dataframe(test_data, y_pred)
    # Define protected attributes
    p_attr1 = "gender"
    p_attr2 = "ethnicity"
    # Compute TP, FP, TN, FN based on a single protected attribute
    tp, fp, tn, fn = compute_metrics_s(predict_dataframe, p_attr1)
    # Compute TPR based on a single protected attribute
    tpr_male = list(tp.values())[0] / np.add(list(tp.values())[0], list(fn.values())[0])
    tpr_female = list(tp.values())[1] / np.add(list(tp.values())[1], list(fn.values())[1])
    EOD = np.subtract(tpr_male, tpr_female)
    dic_data["roc_auc"].append(roc_auc)
    dic_data["bacc"].append(balAcc)
    dic_data["EOD"].append(EOD)
    dic_data["tpr_male"].append(tpr_male)
    dic_data["tpr_female"].append(tpr_female)
    i += 1
    if i == 49:
        df = pd.DataFrame.from_dict(dic_data)
        df.to_csv(r"results\dataframe\suppression\gender.csv", index=True)
The issue was that X_train and X_test were both None: with inplace=True, drop modifies the DataFrame in place and returns None, so None was being passed to .fit(). When I changed the following lines:
    X_train = train_data.drop("depression", axis=1, inplace=True)
    X_test = test_data.drop("depression", axis=1, inplace=True)
to:
    X_train = train_data.drop("depression", axis=1)
    X_test = test_data.drop("depression", axis=1)
the problem was solved.
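A minimal sketch of the difference, using a toy DataFrame as a hypothetical stand-in for one of the train CSV files. It only illustrates what `drop` returns in each case and does not involve XGBoost:

```python
import pandas as pd

# Toy stand-in for one of the train CSV files (hypothetical values).
train_data = pd.DataFrame({
    "age": [25, 40, 31],
    "gender": [1, 0, 1],
    "depression": [0, 1, 0],
})

# With inplace=True, drop mutates the frame it is called on and returns None.
# A copy is used here so that train_data itself stays untouched.
returned = train_data.copy().drop("depression", axis=1, inplace=True)
print(returned)  # None

# Without inplace, drop returns a new DataFrame and leaves the original alone.
X_train = train_data.drop("depression", axis=1)
y_train = train_data["depression"]

print(list(X_train.columns))                # ['age', 'gender']
print("depression" in train_data.columns)   # True
```

A quick check such as `assert X_train is not None` just before `.fit()` would surface this kind of mistake as a plain Python error instead of a crash inside the native library.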