I'm very new to Python and need to calculate the ROC curve and AUC of two binary classification models using NLP data. I can't seem to get my head around sparse vs. dense arrays (I get that sparse arrays contain mostly zeros and dense arrays do not), data shape, and dimensionality.
I think I can produce pretty good preprocessed data, but inputting that into my classifiers in a way they can read has me stymied.
In my code below, you'll note that I have tried more than one train/test split. If I don't convert x and y to dense, I get
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I do the dense conversion, I get
ValueError: y should be a 1d array, got an array of shape (1594, 286579) instead
UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
And when (if I'm remembering correctly) I use the commented-out train/test split, I get
ValueError: Found input variables with inconsistent numbers of samples: [1594, 399]
Here is my messy, redundant code:
import joblib
import re
import string
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
categories = ['rec.sport.baseball', 'rec.sport.hockey']
news_group_data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"), categories=categories)
df = pd.DataFrame(dict(text=news_group_data["data"],target=news_group_data["target"]))
df["target"] = df.target.map(lambda x: categories[x])
def process_text(text):
    text = str(text).lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = " ".join(text.split())
    return text
df["clean_text"] = df.text.map(process_text)
#df_train, df_test = train_test_split(df, test_size=0.20, stratify=df.target)
vec = CountVectorizer(ngram_range=(1, 3), stop_words="english",)
x = vec.fit_transform(df.clean_text)
y = vec.transform(df.clean_text)
#X = vec.fit_transform(df_train.clean_text)
#Y = vec.transform(df_test.clean_text)
X = x.toarray()
Y = y.toarray()
#y_train = df_train.target
#y_test = df_test.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2,
random_state=0)
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features=5,
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, #min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
nb = GaussianNB()
nb.fit(X_train, Y_train)
r_probs = [0 for _ in range(len(Y_test))]
rf_probs = rf.predict_proba(X_test)
nb_probs = nb.predict_proba(X_test)
rf_probs = rf_probs[:, 1]
nb_probs = nb_probs[:, 1]
from sklearn.metrics import roc_curve, roc_auc_score
r_auc = roc_auc_score(Y_test, r_probs)
rf_auc = roc_auc_score(Y_test, rf_probs)
nb_auc = roc_auc_score(Y_test, nb_probs)
print('Random (chance) Prediction: AUROC = %.3f' % (r_auc))
print('Random Forest: AUROC = %.3f' % (rf_auc))
print('Naive Bayes: AUROC = %.3f' % (nb_auc))
r_fpr, r_tpr, _ = roc_curve(Y_test, r_probs)
rf_fpr, rf_tpr, _ = roc_curve(Y_test, rf_probs)
nb_fpr, nb_tpr, _ = roc_curve(Y_test, nb_probs)
import matplotlib.pyplot as plt
plt.plot(r_fpr, r_tpr, linestyle='--', label='Random prediction (AUROC = %0.3f)' % r_auc)
plt.plot(rf_fpr, rf_tpr, marker='.', label='Random Forest (AUROC = %0.3f)' % rf_auc)
plt.plot(nb_fpr, nb_tpr, marker='.', label='Naive Bayes (AUROC = %0.3f)' % nb_auc)
plt.title('ROC Plot')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
The problem is that you are not using the correct target. You are basically encoding the text twice with the CountVectorizer, in these lines:
x = vec.fit_transform(df.clean_text)
y = vec.transform(df.clean_text)
Instead, you should encode the binary class in df.target and use it as the target for the model (your Y):
def labeling(v):
    if v == categories[0]:
        return 0
    else:
        return 1

df["target_encod"] = df.target.map(labeling)
print(df['target_encod'])
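Equivalently, since there are only two categories, the same 0/1 encoding can be written without a helper function; this is just an alternative sketch, not something you have to change:
# same result as labeling(): 1 for the second category, 0 for the first
df["target_encod"] = (df.target == categories[1]).astype(int)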
After that, you can use the correct y for your machine learning problem:
X = x.toarray()
Y = df["target_encod"].values
My result after the changes:
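In case it helps, here is a minimal sketch of the rest of the pipeline with the corrected target. It assumes the dense X and the 1-D Y defined above and keeps your GaussianNB and RandomForestClassifier choices; the hyperparameters are only illustrative:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# split the features together with the 1-D encoded labels
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)

# note the assignment to rf (see the next point below)
rf = RandomForestClassifier(n_estimators=500, max_features=5, random_state=0)
rf.fit(X_train, Y_train)

nb = GaussianNB()
nb.fit(X_train, Y_train)

# probability of the positive class (column 1) for each model
rf_probs = rf.predict_proba(X_test)[:, 1]
nb_probs = nb.predict_proba(X_test)[:, 1]

print('Random Forest: AUROC = %.3f' % roc_auc_score(Y_test, rf_probs))
print('Naive Bayes:   AUROC = %.3f' % roc_auc_score(Y_test, nb_probs))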
For the next part, you forgot to assign the RandomForestClassifier instance to a variable. You wrote
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features=5,
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, #min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
instead of
rf = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features=5,
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, #min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
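Finally, about the sparse vs. dense confusion: CountVectorizer returns a SciPy sparse matrix, and the TypeError you quoted most likely comes from GaussianNB, which only accepts dense arrays. RandomForestClassifier and MultinomialNB (which you already import) both accept sparse input, so a sketch like the one below skips the memory-hungry x.toarray() call entirely (it assumes the vec, df and target_encod column from above):
from scipy.sparse import issparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_sparse = vec.fit_transform(df.clean_text)   # scipy CSR matrix: only the non-zero counts are stored
y = df["target_encod"].values                 # plain 1-D numpy array of 0/1 labels
print(issparse(X_sparse), X_sparse.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(
    X_sparse, y, test_size=0.2, random_state=0, stratify=y)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)                      # tree ensembles handle sparse matrices
nb = MultinomialNB()
nb.fit(X_train, y_train)                      # MultinomialNB works directly on sparse counts

for name, model in [('Random Forest', rf), ('Multinomial NB', nb)]:
    probs = model.predict_proba(X_test)[:, 1] # probability of class 1
    print('%s: AUROC = %.3f' % (name, roc_auc_score(y_test, probs)))
MultinomialNB is usually a better match for raw word counts anyway; GaussianNB assumes continuous, roughly Gaussian features.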