I'm very new to Python and need to calculate the ROC curve and AUC of two binary classification models on NLP data. I can't get my head around sparse vs. dense arrays (I understand that sparse arrays contain mostly zeros and dense arrays do not), data shape, and dimensionality.
I think I can produce pretty good preprocessed data, but feeding it to my classifiers in a form they can read has me stymied.
In my code below, you'll note that I have tried more than one train/test split. I get
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
if I don't convert x and y to dense.
I get
ValueError: y should be a 1d array, got an array of shape (1594, 286579) instead
UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
when I do the dense conversion.
And I get
ValueError: Found input variables with inconsistent numbers of samples: [1594, 399]
when (if I'm remembering correctly) using the commented out train test split.
Here is my messy, redundant code:
import joblib
import re
import string
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
categories = ['rec.sport.baseball', 'rec.sport.hockey']
news_group_data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"), categories=categories)
df = pd.DataFrame(dict(text=news_group_data["data"],target=news_group_data["target"]))
df["target"] = df.target.map(lambda x: categories[x])
def process_text(text):
    text = str(text).lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    text = " ".join(text.split())
    return text
df["clean_text"] = df.text.map(process_text)
#df_train, df_test = train_test_split(df, test_size=0.20, stratify=df.target)
vec = CountVectorizer(ngram_range=(1, 3), stop_words="english",)
x = vec.fit_transform(df.clean_text)
y = vec.transform(df.clean_text)
#X = vec.fit_transform(df_train.clean_text)
#Y = vec.transform(df_test.clean_text)
X = x.toarray()
Y = y.toarray()
#y_train = df_train.target
#y_test = df_test.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2)
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features=5,
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, #min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
nb = GaussianNB()
nb.fit(X_train, Y_train)
r_probs = [0 for _ in range(len(Y_test))]
rf_probs = rf.predict_proba(X_test)
nb_probs = nb.predict_proba(X_test)
rf_probs = rf_probs[:, 1]
nb_probs = nb_probs[:, 1]
from sklearn.metrics import roc_curve, roc_auc_score
r_auc = roc_auc_score(Y_test, r_probs)
rf_auc = roc_auc_score(Y_test, rf_probs)
nb_auc = roc_auc_score(Y_test, nb_probs)
print('Random (chance) Prediction: AUROC = %.3f' % (r_auc))
print('Random Forest: AUROC = %.3f' % (rf_auc))
print('Naive Bayes: AUROC = %.3f' % (nb_auc))
r_fpr, r_tpr, _ = roc_curve(Y_test, r_probs)
rf_fpr, rf_tpr, _ = roc_curve(Y_test, rf_probs)
nb_fpr, nb_tpr, _ = roc_curve(Y_test, nb_probs)
import matplotlib.pyplot as plt
plt.plot(r_fpr, r_tpr, linestyle='--', label='Random prediction (AUROC = %0.3f)' % r_auc)
plt.plot(rf_fpr, rf_tpr, marker='.', label='Random Forest (AUROC = %0.3f)' % rf_auc)
plt.plot(nb_fpr, nb_tpr, marker='.', label='Naive Bayes (AUROC = %0.3f)' % nb_auc)
plt.title('ROC Plot')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
The problem is that you are not using the correct target. You are encoding the same text twice with the CountVectorizer in these lines:
x = vec.fit_transform(df.clean_text)
y = vec.transform(df.clean_text)
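To see why this produces the "y should be a 1d array" error, here is a minimal sketch on a made-up four-document corpus (the `docs` list is hypothetical and only stands in for `df.clean_text`). It shows that `fit_transform` and `transform` on the same text return the same 2-D count matrix, so `y` ends up as a second copy of the features rather than a vector of labels:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for df.clean_text
docs = [
    "the pitcher threw a fastball",
    "the goalie stopped the puck",
    "home run in the ninth inning",
    "power play goal on the ice",
]

vec = CountVectorizer()
x = vec.fit_transform(docs)  # learns the vocabulary, returns a sparse count matrix
y = vec.transform(docs)      # re-encodes the same text with the same vocabulary

# Both have shape (n_documents, n_vocabulary_terms): y is 2-D, not a label vector
print(x.shape, y.shape)

# The two matrices hold identical counts
same = np.array_equal(x.toarray(), y.toarray())
print(same)
```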
Instead, you should encode the binary class in df.target as the target for the model (your Y):
def labeling(v):
    if v == categories[0]:
        return 0
    return 1
df["target_encod"] = df.target.map(labeling)
After that, you can use the correct Y for your machine learning problem:
X = x.toarray()
Y = df["target_encod"].values
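As a rough illustration of how the shapes line up once Y holds labels, the sketch below runs the split on a small invented corpus (the `texts` and `targets` lists are placeholders, not the newsgroup data). The `stratify` and `random_state` arguments are my additions for a reproducible, balanced split and are not part of the original code. Note that `train_test_split` also accepts the sparse matrix directly; the `.toarray()` call is only needed for estimators such as GaussianNB that require dense input.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

categories = ["rec.sport.baseball", "rec.sport.hockey"]

# Hypothetical toy data: five documents per class
texts = [
    "pitcher threw a fastball", "home run in the ninth inning",
    "the shortstop turned a double play", "bases loaded with two outs",
    "the catcher dropped the ball",
    "goalie stopped the puck", "power play goal on the ice",
    "the defenseman took a slap shot", "overtime win after a faceoff",
    "penalty box for high sticking",
]
targets = [categories[0]] * 5 + [categories[1]] * 5


def labeling(v):
    # Same encoding as above: first category -> 0, the other -> 1
    if v == categories[0]:
        return 0
    return 1


vec = CountVectorizer()
X = vec.fit_transform(texts)                     # sparse, shape (10, n_terms)
Y = np.array([labeling(t) for t in targets])     # 1-D, shape (10,)

# stratify keeps both classes in the test set, which roc_auc_score requires
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0
)

print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
```

The first dimension of X and the length of Y now match in every split, which is what the "inconsistent numbers of samples" error was complaining about.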
My result after the changes:
For the next issue, you forgot to assign the RandomForestClassifier instance to a variable. You wrote:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features=5,
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, #min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
instead of
rf = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features=5,
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, #min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
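Note also that the code in the question fits `nb` but never calls `rf.fit`, so `rf.predict_proba` would still fail on an unfitted model. Putting the pieces together, a minimal end-to-end sketch might look like the following. It uses an invented ten-document corpus instead of the newsgroup data, a smaller forest (`n_estimators=50`), and fixed `random_state` values; all of those are assumptions made to keep the example small and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: label 0 = baseball, label 1 = hockey
texts = [
    "pitcher threw a fastball", "home run in the ninth inning",
    "the shortstop turned a double play", "bases loaded with two outs",
    "the catcher dropped the ball",
    "goalie stopped the puck", "power play goal on the ice",
    "the defenseman took a slap shot", "overtime win after a faceoff",
    "penalty box for high sticking",
]
labels = np.array([0] * 5 + [1] * 5)

# GaussianNB requires dense input, hence toarray()
X = CountVectorizer().fit_transform(texts).toarray()

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.4, stratify=labels, random_state=0
)

# Assign each estimator to a variable and fit both of them
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)
nb = GaussianNB()
nb.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of the positive class (label 1)
rf_probs = rf.predict_proba(X_test)[:, 1]
nb_probs = nb.predict_proba(X_test)[:, 1]

rf_auc = roc_auc_score(y_test, rf_probs)
nb_auc = roc_auc_score(y_test, nb_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
nb_fpr, nb_tpr, _ = roc_curve(y_test, nb_probs)

print('Random Forest: AUROC = %.3f' % rf_auc)
print('Naive Bayes: AUROC = %.3f' % nb_auc)
```

On data this small the AUROC values mean little; the point is only that both models are fitted and that `y_test` is a 1-D array containing both classes.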