
Train-test split and Naive Bayesian Classifiers


I have three sets of data: easy_ham, hard_ham, and spam, all of which contain collections of emails. I am trying to perform a train-test split on the datasets, using the training sets to train a classifier and evaluating its performance against the test sets; the data will later be used in two different Naïve Bayesian classifiers. I am testing it on easy_ham first, but I run into a problem: UndefinedMetricWarning:

Precision is ill-defined and being set to 0.0 due to no predicted samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))

UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 due to no true samples. Use zero_division parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))

I am very new to programming in general, and to data science especially, so I'm struggling to understand what my mistake is.

My code:

import tarfile
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix


def read_data(file):
    dataset = list()
    with tarfile.open(file) as tar:
        for member in tar:
            if member.isfile():
                with tar.extractfile(member) as f:
                    content = f.read()
                    decoded_content = None
                    if content is not None:
                        # try a few common encodings until one succeeds
                        charsets = ['utf-8', 'iso-8859-1', 'ascii']
                        for charset in charsets:
                            try:
                                decoded_content = content.decode(charset)
                                break
                            except UnicodeDecodeError:
                                continue
                    if decoded_content is not None:
                        dataset.append(decoded_content)
    return dataset


data = {
    'easy_ham': read_data('20021010_easy_ham.tar.bz2'),
    'hard_ham': read_data('20021010_hard_ham.tar.bz2'),
    'spam': read_data('20021010_spam.tar.bz2')
}
x1 = data.get('easy_ham')
x2 = data.get('hard_ham')
x3 = data.get('spam')
y1 = ['easy_ham'] * len(x1)
y2 = ['hard_ham'] * len(x2)
y3 = ['spam'] * len(x3)

vectorizer = CountVectorizer()


def preprocessing(x):
    # fit the shared vectorizer on this corpus and return a dense array
    v = vectorizer.fit_transform(x)
    v_array = v.toarray()
    return v_array


def train_test(x, y, size):
    v = preprocessing(x)
    X_train, X_test, y_train, y_test = train_test_split(v, y, test_size=size, random_state=42)
    return X_train, X_test, y_train, y_test


X_train1, X_test1, y_train1, y_test1 = train_test(x1, y1, 0.2)
X_train2, X_test2, y_train2, y_test2 = train_test(x2, y2, 0.2)
X_train3, X_test3, y_train3, y_test3 = train_test(x3, y3, 0.2)

print("Train-test split for easy ham:")
print("X_train1:", len(X_train1))
print("X_test1:", len(X_test1))
print("y_train1:", len(y_train1))
print("y_test1:", len(y_test1))

print("\nTrain-test split for hard ham:")
print("X_train2:", len(X_train2))
print("X_test2:", len(X_test2))
print("y_train2:", len(y_train2))
print("y_test2:", len(y_test2))

print("\nTrain-test split for spam:")
print("X_train3:", len(X_train3))
print("X_test3:", len(X_test3))
print("y_train3:", len(y_train3))
print("y_test3:", len(y_test3))


def multinomial_classifier(x_train, y_train, x_test, y_test):
    clf = MultinomialNB()
    clf.fit(x_train, y_train)
    predictions = clf.predict(x_test)
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, average='weighted')
    recall = recall_score(y_test, predictions, average='weighted')
    confusion_mat = confusion_matrix(y_test, predictions, labels=['easy_ham', 'spam'])
    return predictions, accuracy, precision, recall, confusion_mat


def bernoulli_classifier(x_train, y_train, x_test, y_test):
    clf = BernoulliNB()
    clf.fit(x_train, y_train)
    predictions = clf.predict(x_test)
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, pos_label='spam')
    recall = recall_score(y_test, predictions, pos_label='spam')
    confusion_mat = confusion_matrix(y_test, predictions, labels=['easy_ham', 'spam'])
    return predictions, accuracy, precision, recall, confusion_mat

predictions_e, accuracy_e, precision_e, recall_e, confusion_mat_e = bernoulli_classifier(X_train1, y_train1, X_test1, y_test1)
print(predictions_e)

I tried different approaches, changing the data I put into multinomial_classifier and performing the test on vectorized rather than raw data (I was getting ValueErrors for multinomial_classifier and bernoulli_classifier). The output of the train-test split:

Train-test split for easy ham:

X_train1: 2040
X_test1: 511
y_train1: 2040
y_test1: 511

Train-test split for hard ham:

X_train2: 200
X_test2: 50
y_train2: 200
y_test2: 50

Train-test split for spam:

X_train3: 400
X_test3: 101
y_train3: 400
y_test3: 101

Solution

  • Problem

    The warning tells you that the model did not predict any positive samples, so it makes no sense to compute precision or recall.
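
    These warnings fire whenever the positive label ('spam' in your setup) never appears among the true or the predicted labels. A minimal, self-contained reproduction, just to illustrate (this is not your code):

    from sklearn.metrics import precision_score, recall_score

    y_true = ['easy_ham'] * 5   # no true 'spam' samples
    y_pred = ['easy_ham'] * 5   # no predicted 'spam' samples either
    precision_score(y_true, y_pred, pos_label='spam')  # warning: no predicted samples -> 0.0
    recall_score(y_true, y_pred, pos_label='spam')     # warning: no true samples -> 0.0
    # passing zero_division=0 silences the warning, as the message suggests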

    Root cause

    1. Your data is imbalanced

    Most of the available data comes from one class and there are only a few examples of the other classes, so the model cannot properly learn the distribution of the minority classes.

    The best scenario is an equal number of examples from all classes, or at least close to equal: (1/2, 1/2), (1/3, 1/3, 1/3), etc. In your case it is approximately (2000/2600, 200/2600, 400/2600) = (0.77, 0.08, 0.15), so it is clearly imbalanced. You can mitigate this by weighting samples during training, as sketched below.
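
    For instance, a minimal sketch of sample weighting (assuming the x1/x2/x3 and y1/y2/y3 lists from your code, and that the three corpora are combined into one dataset, which is what the ratios above imply):

    from sklearn.utils.class_weight import compute_sample_weight

    # combine the three corpora so every class appears in train and test
    x_all = x1 + x2 + x3
    y_all = y1 + y2 + y3
    v_all = vectorizer.fit_transform(x_all)  # one shared vocabulary

    # stratify keeps the class proportions the same in both splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        v_all, y_all, test_size=0.2, random_state=42, stratify=y_all)

    # 'balanced' weight = n_samples / (n_classes * count(class))
    weights = compute_sample_weight('balanced', y_tr)
    clf = MultinomialNB()
    clf.fit(X_tr, y_tr, sample_weight=weights)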

    2. Choose an appropriate model and set sample weights

    You are trying to fit Naive Bayes models (multinomial and Bernoulli) that make strong assumptions about the underlying distribution of the features (multinomial word counts, Bernoulli-distributed binary features, etc.), as illustrated below.
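
    For example, the two variants interpret the same count matrix differently (binarize=0.0 is BernoulliNB's default, spelled out here only for illustration):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB

    X = np.array([[3, 0], [1, 2]])   # word counts for two tiny documents
    y = ['spam', 'easy_ham']
    MultinomialNB().fit(X, y)            # models the counts themselves
    BernoulliNB(binarize=0.0).fit(X, y)  # only sees presence/absence (X > 0)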

    I recommend going with non-parametric models such as trees: they do not assume any distribution of the data, and tree ensembles are robust to overfitting, so you do not have to worry about the statistical details for now. Tree methods usually outperform other models when no prior knowledge about the underlying statistical distributions is available, as in your case.

    Solution

    Here is an example with trees and class balancing.

    Following your function structure, try this:

    from sklearn.ensemble import RandomForestClassifier

    def random_forest_classifier(x_train, y_train, x_test, y_test):
        # 'balanced' reweights each class inversely to its frequency
        clf = RandomForestClassifier(class_weight='balanced')
        clf.fit(x_train, y_train)
        predictions = clf.predict(x_test)
        accuracy = accuracy_score(y_test, predictions)
        precision = precision_score(y_test, predictions, pos_label='spam')
        recall = recall_score(y_test, predictions, pos_label='spam')
        confusion_mat = confusion_matrix(y_test, predictions, labels=['easy_ham', 'spam'])
        return predictions, accuracy, precision, recall, confusion_mat

    predictions_e, accuracy_e, precision_e, recall_e, confusion_mat_e = random_forest_classifier(X_train1, y_train1, X_test1, y_test1)
    

    The class_weight='balanced' parameter tells the classifier to weight each class inversely proportionally to its frequency in the training data. It should mitigate the effects of the imbalance; you can inspect the weights it produces, as shown below.
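
    A quick check of the weights that 'balanced' computes, which follow the formula n_samples / (n_classes * count(class)) (assuming the y1/y2/y3 label lists from your code):

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    y_all = y1 + y2 + y3
    classes = np.unique(y_all)
    weights = compute_class_weight('balanced', classes=classes, y=y_all)
    print(dict(zip(classes, weights)))  # rarer classes get larger weights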