Tags: python, scikit-learn, gensim, word2vec, naive-bayes

Negative values in data passed to MultinomialNB when vectorizing with Word2Vec


I am currently working on a project where I'm attempting to use Word2Vec embeddings as features for a Multinomial Naive Bayes (MultinomialNB) classifier and then compute its accuracy.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score
from datasets import load_dataset

df = load_dataset('celsowm/bbc_news_ptbr', split='train')
X = df['texto']
y = df['categoria']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

sentences = [sentence.split() for sentence in X_train]
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [w2v_model.wv[word] for word in words if word in w2v_model.wv]
    if len(words_vecs) == 0:
        return np.zeros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])
clf = MultinomialNB()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='weighted'))  # multiclass labels, so average instead of pos_label

However, I've encountered an error:

ValueError: Negative values in data passed to MultinomialNB (input X)

I would appreciate any insights into resolving this issue.


Solution

    The Error

    Each Word2Vec embedding is a vector whose elements can take any real value. This means that even after you average the word vectors of a sentence, the final feature vector may still contain negative values.

    Negative values are not a problem in themselves. However, they are incompatible with Multinomial Naive Bayes (MNB).

    Why? MNB assumes that the features follow a multinomial distribution, a generalization of the binomial distribution to more than two outcomes. It models counts of events (for example, how often each word occurs in a document), so every feature value must be a non-negative count. That is why scikit-learn complains when MNB receives negative values.
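
    A minimal sketch that reproduces the error (the data here is illustrative):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    X = np.array([[0.5, -0.2], [0.1, 0.3]])  # one negative feature value
    y = [0, 1]
    MultinomialNB().fit(X, y)  # raises ValueError: Negative values in data passed to MultinomialNB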

    The Solution

    If you want to keep the model as MNB, you will have to do away with the negative values. Some ideas: rescale every feature into a non-negative range (e.g., with MinMaxScaler, as shown in the Code section below), or shift the features by their minimum so the smallest value becomes zero (sketched right below).
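
    A minimal sketch of the shifting idea (the shift must be computed on the training data and reused on the test data):

    shift = X_train.min()                          # most negative value seen during training
    X_train_shifted = X_train - shift              # all training features are now >= 0
    X_test_shifted = (X_test - shift).clip(min=0)  # clip in case the test data dips lower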

    You can also change the vectorization method from Word2Vec to CountVectorizer or TfidfVectorizer. TF-IDF will work even though it produces fractional values in the final vector: MNB is formally defined over integer counts, not fractions, but it handles fractional "counts" well in practice. A sketch follows.
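
    A minimal sketch of a TF-IDF pipeline. Note that it consumes the raw text columns, so the Word2Vec vectorize step is skipped:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.naive_bayes import MultinomialNB

    # TF-IDF features are always non-negative, so MultinomialNB accepts them
    clf = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('clf', MultinomialNB()),
    ])
    clf.fit(X_train, y_train)  # X_train here is the raw list of texts, not embeddings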

    If you are okay with using another model, try one that accepts real-valued (and negative) features, such as GaussianNB or LogisticRegression; see the sketch below.
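
    A minimal sketch using GaussianNB, which models real-valued features and therefore accepts the averaged Word2Vec vectors unchanged:

    from sklearn.naive_bayes import GaussianNB

    # GaussianNB assumes each feature is normally distributed per class,
    # so negative values in the averaged Word2Vec vectors are fine
    clf = GaussianNB()
    clf.fit(X_train, y_train)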

    Code

    Example using MinMaxScaler:

    Just switch the line clf = MultinomialNB() to the following.

    from sklearn.preprocessing import MinMaxScaler
    from sklearn.pipeline import Pipeline

    clf = Pipeline([
        ('scaler', MinMaxScaler()),  # rescales each feature to [0, 1], removing negatives
        ('clf', MultinomialNB()),
    ])
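
    MinMaxScaler maps each feature to the [0, 1] range, so MultinomialNB never sees a negative value. Keep in mind that the probabilistic interpretation of MNB on scaled embeddings is loose, so treat this as a pragmatic workaround rather than a principled model.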
    

    Transformers

    Depending on your task, you might also want to check out transformers. They come with their own vectorization, generating dense semantic embeddings instead of working at the purely syntactic level that word2vec does. These models are much bigger and computationally more expensive, but they usually produce much better results when classical machine learning models cannot reach a satisfactory accuracy. A sketch follows.
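
    A minimal sketch using the sentence-transformers library; the model name is an assumption (any multilingual encoder would work for the Portuguese texts), and the classifier is LogisticRegression since the embeddings are real-valued:

    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    encoder = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')  # assumed model name
    X_train_emb = encoder.encode(list(X_train))  # dense semantic embeddings (real-valued)
    X_test_emb = encoder.encode(list(X_test))

    clf = LogisticRegression(max_iter=1000)  # handles negative feature values
    clf.fit(X_train_emb, y_train)
    print('Accuracy:', clf.score(X_test_emb, y_test))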


    Feel free to ask any questions!