python, classification, sentiment-analysis, naive-bayes, countvectorizer

My Naive Bayes classifier works for my model but will not accept user input in my application


I am trying to deploy my machine learning Naive Bayes sentiment analysis model in a web application. The idea is that the user types some text, the application performs sentiment analysis on it, and the text is stored with the assigned sentiment in another column of the database, to be displayed as a list via HTML later.

While the model and vectorizer work fine on Google Colab, when I load the model into my application and try to run the user input through it, it won't work. I have gotten many different error messages depending on which solutions I've tried.

The most recent is:

ValueError: DataFrame constructor not properly called!

But when I try to fix this I get other error messages such as:

'numpy.ndarray' object has no attribute 'lower'

or:

ValueError: X has 1 features, but MultinomialNB is expecting 26150 features as input.

or:

sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided

BASICALLY I don't know what I'm doing and I've been trying to figure it out for weeks. My hunch is that either the input coming from the user is not in a format the model can read, or the vectorizer is not being applied to the input correctly.

OR maybe my whole approach is wrong and there are steps I am missing. Any help with this would be massively appreciated.

My model code looks like this (after preprocessing):

#Split into training and testing data
x = df['text']
y = df['sentiment']

df1 = df[df["text"].notnull()]
x1 = df1['text']
y1 = df1['sentiment']

x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, test_size=0.2, random_state=30)

# Vectorize text
vec = CountVectorizer(stop_words='english')
x1 = vec.fit_transform(x1).toarray()
x1_test = vec.transform(x1_test).toarray()

df1 = df1.replace(r'^\s*$', np.nan, regex=True)

from sklearn.naive_bayes import MultinomialNB

sentiment_model = MultinomialNB()
sentiment_model.fit(x1, y1)
sentiment_model.score(x1_test, y1_test)

# Save model to disk
pickle.dump(sentiment_model, open('sentiment_model.pkl','wb'))

And my application code looks like this:

@app.route('/journal', methods=['GET', 'POST'])
def entry():
    if request.method == 'POST':
        journals = request.form
        
        entry_date = journals['entry_date']
        journal_entry = journals['journal_entry']

        vec = CountVectorizer(stop_words='english')
        sdf = pd.DataFrame('journal_entry')
        sdf = vec.fit_transform(sdf).toarray()
        sdf = vec.transform(sdf).toarray()

        sentiment = sentiment_model.predict(sdf)
        journals['sentiment'] = sentiment

        cur = mysql.connection.cursor()
        #insert the values with sentiment attribute into database
        cur.execute("INSERT INTO journals(entry_date, journal_entry, sentiment) VALUES(%s, %s, %s)",(entry_date, journal_entry, sentiment))
        mysql.connection.commit()
   
    return render_template('journal.html')

Solution

  • So it seems to me that there are multiple issues at play here.

    For one, sdf = pd.DataFrame('journal_entry') does not make sense -- you create the data frame from the literal string 'journal_entry', not from the contents of the variable. I suggest you get rid of the DataFrame in your entry function entirely; it is not a required input structure for sklearn objects, which happily accept a plain list of strings.

    Secondly, you're duplicating work by calling fit_transform and then transform again in your entry function. Calling fit_transform alone is sufficient, since it does two things: 1) it learns the vocabulary, and 2) it transforms the documents into a document-term matrix. (As the next point explains, though, you shouldn't be fitting at all at inference time.)

    Thirdly, you trained your model with a specific fitted CountVectorizer. That vectorizer turns each document into a vector whose length is fixed by the vocabulary learned at the time fit or fit_transform is called. Your Naive Bayes model was trained on vectors of that fixed size, so it complains when it receives a vector of a different size at inference time -- which happens because you re-initialize (and re-fit) CountVectorizer on every call to entry. You need to save the fitted CountVectorizer as well if you want to preserve the feature size.

    Also, I'd suggest adding a check in your entry function that makes sure the POST request actually contains valid strings for your algorithm.

    
    # load both CountVectorizer and the model 
    vec = pickle.load(open("my_count_vec.pkl", "rb"))
    sentiment_model = pickle.load(open("my_sentiment_model.pkl", "rb"))
    
    @app.route('/journal', methods=['GET', 'POST'])
    def entry():
        if request.method == 'POST':
            journals = request.form
            
            entry_date = journals['entry_date']
            journal_entry = journals['journal_entry']
            sdf = vec.transform([journal_entry]).reshape(1, -1)
            sentiment = sentiment_model.predict(sdf)
            ...
    

    sdf = vec.transform([journal_entry]).reshape(1, -1) assumes that journal_entry is a single string: it is wrapped in a list because transform expects an iterable of documents, and the reshape makes the single-sample shape (1, n_features) explicit for predict.
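    For the above to work, the training script has to pickle the fitted CountVectorizer alongside the model, under whatever file names your load calls use. A minimal self-contained sketch of that training side (the toy corpus and the my_count_vec.pkl / my_sentiment_model.pkl names here are placeholders, not your real data):

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for df1['text'] / df1['sentiment']
texts = ["I love this", "this is terrible", "what a great day", "awful experience"]
labels = ["pos", "neg", "pos", "neg"]

vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(texts)  # learns the vocabulary AND builds the document-term matrix

model = MultinomialNB()
model.fit(X, labels)

# Save BOTH the fitted vectorizer and the model
pickle.dump(vec, open('my_count_vec.pkl', 'wb'))
pickle.dump(model, open('my_sentiment_model.pkl', 'wb'))

# At inference time, reload both and only transform (never fit) the new text
vec2 = pickle.load(open('my_count_vec.pkl', 'rb'))
model2 = pickle.load(open('my_sentiment_model.pkl', 'rb'))
print(model2.predict(vec2.transform(["great day"])))
```

    Because the reloaded vectorizer carries the vocabulary learned at training time, every transformed input has exactly the feature count the model expects, regardless of how short the user's text is.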