I am trying to deploy my machine learning Naive Bayes sentiment analysis model in a web application. The idea is that the user types some text, the application performs sentiment analysis on it, and the text is stored together with the assigned sentiment in another column of the database, to be displayed as a list via HTML later.
While the model and vectorizer work fine on Google Colab, when I load the model into my application and try to run the user input through it, it won't work. I have gotten many different error messages depending on the solutions I've tried.
The most recent is:
ValueError: DataFrame constructor not properly called!
But when I try to fix this I get other error messages such as:
'numpy.ndarray' object has no attribute 'lower'
or:
ValueError: X has 1 features, but MultinomialNB is expecting 26150 features as input.
or:
sklearn.exceptions.NotFittedError: Vocabulary not fitted or provided
BASICALLY I don't know what I'm doing and I've been trying to figure it out for weeks. My hunch is that either the format of the user input is not readable by the model, or the vectorizer is not working on the input.
OR maybe my whole approach is wrong and there are steps I'm missing. Any help with this would be massively appreciated.
My model code looks like this (after preprocessing):
#Split into training and testing data
x = df['text']
y = df['sentiment']
df1 = df[df["text"].notnull()]
x1 = df1['text']
y1 = df1['sentiment']
x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1, test_size=0.2, random_state=30)
# Vectorize text
vec = CountVectorizer(stop_words='english')
x1 = vec.fit_transform(x1).toarray()
x1_test = vec.transform(x1_test).toarray()
df1 = df1.replace(r'^\s*$', np.nan, regex=True)
from sklearn.naive_bayes import MultinomialNB
sentiment_model = MultinomialNB()
sentiment_model.fit(x1, y1)
sentiment_model.score(x1_test, y1_test)
# Save model to disk
pickle.dump(sentiment_model, open('sentiment_model.pkl','wb'))
And my application code looks like this:
@app.route('/journal', methods=['GET', 'POST'])
def entry():
    if request.method == 'POST':
        journals = request.form
        entry_date = journals['entry_date']
        journal_entry = journals['journal_entry']
        vec = CountVectorizer(stop_words='english')
        sdf = pd.DataFrame('journal_entry')
        sdf = vec.fit_transform(sdf).toarray()
        sdf = vec.transform(sdf).toarray()
        sentiment = sentiment_model.predict(sdf)
        journals['sentiment'] = sentiment
        cur = mysql.connection.cursor()
        # insert the values with sentiment attribute into database
        cur.execute("INSERT INTO journals(entry_date, journal_entry, sentiment) VALUES(%s, %s, %s)", (entry_date, journal_entry, sentiment))
        mysql.connection.commit()
    return render_template('journal.html')
So it seems to me that there are multiple issues at play here.
For one, sdf = pd.DataFrame('journal_entry') does not make sense -- you create the DataFrame from the literal string 'journal_entry', not the actual contents of the variable. I suggest you get rid of the DataFrame in your entry function entirely, as it is not a required input structure for sklearn objects.
Secondly, you're duplicating functionality by calling fit_transform and then transform again in your entry function. Calling fit_transform alone is sufficient, as it does two things: 1) it learns the vocabulary, and 2) it transforms the documents into a document-term matrix.
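To see the difference on toy data (these example documents are mine, not from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog barked"]
vec = CountVectorizer()

# fit_transform learns the vocabulary AND returns the document-term matrix
X = vec.fit_transform(docs)

# transform reuses the already-learned vocabulary on new text,
# so the feature count stays the same
X_new = vec.transform(["the cat barked"])
print(X.shape, X_new.shape)
```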
Thirdly, you trained your model using a specific CountVectorizer. A vectorizer transforms each document into a vector whose size is fixed at the time you call fit or fit_transform, and your Naive Bayes model was trained on vectors of that fixed size. Hence, it complains when it gets a different-sized vector at inference time -- which happens because you're re-initializing CountVectorizer on every entry call. You need to save the fitted CountVectorizer as well if you want to preserve the feature size.
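On the training side, that means pickling the fitted vectorizer next to the model. A minimal sketch with stand-in data (only the pickle filenames come from the loading code; the documents and labels here are invented):

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# stand-ins for the real training data
docs = ["happy great day", "sad awful day"]
labels = ["positive", "negative"]

vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)
sentiment_model = MultinomialNB().fit(X, labels)

# save the FITTED vectorizer alongside the model so inference
# can reuse the same vocabulary (and thus the same feature size)
with open('my_count_vec.pkl', 'wb') as f:
    pickle.dump(vec, f)
with open('my_sentiment_model', 'wb') as f:
    pickle.dump(sentiment_model, f)

# reloading preserves the learned vocabulary
vec2 = pickle.load(open('my_count_vec.pkl', 'rb'))
print(vec2.transform(["a great day"]).shape[1] == X.shape[1])
```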
Also, I'd suggest adding a check in your entry function to make sure the POST request contains a valid string for your algorithm.
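For instance, a small helper (the name valid_entry is mine, not part of your code):

```python
def valid_entry(journal_entry):
    """True only for a non-empty, non-whitespace string --
    anything else would confuse the vectorizer or the model."""
    return isinstance(journal_entry, str) and bool(journal_entry.strip())

# inside entry(), before vectorizing:
#     if not valid_entry(journals.get('journal_entry')):
#         return render_template('journal.html')
```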
# load both CountVectorizer and the model once, at startup
vec = pickle.load(open("my_count_vec.pkl", "rb"))
sentiment_model = pickle.load(open("my_sentiment_model", "rb"))

@app.route('/journal', methods=['GET', 'POST'])
def entry():
    if request.method == 'POST':
        journals = request.form
        entry_date = journals['entry_date']
        journal_entry = journals['journal_entry']
        sdf = vec.transform([journal_entry]).reshape(1, -1)
        sentiment = sentiment_model.predict(sdf)
        ...
The line sdf = vec.transform([journal_entry]).reshape(1, -1) assumes that journal_entry is a single string, hence it is wrapped in a list and shaped into a single row for further processing.
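You can check the shapes yourself on a toy vocabulary (the fitted texts here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["good day today", "bad day today"])

# wrapping the string in a list marks it as ONE document;
# recent scikit-learn versions raise a ValueError if you pass
# a bare string instead of an iterable of documents
row = vec.transform(["a good day"])
print(row.shape)  # one row, one column per vocabulary term
```

Since transform already returns a 2-D matrix with a single row for a single document, the extra reshape(1, -1) should be a no-op here.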