python machine-learning text-classification naivebayes machine-learning-model

Machine learning model predicts training labels themselves as result

I am trying to build a model that predicts "species" from the features "message", "fingers", and "tail", with "species" as the label (see the first few rows of data.csv below):

message	fingers	tail	species
pluvia arbor aquos	4	no	Aquari
cosmix xeno nebuz odbitaz	5	yes	Zorblax
solarix glixx novum galaxum quasar	5	yes	Zorblax
arbor insectus pesros ekos dootix nimbus	2	yes	Florian

My code is:

import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("data.csv")
X = np.asarray(df[["message", "fingers", "tail"]])
X = [str (item) for item in X]
y = df["species"]

le = LabelEncoder()
y = le.fit_transform(y)

cv = CountVectorizer()
X = cv.fit_transform(X).toarray()

model = MultinomialNB()
model.fit(X, y)

test_data = pd.read_csv('test.csv')
test_data_array = np.asarray(df[["message", "fingers", "tail"]])
test_data_array = [str (item) for item in test_data_array]
test_data_array = cv.fit_transform(test_data_array).toarray()

y_prediction = model.predict(test_data_array)
y_prediction = le.inverse_transform(y_prediction)

print(y_prediction)

I followed this tutorial for the same.

The problem is that when I run it, the output is essentially the species column of the original training data, word for word apart from a few differences: I get 493 results, although the test data has 299 entries and the training data has 500. It doesn't actually predict anything for the test data. I don't understand why the code doesn't work. Could someone help out?

Solution

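The problem is that you read the test data into test_data, but then build the test set from df, the original DataFrame that holds the training data. The model is therefore predicting on the rows it was trained on, which is why the output mirrors the training species column.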

Change this line:

test_data_array = np.asarray(df[["message", "fingers", "tail"]])

To:

test_data_array = np.asarray(test_data[["message", "fingers", "tail"]])

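With that change, the number of predictions will match the number of rows in test.csv.

Also note that the vectorizer should not be refit on the test data. Call cv.transform(test_data_array) instead of cv.fit_transform(test_data_array), so the test features are built from the vocabulary learned on the training data; otherwise the feature columns will not line up with the ones the model was trained on.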
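As a rough illustration, here is a minimal, self-contained sketch of the corrected flow. It makes a few assumptions that are not in the question: the two small DataFrames are made-up stand-ins for data.csv and test.csv, and rows_to_text is a hypothetical helper that joins the feature columns into one string per row (the question uses str() on NumPy rows instead). It also applies the fitted vectorizer with cv.transform instead of refitting it with cv.fit_transform, so that the test features share the training vocabulary.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

FEATURES = ["message", "fingers", "tail"]


def rows_to_text(frame):
    # Hypothetical helper: join the feature columns of each row into one string.
    return frame[FEATURES].astype(str).apply(lambda row: " ".join(row), axis=1).tolist()


# Made-up stand-in for data.csv (training data).
train_df = pd.DataFrame({
    "message": [
        "pluvia arbor aquos",
        "cosmix xeno nebuz odbitaz",
        "solarix glixx novum galaxum quasar",
        "arbor insectus pesros ekos dootix nimbus",
    ],
    "fingers": [4, 5, 5, 2],
    "tail": ["no", "yes", "yes", "yes"],
    "species": ["Aquari", "Zorblax", "Zorblax", "Florian"],
})

# Made-up stand-in for test.csv.
test_df = pd.DataFrame({
    "message": ["pluvia aquos", "cosmix quasar galaxum"],
    "fingers": [4, 5],
    "tail": ["no", "yes"],
})

le = LabelEncoder()
y_train = le.fit_transform(train_df["species"])

cv = CountVectorizer()
X_train = cv.fit_transform(rows_to_text(train_df))  # learn the vocabulary on training data only

model = MultinomialNB()
model.fit(X_train, y_train)

# Build the test features from test_df (not train_df), reusing the fitted vocabulary.
X_test = cv.transform(rows_to_text(test_df))
y_prediction = le.inverse_transform(model.predict(X_test))

print(len(y_prediction), "predictions for", len(test_df), "test rows")
print(y_prediction)
```

The point of the sketch is that every object fitted on the training data (le, cv, model) is only applied to the test data, never refitted, and that the test features come from the test DataFrame.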

Remember to also compare y_prediction to test_data['species'] to see how well the model actually does on the test set.
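One possible way to do that comparison is sketched below, using scikit-learn's accuracy_score. The two arrays are made-up placeholders standing in for test_data['species'] and y_prediction, so the numbers say nothing about the real model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Placeholder values; in the real script these would be
# test_data["species"].to_numpy() and the y_prediction array from the model.
y_true = np.array(["Aquari", "Zorblax", "Florian", "Zorblax"])
y_prediction = np.array(["Aquari", "Zorblax", "Zorblax", "Zorblax"])

# Fraction of test rows where the predicted species equals the true species.
accuracy = accuracy_score(y_true, y_prediction)

# Positions of the rows the model got wrong, handy for inspecting mistakes.
mismatches = np.flatnonzero(y_true != y_prediction)

print(f"accuracy: {accuracy:.2f}")
print("mismatched rows:", mismatches.tolist())
```

This only works if test.csv contains a species column; if it does not, the predictions can be inspected but not scored.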