I am trying to build a model to predict "species" from data with the features "message", "fingers", and "tail" and the label "species" (see the first few rows of data.csv below):
message | fingers | tail | species |
---|---|---|---|
pluvia arbor aquos | 4 | no | Aquari |
cosmix xeno nebuz odbitaz | 5 | yes | Zorblax |
solarix glixx novum galaxum quasar | 5 | yes | Zorblax |
arbor insectus pesros ekos dootix nimbus | 2 | yes | Florian |
My code is:
import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
df = pd.read_csv("data.csv")
X = np.asarray(df[["message", "fingers", "tail"]])
X = [str(item) for item in X]
y = df["species"]
le = LabelEncoder()
y = le.fit_transform(y)
cv = CountVectorizer()
X = cv.fit_transform(X).toarray()
model = MultinomialNB()
model.fit(X, y)
test_data = pd.read_csv('test.csv')
test_data_array = np.asarray(df[["message", "fingers", "tail"]])
test_data_array = [str(item) for item in test_data_array]
test_data_array = cv.fit_transform(test_data_array).toarray()
y_prediction = model.predict(test_data_array)
y_prediction = le.inverse_transform(y_prediction)
print(y_prediction)
I followed this tutorial for the same.
The problem is, when I tried running it, it just outputs the species column of the original training data word-for-word apart from a few differences (there are 493 results while the test data consisted of 299 entries, and the training data consisted of 500 entries). It doesn't actually predict anything for the test data. I don't understand why the code won't work. Could someone help out?
The problem is that you read the test data into `test_data`, but then use the original DataFrame, `df`, which contains the training data, to build the test set.
Change this line:
test_data_array = np.asarray(df[["message", "fingers", "tail"]])
To:
test_data_array = np.asarray(test_data[["message", "fingers", "tail"]])
And you should have the correct number of predictions.
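One more thing is worth checking once the test frame is actually used: the question's code calls `cv.fit_transform` on the test data, which relearns the vocabulary from the test set, so the feature columns may no longer line up with the ones the model was trained on. Below is a minimal sketch of the usual pattern, fitting the vectorizer on the training data and calling only `transform` on the test data. The two small DataFrames are made-up stand-ins for data.csv and test.csv (the test rows are invented for illustration), and the names `train`, `test`, and `cols` are assumptions, not part of the original script.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for data.csv and test.csv
train = pd.DataFrame({
    "message": ["pluvia arbor aquos",
                "cosmix xeno nebuz odbitaz",
                "solarix glixx novum galaxum quasar",
                "arbor insectus pesros ekos dootix nimbus"],
    "fingers": [4, 5, 5, 2],
    "tail": ["no", "yes", "yes", "yes"],
    "species": ["Aquari", "Zorblax", "Zorblax", "Florian"],
})
test = pd.DataFrame({
    "message": ["pluvia aquos", "cosmix quasar nebuz"],
    "fingers": [4, 5],
    "tail": ["no", "yes"],
})

cols = ["message", "fingers", "tail"]

# Same row-to-string conversion as in the question
X_train_text = [str(row) for row in np.asarray(train[cols])]
X_test_text = [str(row) for row in np.asarray(test[cols])]

le = LabelEncoder()
y_train = le.fit_transform(train["species"])

cv = CountVectorizer()
X_train = cv.fit_transform(X_train_text)  # learn the vocabulary from the training data only
X_test = cv.transform(X_test_text)        # reuse that vocabulary for the test data

model = MultinomialNB()
model.fit(X_train, y_train)

y_prediction = le.inverse_transform(model.predict(X_test))
print(len(y_prediction), y_prediction)
```

With `transform`, the test matrix has the same number of columns as the training matrix, and there is one prediction per test row.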
Remember to also compare `y_prediction` to `test_data['species']`.
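As a rough sketch of that comparison, assuming test.csv has a "species" column with the true labels: `accuracy_score` from scikit-learn gives the fraction of matching predictions, and a boolean filter shows the rows that disagree. The values below are invented placeholders for what `test_data` and `y_prediction` would hold in the real script.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins: in the real script these come from test.csv and model.predict
test_data = pd.DataFrame({"species": ["Aquari", "Zorblax", "Florian", "Zorblax"]})
y_prediction = ["Aquari", "Zorblax", "Zorblax", "Zorblax"]

# Fraction of test rows where the predicted species equals the true species
accuracy = accuracy_score(test_data["species"], y_prediction)
print(f"accuracy: {accuracy:.2f}")

# Rows where the prediction disagrees with the label
comparison = test_data.assign(predicted=y_prediction)
mismatches = comparison[comparison["species"] != comparison["predicted"]]
print(mismatches)
```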