As a fun side project, I wrote a model to classify text as "hate speech" or "not hate speech". However, when it finished training it scored an accuracy of 0.00, and I can't think of an explanation. I had around 15,000 examples of non-toxic language and 15,000 examples of toxic language. That's not a massive dataset, but I thought it would be enough for a reasonably effective model. Here is the code I used to train the model. For reference, my labels NDArray has 30,000 entries, with the first 15,000 being 1s and the last 15,000 being 0s, so that it lines up with my array of samples.
import tensorflow as tf
from clean import data
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
texts, labels = data()
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
vocab_size = len(tokenizer.word_index) + 1
max_seq_length = max(len(seq) for seq in sequences)
print(max_seq_length)
sequences = pad_sequences(sequences, maxlen=max_seq_length, padding='post')
labels = to_categorical(labels)
split_index = int(0.8 * len(sequences))
x_train, x_test = sequences[:split_index], sequences[split_index:]
y_train, y_test = labels[:split_index], labels[split_index:]
model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                    output_dim=64, input_length=max_seq_length))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1)
model.save("model_save")
loss, accuracy = model.evaluate(x_test, y_test)
print("Test loss:", loss)
print("Test accuracy:", accuracy)
Any advice or help would be appreciated, thank you.
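If I understand your data correctly, the problem may simply be the order of the labels in your labels array. Because the first 15,000 labels are 1s and the last 15,000 are 0s, splitting by index puts only 0s into the last 20% used for testing, and validation_split=0.1 also takes the last 10% of the training data before any shuffling, which is again all 0s. The samples should be shuffled, together with their labels, so the model is not biased toward one label. Which shuffle method to use depends on your data type (tf.data.Dataset, TensorFlow tensor, or NumPy array). For example, if your data is a TensorFlow tensor: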
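# Build one random permutation of the row indices...
order = tf.random.shuffle(tf.range(tf.shape(x_train)[0]))
# ...and apply it to both arrays so each sample keeps its own label.
x_train = tf.gather(x_train, order)
y_train = tf.gather(y_train, order)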
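Note that shuffling only x_train and y_train leaves the test slice as it was, so with labels ordered as in the question the test set would still contain only 0s. For NumPy arrays, which is what pad_sequences and to_categorical return, a minimal sketch of shuffling the full arrays before the split could look like the following. The helper name shuffle_and_split, its parameters, and the toy data are made up for illustration and are not part of the question's code.

```python
import numpy as np


def shuffle_and_split(samples, labels, train_fraction=0.8, seed=0):
    """Shuffle samples and labels with one shared permutation, then split.

    Hypothetical helper for illustration; not part of Keras or TensorFlow.
    """
    samples = np.asarray(samples)
    labels = np.asarray(labels)
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")

    rng = np.random.default_rng(seed)
    # A single permutation applied to both arrays keeps each pair aligned.
    order = rng.permutation(len(samples))
    samples, labels = samples[order], labels[order]

    split_index = int(train_fraction * len(samples))
    return (samples[:split_index], labels[:split_index],
            samples[split_index:], labels[split_index:])


# Toy stand-in for the question's data: first half labelled 1, second half 0.
toy_samples = np.arange(20).reshape(10, 2)
toy_labels = np.array([1] * 5 + [0] * 5)

x_train, y_train, x_test, y_test = shuffle_and_split(toy_samples, toy_labels)
print("train labels:", y_train)
print("test labels:", y_test)
```

In the question's script this would be called on sequences and labels after padding and one-hot encoding, in place of the plain index split. If scikit-learn is available, train_test_split with its stratify argument is another option that also keeps the class balance equal in both parts.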