I am pretty new to text classification with LSTMs.
I am trying to classify social media data into hate (1) and not-hate (0) using an LSTM without any pretrained word embeddings.
I did some pre-processing (removing stopwords, lowercasing, lemmatization, etc.), used tensorflow.keras.preprocessing.text.Tokenizer
for tokenization, and padded all entries to a length of 512 tokens.
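For reference, the tokenization and padding step looks roughly like this (simplified; texts is my list of cleaned posts):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
# pad/truncate every post to 512 tokens
padded = pad_sequences(sequences, maxlen=512, padding="post")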
My model is the following:
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout
from tensorflow.keras.models import Sequential
model = Sequential()
model.add(Embedding(512, 200))
model.add(LSTM(128, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The model summary is the following:
The classification report is the following:
Before training the model, I under-sampled the training data to get a balanced dataset; the test data remained unbalanced. Although precision, recall, and f1-score are good for the not-hate class, recall and f1-score are poor for hate speech.
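This is roughly how I under-sample the majority class to balance the training set (simplified; train_df is a pandas DataFrame with a "label" column):
import pandas as pd
hate = train_df[train_df["label"] == 1]
# randomly keep only as many not-hate rows as there are hate rows
not_hate = train_df[train_df["label"] == 0].sample(n=len(hate), random_state=42)
train_balanced = pd.concat([hate, not_hate]).sample(frac=1, random_state=42)  # shuffle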
As you can see from your results, your model can only identify about half of the hate sentences as such (recall=0.51). Thus, your model has problems separating not-hate from hate. Now the key question is: how can you help your model distinguish the two categories? In general, there are various approaches you can consider, namely 1) better data, 2) more data, 3) a larger model, and/or 4) hyperparameter tuning. Below you can find a more elaborate explanation of my recommendation (1) and some references to the others.
1) Better data: Without knowing more about your problem/your data, this is the approach I would recommend. Hate speech can often be quite subtle and implicit, so it is important to understand the context in which words are used. Although your code trains custom word embeddings, those won't be contextual: the embedding for the word dog will be exactly the same in dogs are awesome and in dogs are lame. Therefore, to improve your model's ability to separate the two categories, you can look into contextualized word embeddings, e.g. the embeddings BERT uses. Note that you usually don't train custom contextualized word embeddings but fine-tune existing ones. If you're interested in learning more about how to customize BERT using TensorFlow, please read the guide here.
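To give you an idea, here is a minimal sketch of fine-tuning a pretrained BERT classifier with the Hugging Face transformers library on TensorFlow (an alternative route to the guide linked above; train_texts and train_labels are placeholders for your own data):
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# BERT brings its own subword tokenizer, so no manual stopword removal or lemmatization is needed
encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=128, return_tensors="tf")
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])
model.fit(dict(encodings), tf.constant(train_labels), epochs=3, batch_size=16)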
2) More data: This one is quite self-explanatory. Models thrive on big data, and maybe your model simply hasn't seen enough of it (you haven't provided any information about dataset size; perhaps your dataset has 1,000 sentences while your model needs 100,000 to learn the relationship).
3) Larger model: Maybe you have enough data, but your model is not complex enough to capture the relationship. Following the takeaways of the Universal Approximation Theorem, a more complex model could help you capture the relationship that divides hate from not-hate; a sketch follows below. If you would like to learn more about this theorem, I found this lecture on YouTube very useful.
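For example, one way to add capacity within your current setup is to stack (bidirectional) LSTM layers. A sketch, where vocab_size should be the size of your Tokenizer vocabulary, e.g. len(tokenizer.word_index) + 1:
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout, Bidirectional
from tensorflow.keras.models import Sequential
model = Sequential()
model.add(Embedding(vocab_size, 200))
# stacked, bidirectional LSTMs can capture longer-range patterns in both directions
model.add(Bidirectional(LSTM(128, return_sequences=True)))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.2))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])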
4) Hyperparameter tuning: Maybe you have enough data and your model is of the right complexity, but your model configuration is wrong, e.g. your learning rate is too small, which is why your model is taking a very long time to learn the relationship. You can learn more about hyperparameter tuning in TensorFlow here.
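As a rough illustration, tuning the learning rate and LSTM size with Keras Tuner could look like this (assumes pip install keras-tuner; x_train/y_train and the vocabulary size are placeholders):
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(20000, 200),  # placeholder vocabulary size
        tf.keras.layers.LSTM(hp.Int("units", min_value=64, max_value=256, step=64)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    # sample the learning rate on a log scale
    lr = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2, sampling="log")
    model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=10)
tuner.search(x_train, y_train, validation_split=0.2, epochs=3)
best_model = tuner.get_best_models(num_models=1)[0]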