nlp, sentiment-analysis, bert-language-model, data-preprocessing

Capitalized words in sentiment analysis


I'm currently working with customer reviews of Sephora products. My task is to classify them into sentiments: negative, neutral, positive. A common text-preprocessing step is to lowercase all words, but in this situation upper-case words like 'AMAZING' can carry significant emotion, and lowercasing everything could cause information loss. I would be happy to hear your opinion on this: should I still lowercase all the words? I'm personally thinking about creating more classes with a finer distinction between sentiments, such as 'good' and 'very good' rather than just 'positive', to capture the importance of these upper-case words.

This is my current code:

from itertools import chain

def is_upper_case(text):
    # Collect the fully upper-case words from one review, ignoring the pronoun 'I'
    return [word for word in text.split() if word.isupper() and word != 'I']

# all_reviews is a pandas DataFrame with the raw reviews in the 'review_text' column
unique_upper_words = set(chain.from_iterable(all_reviews['review_text'].apply(is_upper_case)))
print(unique_upper_words)

Solution

  • If you are using a BERT-based model (or any other pre-trained language model) for the actual classification, I would recommend not doing any preprocessing at all (at least when it comes to capitalization), as the cased checkpoints of these models were pre-trained on text that keeps its original capitalization (a minimal sketch of this setup follows after this answer).

    If you then want to do any further analysis on the resulting labeled sentences, you could lowercase everything to group n-grams together and simplify the analysis (also sketched below).

    If you are thinking about having multiple classes for a finer-grained distinction between predictions, I think it would make the most sense to switch to sentiment regression instead of classification, where you predict a value in a continuous range. This comes somewhat naturally to the fine-tuning of language models: in a normal classification setup you take a continuous output from the model and map it to categorical classes using something like softmax, so for your needs you can just skip that last step and use the model output directly (see the last sketch below). Many Python ML frameworks for fine-tuning or using language models have their own classes for regression tasks; check out this repository as an example.
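
A minimal sketch of the first point, assuming the Hugging Face transformers library and a cased BERT checkpoint; the model name, the three-label setup, and reusing the question's all_reviews DataFrame are my own assumptions, not something stated in the question:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A cased checkpoint keeps 'AMAZING' and 'amazing' as different tokens,
# so the raw review text is passed in without lowercasing.
model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Tokenize the raw reviews directly -- no lowercasing or other normalization.
encodings = tokenizer(
    all_reviews['review_text'].tolist(),
    truncation=True,
    padding=True,
    return_tensors='pt',
)
# The classification head above is randomly initialized; it still needs to be
# fine-tuned on labeled reviews (e.g. with transformers.Trainer) before its
# logits are meaningful.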
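
And a quick sketch of the lowercased n-gram grouping for the later analysis step, assuming scikit-learn is available; the column name is again taken from the question's code:

from sklearn.feature_extraction.text import CountVectorizer

# Lowercasing happens inside the vectorizer, so 'AMAZING' and 'amazing'
# are counted as the same unigram/bigram -- for the analysis step only.
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 2))
counts = vectorizer.fit_transform(all_reviews['review_text'])
print(vectorizer.get_feature_names_out()[:20])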
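
For the regression idea, a minimal sketch using the transformers sequence-classification head with a single continuous output; the example sentence and the assumption that you have continuous sentiment scores as training targets are mine:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 with problem_type='regression' gives one continuous output and
# trains with MSE loss, so there is no softmax / categorical mapping at the end.
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, problem_type='regression'
)

inputs = tokenizer('This palette is AMAZING!!!', return_tensors='pt')
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # raw continuous sentiment score
print(score)  # the head is untrained here, so the value is only meaningful after fine-tuning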