Im trying a sentiment analysis based approach on youtube comments, but the comments many times have words like mrbeast, tiger/'s, lion/'s, pewdiepie, james, etc which do not add any feeling in the sentence. I've gone through nltk's average_perception_tagger but it didn't work well as it gave the results as
my input:
"mrbeast james lion tigers bad sad clickbait fight nice good"
words that i need in my sentence:
"bad sad clickbait fight nice good"
what i got using average_perception_tagger:
[('mrbeast', 'NN'),
('james', 'NNS'),
('lion', 'JJ'),
('tigers', 'NNS'),
('bad', 'JJ'),
('sad', 'JJ'),
('clickbait', 'NN'),
('fight', 'NN'),
('nice', 'RB'),
('good', 'JJ')]
so as you can see if i remove mrbeast i.e NN the words like clickbait, fight will also get removed which than ultimately remove expressions from that sentence.
There are multiple ways of doing this like
you can create a set of positive and negative words and for each word in your grammar you can check if it exists in your set, if it does you should keep the word, else delete it. This however would first require all positive and negative words dataset.
you can use something like textblob which can give you the sentiment score of a word or a sentence. so with a cutoff sentiment score you can filter out the words that you don't need.