I'm analysing a column of words in my most_used_words dataframe, which contains 2180 words.
most_used_words
word times_used
<chr> <int>
1 people 70
2 news 69
3 fake 68
4 country 54
5 media 44
6 u.s 42
7 election 40
8 jobs 37
9 bad 36
10 democrats 35
# ... with 2,170 more rows
When I inner_join with the AFINN lexicon, only 364 of the 2180 words are scored. Is this because the other words in my dataframe don't appear in the AFINN lexicon? I'm afraid that, if that's the case, it may introduce bias into my analysis. Should I use a different lexicon? Is something else happening?
library(tidytext)
library(tidyverse)
afinn <- get_sentiments("afinn")
most_used_words %>%
inner_join(afinn)
word times_used score
<chr> <int> <int>
1 fake 68 -3
2 bad 36 -3
3 win 24 4
4 failing 21 -2
5 hard 20 -1
6 united 19 1
7 illegal 17 -3
8 cuts 15 -1
9 badly 13 -3
10 strange 13 -1
# ... with 354 more rows
"Is this because the words in the AFINN lexicon don't appear in my dataframe?"
Yes.
An inner join will only return matching rows (words) from each data.frame.
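To see what the join is dropping, you can pair inner_join with anti_join and measure coverage by usage rather than by distinct words. This is only a sketch: it uses the ten rows shown in the question, and toy_lexicon is a hypothetical two-word stand-in for AFINN so the example runs without downloading the lexicon.

```r
library(dplyr)

# First ten rows of most_used_words, copied from the question
most_used_words <- tibble(
  word = c("people", "news", "fake", "country", "media",
           "u.s", "election", "jobs", "bad", "democrats"),
  times_used = c(70L, 69L, 68L, 54L, 44L, 42L, 40L, 37L, 36L, 35L)
)

# Hypothetical stand-in for the AFINN lexicon (assumption: two entries only,
# with the scores shown in the question's output)
toy_lexicon <- tibble(word = c("fake", "bad"), score = c(-3L, -3L))

# Words that have a score in the lexicon
scored <- most_used_words %>% inner_join(toy_lexicon, by = "word")

# Words that inner_join silently drops
unscored <- most_used_words %>% anti_join(toy_lexicon, by = "word")

# Share of all word occurrences that received a score
coverage <- sum(scored$times_used) / sum(most_used_words$times_used)
```

With your real data, replace toy_lexicon with afinn and inspect unscored to judge whether the dropped words are mostly neutral nouns or words that carry sentiment the lexicon misses.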
You can try a different lexicon, sure, but that might not help you with nouns. A noun identifies a person, animal, place, thing, or idea. In your example above, "u.s", "people", "country", "news", and "democrats" are all nouns that don't exist in afinn. None of these has any sentiment without context. Welcome to the world of text analysis.
However, based on the output displayed from your analysis, I think you can conclude that the sentiment of your column of words is overwhelmingly "negative". The word "fake" appears nearly twice as often as the next most used scored word, which is "bad".
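One way to put a number on that impression is to weight each score by how often the word is used. The sketch below uses only the ten scored rows visible in the question, so the figures are illustrative and not the result for all 364 matched words; scored_words and the summary column names are made up for the example.

```r
library(dplyr)

# The ten scored rows shown in the question's inner_join output
scored_words <- tibble(
  word = c("fake", "bad", "win", "failing", "hard",
           "united", "illegal", "cuts", "badly", "strange"),
  times_used = c(68L, 36L, 24L, 21L, 20L, 19L, 17L, 15L, 13L, 13L),
  score = c(-3L, -3L, 4L, -2L, -1L, 1L, -3L, -1L, -3L, -1L)
)

usage_weighted <- scored_words %>%
  summarise(
    # Sum of score * frequency across all scored words
    total_score = sum(times_used * score),
    # Average score per scored word occurrence
    mean_score = weighted.mean(score, times_used),
    # Fraction of scored occurrences that come from negative words
    share_negative = sum(times_used[score < 0]) / sum(times_used)
  )
```

For these ten rows total_score is -377 over 246 scored occurrences, which is consistent with the negative reading above.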
If you had complete sentences, you could gain context by using the sentimentr R package. Check it out:
install.packages("sentimentr")
library(sentimentr)
?sentiment
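As a rough sketch of what that could look like, assuming you still have the original text as full sentences (the text vector below is invented for illustration and is not from your data):

```r
library(sentimentr)

# Hypothetical example text; replace with your own full sentences
text <- c(
  "The news is fake and the media is bad.",
  "We will win the election."
)

# Split each element into sentences, then score them
sentences <- get_sentences(text)

# One row per sentence, with a numeric `sentiment` column
sentence_scores <- sentiment(sentences)

# One row per original element, with an averaged `ave_sentiment` column
element_scores <- sentiment_by(sentences)
```

Unlike a word-by-word join, sentiment() looks at neighbouring words such as negators and amplifiers, so "not bad" and "bad" are not treated the same.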
It will take more work than what you've done here and will produce richer results, but in the end the overall conclusion will likely be the same. Good luck.