I am trying to create a word cloud in R, and it is going well so far except for one small problem: some words/symbols that I cannot account for are also being displayed in my word cloud, and I can't seem to remove them.
These words/symbols are not part of the original text. I don't know why or how they end up in the word cloud, and I would really like help understanding why they are there and how to remove them. Below, I am attaching a screenshot of my word cloud for clarity, with blue arrows showing where those words/symbols appear. I am also attaching my code and the text I used to create the word cloud. Any help is much appreciated, and many thanks.
the_txt <- '
- The wealthiest country\n
- The highest proportion of wealthy population (population aged 40-49)\n
- The highest numbers of "rich business men and women" and "rich soil and land"\n
- The country with the highest "employed populstion" and "self employed" numbers
'
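# Packages this script relies on
library(tm)            # Corpus, tm_map, removeWords, stemDocument
library(magrittr)      # %>% pipe
library(RWeka)         # NGramTokenizer, Weka_control
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal
mydata <- Corpus(VectorSource(the_txt))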
mydata <- mydata %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
mydata <- tm_map(mydata, content_transformer(tolower))
mydata <- tm_map(mydata, removeNumbers)
mydata <- tm_map(mydata, removeWords, stopwords("english"))
mydata <- tm_map(mydata, stemDocument)
as.character(mydata[[1]])
minfreq_trigram<-1
token_delim <- " \\t\\r\\n.!?,;\"()"
tritoken <- NGramTokenizer(mydata, Weka_control(min = 1, max = 3, delimiters = token_delim))
three_word <- data.frame(table(tritoken))
sort_three <- three_word[order(three_word$Freq, decreasing=TRUE),]
set.seed(1234)
wordcloud(sort_three$tritoken, sort_three$Freq,
          random.order = FALSE, scale = c(3, 0.4),
          min.freq = minfreq_trigram,
          colors = brewer.pal(8, "Dark2"),
          max.words = 200)
> as.character(mydata)
[1] "wealthiest countri highest proport wealthi popul popul age highest number rich busi men women rich soil land countri highest employ populst self employ number"
[2] "list(language = \"en\")"
[3] "list()"
You checked mydata[[1]], which looks at only one part of mydata, but the rest of the object also has content, and you fed all of it into NGramTokenizer and ultimately into the word cloud.
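To illustrate why the extra strings show up, here is a minimal base R sketch. It assumes the corpus behaves like a plain list with one text slot and two metadata slots. The fake_corpus object and its slot names are hypothetical stand-ins chosen to mirror the output shown above, not a statement about tm's actual internals.

```r
# Hypothetical stand-in for a corpus: one text slot and two metadata slots.
fake_corpus <- list(
  content = "wealthiest countri highest proport",
  meta    = list(language = "en"),
  dmeta   = list()
)

# as.character() on a list deparses every element that is not already a
# single string, so the metadata slots become strings alongside the text.
flattened <- as.character(fake_corpus)
print(flattened)

# Taking only the text slot leaves the metadata behind.
text_only <- fake_corpus$content
print(text_only)
```

A tokenizer that coerces its input to character would see all three strings in the first case, which is consistent with the metadata turning up as tokens in the word cloud.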
Passing mydata[[1]] instead of mydata should work for you, and it is a straightforward approach. I think the recommended approach, though, is to use content(), i.e.
mycontent <- content(mydata)
to get the character vector out, and then pass mycontent to NGramTokenizer instead of mydata.
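As a rough sketch of the content() route, the snippet below builds a tiny corpus and counts words from the extracted text only. It assumes a tm version in which Corpus() on a VectorSource returns a simple corpus whose content() is a plain character vector. The toy text is made up for illustration, and a base R whitespace split stands in for NGramTokenizer so that the example does not need RWeka or Java.

```r
library(tm)

# A tiny stand-in corpus (made-up text, not the original data).
toy <- Corpus(VectorSource("rich soil rich land"))

# content() returns only the document text, without the metadata slots.
# Assumption: for this corpus type it is a character vector.
txt <- content(toy)

# Simple whitespace unigram count, used here in place of NGramTokenizer.
tokens <- strsplit(txt, " ", fixed = TRUE)[[1]]
freq <- table(tokens)
print(freq)
```

The resulting frequency table contains only words from the text, so nothing derived from the metadata can reach wordcloud().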