r, nlp, word-cloud

Unnecessary words included in the Word cloud created using R programming


I am trying to create a word cloud in R, which is going well so far except for one small problem. Some extra words and symbols are also displayed in my word cloud, and I can't seem to remove them.

These words and symbols are not part of the original text. I don't know why or how they end up in the word cloud, and I need help understanding why they are there and how to remove them. Below is a screenshot of my word cloud, with blue arrows marking where those words and symbols appear, followed by my code and the text I used to create the word cloud. Any help is much appreciated, and many thanks.

[Screenshot: word cloud with blue arrows marking the unwanted words]

    # Inner double quotes are escaped so that the string literal parses
    the_txt <- "
  - The wealthiest country\n
  - The highest proportion of wealthy population (population aged 40-49)\n
  - The highest numbers of \"rich business men and women\" and \"rich soil and land\"\n
  - The country with the highest \"employed populstion\" and \"self employed\" numbers

  "

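    library(tm)            # Corpus, tm_map, stopwords
    library(magrittr)      # %>%
    library(RWeka)         # NGramTokenizer, Weka_control
    library(wordcloud)     # wordcloud
    library(RColorBrewer)  # brewer.pal

    mydata <- Corpus(VectorSource(the_txt))

    # Clean the text: drop numbers and punctuation, collapse whitespace
    mydata <- mydata %>%
        tm_map(removeNumbers) %>%
        tm_map(removePunctuation) %>%
        tm_map(stripWhitespace)

    mydata <- tm_map(mydata, content_transformer(tolower))
    mydata <- tm_map(mydata, removeWords, stopwords("english"))
    mydata <- tm_map(mydata, stemDocument)

    # Inspect the first document only
    as.character(mydata[[1]])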

    minfreq_trigram <- 1

    token_delim <- " \\t\\r\\n.!?,;\"()"

    # Build 1- to 3-word n-grams from the whole corpus object
    tritoken <- NGramTokenizer(mydata, Weka_control(min = 1, max = 3, delimiters = token_delim))

    three_word <- data.frame(table(tritoken))
    sort_three <- three_word[order(three_word$Freq, decreasing = TRUE), ]

    set.seed(1234)

    wordcloud(sort_three$tritoken, sort_three$Freq,
              random.order = FALSE, scale = c(3, 0.4),
              min.freq = minfreq_trigram,
              colors = brewer.pal(8, "Dark2"),
              max.words = 200)

Solution

  • > as.character(mydata)
    [1] "wealthiest countri highest proport wealthi popul popul age highest number rich busi men women rich soil land countri highest employ populst self employ number"
    [2] "list(language = \"en\")"                                                                                                                                       
    [3] "list()" 
    

    You checked mydata[[1]], which looks at only one part of mydata. The rest of the object also has content (elements [2] and [3] above, which appear to be the corpus metadata), and you fed all of it into NGramTokenizer and ultimately the word cloud. Passing mydata[[1]] instead of mydata should, I think, work for you, and it is a straightforward approach. I think the recommended approach, though, is to use content()

    i.e.

    mycontent <- content(mydata)
    

    to get the character vector out, and then pass mycontent to NGramTokenizer in place of mydata.
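
To see why the stray tokens appear, here is a minimal base-R sketch that does not use tm at all. The object `mock_corpus` and its slot names are hypothetical stand-ins, chosen only to resemble the three elements printed by `as.character(mydata)` above. It is an illustration of the mechanism under that assumption, not a description of tm's internals.

```r
# Hypothetical stand-in for the corpus object: one text slot plus two
# metadata slots (the names are assumptions for illustration only).
mock_corpus <- list(
  content = "wealthiest countri highest proport",
  meta    = list(language = "en"),
  dmeta   = list()
)

# Coercing the whole list to character deparses the nested metadata
# lists into strings, so they sit next to the real text.
flattened <- as.character(mock_corpus)
print(flattened)

# Extracting only the text slot leaves nothing but the document text.
text_only <- mock_corpus$content
print(text_only)
```

A tokenizer given `flattened` would see words such as "list", "language" and "en", while `text_only` holds only the document. With a real tm corpus, `content(mydata)` plays the role that `mock_corpus$content` plays here.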