Tags: r, dataframe, gsub, term-document-matrix

Generic way to avoid special characters in R


The following is a series of email subjects stored in a data frame, DF. Note that I imported it from an Excel sheet.

  EmailSubject
 Buy the stunning new phone
 The game changer is here.
  Experience a phone ahead of its time.
  Thank You Chennai
   Limited Period offer
   Valentines day special
  Buy a phone at 10000 and get a new sim free
   Limited Period offer
  Valentines day special
  Buy a phone at 10000 and get a new sim free
  Buy the stunning new phone
  The game changer is here.
  Experience a phone ahead of its time.
  Thank You Chennai
   Limited Period offer
   Valentines day special
  Buy a phone at 10000 and get a new sim free
 Thank You Chennai
Limited Period offer
 Valentines day special
 Buy a phone at 10000 and get a new sim free
 Buy a phone at 10000 and get a new sim free
 Buy the stunning new phone
 The game changer is here.

I have created a term document matrix in R with the following code:

    library(tm)

    mytext   <- DF$EmailSubject
    mycorpus <- Corpus(VectorSource(mytext))
    mycorpus <- tm_map(mycorpus, removePunctuation)
    mycorpus <- tm_map(mycorpus, removeNumbers)
    # tolower is a base R function, so wrap it for use with tm_map
    mycorpus <- tm_map(mycorpus, content_transformer(tolower))
    mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))

    # Create a term document matrix
    dtm <- TermDocumentMatrix(mycorpus)
    m   <- as.matrix(dtm)
    v   <- sort(rowSums(m), decreasing = TRUE)
    d   <- data.frame(word = names(v), freq = v)
    head(d, 10)

This yields the following word frequencies from the term document matrix:

            word  freq
             get    45
            free    44
            edge    35
             new    29
             buy    24
         charger    23
        wireless    23
            just    21
           month    21
             per    21
        starting    21
        stunning    21
             pro    20
             now    17
          offers    17
            gear    16
       exclusive    15
           offer    14
            gift    13
    irresistible    10
           loved    10
     valentine’s    10

I am getting a term document matrix. However, some words appear with the special characters ’, and only in the term document matrix; they aren't present in the original data frame. I have tried adjusting the encoding and have removed the characters manually with gsub. Is there a way to prevent the words from my Excel sheet from being processed with special characters?

    gsub("€™", "", d$word)

This is a manual method. Is there an automatic one? The encoding is UTF-8. Are there packages that help avoid this error?


Solution

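  • This should help. Declare the encoding of the text, then convert it to ASCII, dropping any characters that cannot be converted. Here x is the character vector of subjects (for example DF$EmailSubject). Do this before building the corpus, since iconv works on character vectors rather than on the term document matrix itself:

    Encoding(x) <- "UTF-8"
    
    x <- iconv(x, "UTF-8", "ASCII", sub = "")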
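As a rough illustration of the conversion above, here is a minimal sketch in base R. The `subjects` vector and the `to_ascii` helper are made up for this example (they stand in for DF$EmailSubject and are not part of tm), and it assumes the subjects are English, so dropping every non-ASCII character is acceptable. It covers both a curly apostrophe (U+2019) in the source text and a string that has already been turned into ’.

```r
# Hypothetical sample standing in for DF$EmailSubject.
# "\u2019" is the curly apostrophe that Excel often inserts;
# "\u00e2\u20ac\u2122" is the same character after it has been mis-decoded.
subjects <- c(
  "Valentine\u2019s day special",
  "valentine\u00e2\u20ac\u2122s offer",
  "Limited Period offer"
)

# Hypothetical helper: convert UTF-8 text to ASCII and drop every
# character that has no ASCII equivalent (sub = "" means "replace with nothing").
to_ascii <- function(x) {
  iconv(enc2utf8(x), from = "UTF-8", to = "ASCII", sub = "")
}

clean_subjects <- to_ascii(subjects)
print(clean_subjects)

# The cleaned vector would then replace the raw text when building the corpus,
# e.g. Corpus(VectorSource(clean_subjects)) with the tm package loaded.
```

Note that this also strips accented letters and other legitimate non-ASCII text, so it is only a reasonable choice when the data is expected to be plain English. If the apostrophe should be kept, replacing it explicitly, for example with gsub("\u2019", "'", x, fixed = TRUE), before cleaning may be a better fit.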