The following is a series of email subjects stored in a data frame, DF. Note that I imported this from an Excel sheet.
EmailSubject
Buy the stunning new phone
The game changer is here.
Experience a phone ahead of its time.
Thank You Chennai
Limited Period offer
Valentines day special
Buy a phone at 10000 and get a new sim free
Limited Period offer
Valentines day special
Buy a phone at 10000 and get a new sim free
Buy the stunning new phone
The game changer is here.
Experience a phone ahead of its time.
Thank You Chennai
Limited Period offer
Valentines day special
Buy a phone at 10000 and get a new sim free
Thank You Chennai
Limited Period offer
Valentines day special
Buy a phone at 10000 and get a new sim free
Buy a phone at 10000 and get a new sim free
Buy the stunning new phone
The game changer is here.
I have created a term document matrix in R with the following code:
library(tm)
mytext <- DF$EmailSubject
mycorpus <- Corpus(VectorSource(mytext))
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, removeNumbers)
# tolower is a base function, so wrap it to keep the corpus structure intact
mycorpus <- tm_map(mycorpus, content_transformer(tolower))
mycorpus <- tm_map(mycorpus, removeWords, stopwords("english"))
# Create a term document matrix
dtm <- TermDocumentMatrix(mycorpus)
m <- as.matrix(dtm)
# Total frequency of each term across all documents, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
head(d, 10)
This yields the following word frequency table:
word freq
get 45
free 44
edge 35
new 29
buy 24
charger 23
wireless 23
just 21
month 21
per 21
starting 21
stunning 21
pro 20
now 17
offers 17
gear 16
exclusive 15
offer 14
gift 13
irresistible 10
loved 10
valentine’s 10
I am getting a term document matrix. However, some words appear with special characters such as ’ only in the term document matrix; they aren't present in the original data frame. I have tried adjusting the encoding and have manually removed them with gsub. Is there a way to prevent the words from my Excel sheet from being processed with special characters?
d$word <- gsub("€™", "", d$word)
This is a manual method. Is there an automatic method? The encoding is UTF-8. Are there packages that help avoid this error?
This should help:
# x is the character vector of subjects (for example mytext), converted before the corpus is built
Encoding(x) <- "UTF-8"
x <- iconv(x, "UTF-8", "ASCII", sub = "")
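As a rough illustration of the same idea, here is a minimal base R sketch. It assumes the stray characters come from a curly apostrophe (U+2019) in the Excel text, which is a guess based on the "valentine" entry in the output. The vector subjects is a hypothetical stand-in for DF$EmailSubject. It first swaps the curly apostrophe for a plain one so the word stays readable, then drops whatever non-ASCII characters remain.

```r
# Hypothetical sample standing in for DF$EmailSubject;
# "\u2019" is the curly apostrophe that Excel often inserts
subjects <- c("Valentine\u2019s day special", "Limited Period offer")

# Replace the curly apostrophe with a plain ASCII apostrophe
clean <- gsub("\u2019", "'", subjects, fixed = TRUE)

# Drop any remaining characters that cannot be represented in ASCII
clean <- iconv(clean, from = "UTF-8", to = "ASCII", sub = "")

print(clean)
```

If this matches your data, the cleaned vector can be passed to VectorSource() in place of mytext, so removePunctuation sees an ordinary apostrophe.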