I'm having difficulty understanding how word stemming works in R.
In my example, I created the following corpus object:
a <- Corpus(VectorSource("device so much more funand unlike most android torrent download clients"))
So a contains:
a[[1]]$content
[1] "device so much more funand unlike most android torrent download clients"
The first word in this string is "device". I created my term-document matrix with stemming enabled:
b <- TermDocumentMatrix(a, control = list(stemming = TRUE))
and got this as output:
dimnames(b)$Terms
[1] "android" "client" "devic" "download" "funand" "more" "most" "much" "torrent"
[10] "unlik"
What I'd like to know is why I lost the "e" in "device" and "unlike" but did not lose it in "more".
How can I prevent this from happening for these words and for others?
Thanks.
Another option is to use the MorphAdorner lemmatizer at Northwestern University. This answer has the code for the lemmatize(...) function.
library(tm)
a <- Corpus(VectorSource("device so much more funand unlike most android torrent download clients"))
words <- Terms(TermDocumentMatrix(a))
# lemmatize() is not part of tm; define it first using the code from the linked answer
lemmatize(words)
# android clients device download funand more most much torrent unlike
# "android" "client" "device" "download" "funand" "more" "most" "much" "torrent" "unlike"
As you can see, it removes the "s" from "clients" but not the "e" from "device".
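As for why the stemmer treats "device" and "more" differently: as far as I understand it, this is by design rather than a bug. The Porter algorithm only strips a trailing "e" when the remaining stem is long enough (in Porter's terms, measure greater than 1, or measure 1 and not ending in a short consonant-vowel-consonant pattern). "devic" and "unlik" pass that test, while "mor" does not. Here is a minimal sketch to see this directly. It assumes the SnowballC package is installed (I believe tm relies on it for stemming), and the choice of language = "porter" is my assumption, not something taken from the question.

```r
library(SnowballC)

# Words from the question whose endings were (or were not) changed
words <- c("device", "unlike", "more", "clients")

# Apply the Porter stemmer to each word
stems <- wordStem(words, language = "porter")

# Show each original word next to its stem
print(setNames(stems, words))
# stems: "devic" "unlik" "more" "client"
```

So if you need real dictionary words in the matrix, either leave stemming off (as in the TermDocumentMatrix(a) call above) or lemmatize instead of stemming.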