r  nlp  tm  snowball

Stemming words with R


I'm having difficulty understanding how word stemming works in R.

In my example, I created the following corpus object:

a <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))

So the content of a is

a[[1]]$content

[1] "device so much more funand  unlike most android torrent download clients"

The first word in this string is "device". I then created my term-document matrix

b <- TermDocumentMatrix(a, control = list(stemming = TRUE)) 

and got this as output

dimnames(b)$Terms
[1] "android"  "client"   "devic"    "download" "funand"   "more"     "most"      "much"     "torrent" 
[10] "unlik"

What I'd like to know is why I lost the "e" in "device" and "unlike" but did not lose it in "more".

How can I prevent this from happening for these words and for others like them?

Thanks.


Solution

  • Another option is to use the MorphAdorner lemmatizer at Northwestern University. This answer has the code for the lemmatize(...) function.

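    library(tm)
    # lemmatize() is not part of tm; define it first using the code from the answer referenced above
    a     <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))
    words <- Terms(TermDocumentMatrix(a))   # no stemming requested, so the terms are kept whole
    lemmatize(words)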
    #    android    clients     device   download     funand       more       most       much    torrent     unlike 
    #  "android"   "client"   "device" "download"   "funand"     "more"     "most"     "much"  "torrent"   "unlike" 
    

    As you can see, it removes the "s" from "clients" but not the "e" from "device".
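
    As for why the stemmer drops the "e" in some words and not others, here is a minimal sketch that calls the stemmer directly. It assumes tm delegates stemming to the SnowballC package and that the English (Porter-family) stemmer is the one in effect; neither is stated in the question, so treat both as assumptions. Roughly, the final step of that algorithm removes a trailing "e" when the part of the word before it is long enough, and keeps it when the word is short and what remains would end in a consonant-vowel-consonant pattern such as "mor". That would explain "devic" and "unlik" versus "more".

    ```r
    # Assumption: SnowballC is installed and is the stemmer tm uses under the hood.
    library(SnowballC)

    # A few words taken from the question's example string
    words <- c("device", "unlike", "more", "clients")

    # Stem each word with the English Snowball stemmer
    stems <- wordStem(words, language = "english")

    # Label each stem with the word it came from and show the result
    names(stems) <- words
    print(stems)
    ```

    This should reproduce the stems shown in the question. A stemmer only applies suffix-stripping rules and does not promise to return dictionary words, which is why a lemmatizer is the better fit if you need "device" to stay "device".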