
Text-mining with the tm-package - word stemming


I am doing some text mining in R with the tm package. Everything works very smoothly. However, one problem occurs after stemming (http://en.wikipedia.org/wiki/Stemming): some words share the same stem, but it is important that they are not "thrown together", because they mean different things.

For an example, see the 4 texts below. Here you cannot use "lecturer" and "lecture" (or "association" and "associate") interchangeably. However, this is exactly what happens in step 4.

Is there an elegant way to handle such cases/words manually (e.g. so that "lecturer" and "lecture" are kept as two different things)?

library(tm)

texts <- c("i am member of the XYZ association",
"apply for our open associate position", 
"xyz memorial lecture takes place on wednesday", 
"vote for the most popular lecturer")

# Step 1: Create corpus
corpus <- Corpus(DataframeSource(data.frame(texts)))

# Step 2: Keep a copy of corpus to use later as a dictionary for stem completion
corpus.copy <- corpus

# Step 3: Stem words in the corpus
corpus.temp <- tm_map(corpus, stemDocument, language = "english")  

inspect(corpus.temp)

# Step 4: Complete the stems to their original form
corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)  

inspect(corpus.final)

Solution

  • I'm not 100% sure what you're after, and I don't totally get how tm_map works, but if I understand correctly you want to supply a list of words that should not be stemmed. If so, the following works. I'm using the qdap package, mostly because I'm lazy and it has a function, mgsub, that I like.

    Note that I got frustrated combining mgsub with tm_map, as it kept throwing an error, so I just used lapply instead.

    texts <- c("i am member of the XYZ association",
        "apply for our open associate position", 
        "xyz memorial lecture takes place on wednesday", 
        "vote for the most popular lecturer")
    
    library(tm)
    # Step 1: Create corpus
    corpus.copy <- corpus <- Corpus(DataframeSource(data.frame(texts)))
    
    library(qdap)
    # Step 2: list of words to retain and their identifier keys
    retain <- c("lecturer", "lecture")
    replace <- paste(seq_along(retain), "SPECIAL_WORD", sep="_")
    
    # Step 3: sub the words you want to retain with identifier keys
    corpus[seq_len(length(corpus))] <- lapply(corpus, mgsub, pattern=retain, replacement=replace)
    
    # Step 4: Stem it
    corpus.temp <- tm_map(corpus, stemDocument, language = "english")  
    
    # Step 5: reverse -> sub the identifier keys with the words you want to retain
    corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)
    
    inspect(corpus)       #inspect the pieces for the folks playing along at home
    inspect(corpus.copy)
    inspect(corpus.temp)
    
    # Step 6: complete the stem
    corpus.final <- tm_map(corpus.temp, stemCompletion, dictionary = corpus.copy)  
    inspect(corpus.final)
    

    Basically it works by:

    1. swapping each supplied "NO STEM" word for a unique identifier key (the first mgsub)
    2. stemming the corpus (stemDocument)
    3. reversing the swap, so the identifier keys become the "NO STEM" words again (the second mgsub)
    4. completing the stems (stemCompletion)

    Here's the output:

    ## >     inspect(corpus.final)
    ## A corpus with 4 text documents
    ## 
    ## The metadata consists of 2 tag-value pairs and a data frame
    ## Available tags are:
    ##   create_date creator 
    ## Available variables in the data frame are:
    ##   MetaID 
    ## 
    ## $`1`
    ## i am member of the XYZ associate
    ## 
    ## $`2`
    ##  for our open associate position
    ## 
    ## $`3`
    ## xyz memorial lecture takes place on wednesday
    ## 
    ## $`4`
    ## vote for the most popular lecturer
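
    If you would rather not depend on qdap, the same protect / stem / restore steps can be sketched in base R. This is only a sketch under a few assumptions: stem_with_exceptions and toy_stem are hypothetical names of mine, the retained words are plain alphabetic words, and toy_stem is a crude stand-in so the example runs without extra packages. In practice you would pass a real stemmer (e.g. SnowballC::wordStem) and check that the identifier keys survive it.

    ```r
    texts <- c("i am member of the XYZ association",
               "apply for our open associate position",
               "xyz memorial lecture takes place on wednesday",
               "vote for the most popular lecturer")

    # Hypothetical helper: stem every word except those listed in `retain`.
    # `stem_fun` takes a character vector of words and returns their stems.
    stem_with_exceptions <- function(x, retain, stem_fun) {
      # Same key format as above: "1_SPECIAL_WORD", "2_SPECIAL_WORD", ...
      keys <- paste(seq_along(retain), "SPECIAL_WORD", sep = "_")

      # Whole-word replacement; \\b keeps "lecture" from matching in "lecturer".
      swap <- function(txt, from, to) {
        for (i in seq_along(from)) {
          txt <- gsub(paste0("\\b", from[i], "\\b"), to[i], txt, perl = TRUE)
        }
        txt
      }

      # 1. protect the retained words with their keys
      protected <- swap(x, retain, keys)

      # 2. stem word by word, then glue each document back together
      stemmed <- vapply(strsplit(protected, "[[:space:]]+"),
                        function(words) paste(stem_fun(words), collapse = " "),
                        character(1))

      # 3. turn the keys back into the retained words
      swap(stemmed, keys, retain)
    }

    # Toy stand-in stemmer (NOT a real algorithm): strips a few common endings.
    toy_stem <- function(words) sub("(er|e|ion|s)$", "", words)

    out <- stem_with_exceptions(texts, c("lecturer", "lecture"), toy_stem)
    print(out)
    ```

    Because the swap matches on word boundaries, "lecture" is never replaced inside "lecturer", so the order of the words in retain does not matter. Adding "association" and "associate" to retain would protect those two in the same way.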