rnlptmstemmingsnowball

stemDocment in tm package not working on past tense word


I have a file 'check_text.txt' that contains "said say says make made". I'd like to perform stemming on it to get "say say say make make". I tried to use stemDocument in tm package, as the following, but only get "said say say make made". Is there a way to perform stemming on past tense words? Is it necessary to do so in real-world natural language processing? Thanks!

filename = 'check_text.txt'
con <- file(filename, "rb")
text_data <- readLines(con,skipNul = TRUE)
close(con)
text_VS <- VectorSource(text_data)
text_corpus <- VCorpus(text_VS)
text_corpus <- tm_map(text_corpus, stemDocument, language = "english")
as.data.frame(text_corpus)$text

EDIT: I also tried wordStem in SnowballC package

> library(SnowballC)
> wordStem(c("said", "say", "says", "make", "made"))
[1] "said" "sai"  "sai"  "make" "made"

Solution

  • If there is a data set of irregular English verbs in a package, this task would be easy. I just do not know any packages with such data, so I chose to create my own database by scraping. I am not sure if this website covers all irregular words. If necessary, you want to search better websites to create your own database. Once you have your database, You can engage in your task.

    First, I used stemDocument() and clean up present forms with -s. Then, I collected past forms in words (i.e., past), infinitive forms of the past forms (i.e., inf1),identified the order of the past forms in temp. I further identified the positions of the past forms in temp. I finally replaced the sat forms with their infinitive forms. I repeated the same procedure for past participles.

    library(tm)
    library(rvest)
    library(dplyr)
    library(splitstackshape)
    
    
    ### Create a database
    x <- read_html("http://www.englishpage.com/irregularverbs/irregularverbs.html")
    
    x %>%
    html_table(header = TRUE) %>%
    bind_rows %>%
    rename(Past = `Simple Past`, PP = `Past Participle`) %>%
    filter(!Infinitive %in% LETTERS) %>%
    cSplit(splitCols = c("Past", "PP"),
           sep = " / ", direction = "long") %>%
    filter(complete.cases(.)) %>%
    mutate_each(funs(gsub(pattern = "\\s\\(.*\\)$|\\s\\[\\?\\]",
                          replacement = "",
                          x = .))) -> mydic
    
    ### Work on the task
    
    words <- c("said", "drawn", "say", "says", "make", "made", "done")
    
    ### says to say
    temp <- stemDocument(words)
    
    ### past forms become present form
    ### Collect past forms
    past <- mydic$Past[which(mydic$Past %in% temp)]
    
    ### Collect infinitive forms of past forms
    inf1 <- mydic$Infinitive[which(mydic$Past %in% temp)]
    
    ### Identify the order of past forms in temp
    ind <- match(temp, past)
    ind <- ind[is.na(ind) == FALSE]
    
    ### Where are the past forms in temp?
    position <- which(temp %in% past)
    
    temp[position] <- inf1[ind]
    
    ### Check
    temp
    #[1] "say"   "drawn" "say"   "say"   "make"  "make"  "done" 
    
    
    ### PP forms to infinitive forms (same as past forms)
    
    pp <- mydic$PP[which(mydic$PP %in% temp)]
    inf2 <- mydic$Infinitive[which(mydic$PP %in% temp)]
    ind <- match(temp, pp)
    ind <- ind[is.na(ind) == FALSE]
    position <- which(temp %in% pp)
    temp[position] <- inf2[ind]
    
    ### Check
    temp
    #[1] "say"  "draw" "say"  "say"  "make" "make" "do"