r tm corpus qdap

Combining words in tm (R) is not achieving the desired result


I am trying to combine a few words so that they count as one. In this example I want val and valuati to be counted as valuation.

The code I have been using to try and do this is below:

#load in package
library(tm)

replaceWords <- function(x, from, keep){
  regex_pat <- paste(from, collapse = "|")
  gsub(regex_pat, keep, x)
}


oldwords <- c("val", "valuati")
newword  <- c("valuation")

TextDoc2 <- tm_map(TextDoc, replaceWords, from=oldwords, keep=newword)

However, this does not work as expected. Any time val appears inside a word, it is replaced with valuation. For example, equivalent becomes equivaluation. How do I get around this and achieve my desired result?


Solution

  • Try this function -

    replaceWords <- function(x, from, keep){
      regex_pat <- sprintf('\\b(%s)\\b', paste(from, collapse = '|'))
      gsub(regex_pat, keep, x)
    }
    

    The pattern val also matches inside equivalent. Adding word boundaries (\\b in an R string) stops that from happening, so only the standalone words are replaced.

    grepl('val', 'equivalent')
    #[1] TRUE
    grepl('\\bval\\b', 'equivalent')
    #[1] FALSE
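
To see the effect on whole strings rather than single grepl() checks, here is a minimal sketch that runs the same function on a plain character vector using base R only. The sample_text vector is made up for illustration and is not from the question.

```r
# Same function as in the answer: wrap the alternatives in word boundaries.
replaceWords <- function(x, from, keep) {
  regex_pat <- sprintf('\\b(%s)\\b', paste(from, collapse = '|'))
  gsub(regex_pat, keep, x)
}

oldwords <- c("val", "valuati")
newword  <- "valuation"

# Hypothetical sample text (an assumption, not from the original post).
sample_text <- c("the val is equivalent", "valuati report", "valuation done")

result <- replaceWords(sample_text, from = oldwords, keep = newword)
print(result)
#[1] "the valuation is equivalent" "valuation report"
#[3] "valuation done"
```

When applying this to a tm corpus, a custom function like this is usually wrapped in content_transformer(), for example tm_map(TextDoc, content_transformer(replaceWords), from = oldwords, keep = newword). That assumes TextDoc is a tm corpus, as in the question.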