
R: Replace Abbreviations with Words


I have tried to resolve this problem all day but without any improvement.

I am trying to replace the following abbreviations with the desired words in my dataset:

-Abbreviations: USA, H2O, Type 3, T3, bp

The input data is for example

The desired output is

I have tried the following code but without success:

   data= read.csv(C:"xxxxxxx, header= TRUE")
   lowercase= tolower(data$MESSAGE)
   dict=list("\\busa\\b"= "united states of america", "\\bh2o\\b"= 
   "water", "\\btype 3\\b|\\bt3\\"= "type 3 disease", "\\bbp\\b"= 
   "blood pressure")
   for(i in 1:length(dict1)){
   lowercasea= gsub(paste0("\\b", names(dict)[i], "\\b"), 
   dict[[i]], lowercase)}

I know that I am definitely doing something wrong. Could anyone guide me on this? Thank you in advance.


Solution

  • If you need to replace only whole words (e.g. bp in Some bp. but not in bpcatalogue), build a regular expression from the abbreviations and wrap it in word boundaries. Since some of the abbreviations span several words, also sort them by length in descending order; otherwise a shorter alternative (e.g. Type) could trigger a replacement before a longer one that starts with it (e.g. Type 3).

    An example code:

    # Lookup table: each abbreviation and the text that should replace it
    abbreviations <- c("USA", "H2O", "Type 3", "T3", "bp")
    desired_words <- c("United States of America", "Water", "Type 3 Disease", "Type 3 Disease", "blood pressure")
    df <- data.frame(abbreviations, desired_words, stringsAsFactors = FALSE)
    x <- 'Abbreviations: USA, H2O, Type 3, T3, bp'

    # Longest abbreviations first, so "Type 3" is tried before shorter alternatives
    sort.by.length.desc <- function(v) v[order(-nchar(v))]

    library(stringr)
    # Build the whole-word alternation and look up the replacement for each match
    str_replace_all(x,
        paste0("\\b(", paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b"),
        function(z) df$desired_words[df$abbreviations==z][[1]][1]
    )
    

    The paste0("\\b(", paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b") call creates a regex like \b(Type 3|USA|H2O|T3|bp)\b. It matches Type 3, USA, etc. as whole words only, because \b is a word boundary. When a match is found, stringr::str_replace_all passes it to the function, which returns the corresponding entry of desired_words.
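
To see why the sort by length matters, here is a small base R sketch. It assumes a hypothetical extra abbreviation, Type, which is not in the original list and is added only to make the effect visible. With perl = TRUE the alternatives are tried left to right, so whichever one is listed first wins.

```r
# Hypothetical input; "Type" is an illustrative extra abbreviation, not from the question
x <- "Patient has Type 3 and T3 markers"

unsorted <- "\\b(Type|Type 3|T3)\\b"   # shorter alternative listed first
sorted   <- "\\b(Type 3|Type|T3)\\b"   # longest alternative listed first

# Extract every match of each pattern (perl = TRUE: alternatives are tried in order)
hits_unsorted <- regmatches(x, gregexpr(unsorted, x, perl = TRUE))[[1]]
hits_sorted   <- regmatches(x, gregexpr(sorted, x, perl = TRUE))[[1]]

print(hits_unsorted)  # "Type"   "T3"
print(hits_sorted)    # "Type 3" "T3"
```

In the unsorted pattern, Type matches first and the trailing 3 is left behind, so the longer abbreviation would never be replaced as a unit.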

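
If you would rather stay with a gsub loop, as in the question, the following is a minimal base R sketch of the same idea. The helper name replace_abbreviations and the named vector dict are assumptions made for illustration. Compared with the attempt in the question, it loops over the dictionary that was actually defined, writes each result back to the same variable so replacements accumulate, and adds the word boundaries only once. The \Q...\E quoting (available with perl = TRUE) treats each abbreviation literally in case it contains regex metacharacters.

```r
# Named vector: names are the abbreviations, values are the replacements
dict <- c("USA"    = "United States of America",
          "H2O"    = "Water",
          "Type 3" = "Type 3 Disease",
          "T3"     = "Type 3 Disease",
          "bp"     = "blood pressure")

# Hypothetical helper: replace whole-word abbreviations, longest first
replace_abbreviations <- function(text, dict) {
  keys <- names(dict)[order(-nchar(names(dict)))]
  for (key in keys) {
    # \b ... \b restricts the match to whole words; \Q...\E quotes the key literally
    pattern <- paste0("\\b\\Q", key, "\\E\\b")
    # Assign back to text so each pass builds on the previous one
    text <- gsub(pattern, dict[[key]], text, perl = TRUE)
  }
  text
}

x <- "Abbreviations: USA, H2O, Type 3, T3, bp"
out <- replace_abbreviations(x, dict)
print(out)
# "Abbreviations: United States of America, Water, Type 3 Disease, Type 3 Disease, blood pressure"
```

Because the replacements are applied one pass at a time, a later pass can in principle match text produced by an earlier one. That does not happen with this dictionary, but the single-pass str_replace_all approach avoids the issue altogether. Matching here is case-sensitive; passing ignore.case = TRUE to gsub would relax that.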