openrefinegrel

OPenRefine GREL removing words if present in another column


I have a big list of medication names and want to remove the words that are present in column 'delete'. I've tried multiple different methods and non have worked.

enter image description here


Solution

  • There are several approaches to solve this in OpenRefine. For all of them I am assuming you are in a transform dialog on the column where you want to remove the words. I am also assuming that you only want to remove one word, as shown in your example data.

    Option 1: replace

    value.replace(row.cells["delete"].value, "")
    

    This approach does not work, if the words in delete may be part of a word. For example "TAB" would also be replaced in "AMATABBON". You might use regular expressions for that. But afaik GREL does not have a way to dynamically create regular expressions (yet).

    Option 2: Jython

    import re
    regex = re.compile(r'\b'+re.escape(row.cells["delete"].value)+r'\b')
    return regex.sub("", value)
    

    Option 3: filter

    filter(value.split(" "), v, v != row.cells["delete"].value).join(" ")
    

    This expression will split the words on the separator whitespace " ", remove words that are in the corresponding row in column delete and then join the remaining words back together on whitespace " ". You may also expand this to consider multiple words in columns delete:

    with(
      row.cells["delete"].value.split(" "),
      forbiddenWords,
      filter(value.split(" "), v, not(forbiddenWords.inArray(v))).join(" ")
    )