Wrapping a string extract function in an ifelse statement


The question below is an extension of this question.

Example data

I have example data as follows:

library(data.table)
example_dat <- fread("var_nam description
      some_var this_is_some_var_kg
      other_var this_is_meters_for_another_var
      extra_var the_price_of_apples
      another_var cost_of_goods_sold")
example_dat$description  <- gsub("_", " ", example_dat$description)

       var_nam                    description
1:    some_var            this is some var kg
2:   other_var this is meters for another var
3:   extra_var            the price of apples
4: another_var             cost of goods sold

vector_of_units <- c("kg", "meters", "var")

Previous solutions

I first asked how to create a separate column in this data that looks for certain units listed in a vector (vector_of_units). One option is this answer by maydin, which extracts all matches:

library(tidyverse)
setDT(example_dat)[, unit :=    unlist(lapply(example_dat$description,function(x) 
                    paste0(vector_of_units[str_detect(x,vector_of_units)],
                    collapse = ",")))]

       var_nam                    description       unit
1:    some_var            this is some var kg     kg,var
2:   other_var this is meters for another var meters,var
3:   extra_var            the price of apples
4: another_var             cost of goods sold

I also found this answer by langtang, which extracts only the first match (which is actually preferable in my situation):

example_dat[, unit:=stringr::str_extract(description, paste0(vector_of_units,collapse = "|"))]

       var_nam                    description   unit
1:    some_var            this is some var kg    var
2:   other_var this is meters for another var meters
3:   extra_var            the price of apples   <NA>
4: another_var             cost of goods sold   <NA>
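
Note that "first match" here means the match that starts earliest in the string, not the first entry of vector_of_units, because the regex alternation is tried from left to right along the text. A minimal sketch of that behaviour, reusing the vector defined above (the second input string is made up for illustration):

```r
library(stringr)

vector_of_units <- c("kg", "meters", "var")
pattern <- paste0(vector_of_units, collapse = "|")  # "kg|meters|var"

# "var" starts earlier in the string than "kg", so "var" is extracted,
# even though "kg" comes first in vector_of_units
first_hit <- str_extract("this is some var kg", pattern)

# Made-up string where "kg" appears before "var": now "kg" is extracted
second_hit <- str_extract("this is kg for some var", pattern)
```

This is why row 1 above ends up with var rather than kg.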

Previous question: "Based on vector of strings extract string from data.table column into new column"

More flexibility with an ifelse statement

I would, however, like a little more flexibility.

Firstly, I would like to supply a vector of matches and a vector for pasting separately, so that I can change the hits into something else:

vector_of_units_in <- c("kg", "meters", "var")
vector_of_units_out <- c("kilogram", "meters", "variable")

vector_of_units_euro <- c("cost", "price")
vector_of_units_euro_out <- "euro"

Secondly, I would like to be able to choose what happens when there is no hit. For example, when applying the solution by langtang a second time, I do not want it to overwrite the values already in unit with NA.

I have been trying to mess around with langtang's solution:

setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_in)), paste0(vector_of_units_out, collapse = "|"), NA)]

# NA has been replaced by unit, so that it is not overwritten in case of no match
setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_euro)), paste0(vector_of_units_euro_out, collapse = "|"), unit)]

But I end up with this:

       var_nam                    description                     unit
1:    some_var            this is some var kg kilogram|meters|variable
2:   other_var this is meters for another var kilogram|meters|variable
3:   extra_var            the price of apples                     <NA>
4: another_var             cost of goods sold                     <NA>

How should I write this syntax?

Desired output

       var_nam                    description     unit
1:    some_var            this is some var kg kilogram
2:   other_var this is meters for another var   meters
3:   extra_var            the price of apples     euro
4: another_var             cost of goods sold     euro

Solution

  • You could use a named units vector (the names are the labels to write, the values are the patterns to match) and pass a vectorized grepl to outer. Rows where no pattern matches get NA.

    units <- c(kilogram="kg", meters="meters", euro="cost", euro="price", variable='var')
    
    dat[, unit:=apply(outer(units, description, Vectorize(grepl)), 2, \(x) 
                      if (any(x)) names(which(x)) else NA)]
    dat
    #        var_nam                    description              unit
    # 1:    some_var            this is some var kg kilogram,variable
    # 2:    some_var                   this is some                NA
    # 3:   other_var this is meters for another var   meters,variable
    # 4:   extra_var            the price of apples              euro
    # 5: another_var             cost of goods sold              euro
    

    Data:

    library(data.table)
    # The second row contains no unit on purpose, to show the NA case
    dat <- data.table(
      var_nam = c("some_var", "some_var", "other_var", "extra_var", "another_var"),
      description = c("this is some var kg", "this is some",
                      "this is meters for another var", "the price of apples",
                      "cost of goods sold"))