The question below is an extension of this question.
I have example data as follows:
library(data.table)
example_dat <- fread("var_nam description
some_var this_is_some_var_kg
other_var this_is_meters_for_another_var
extra_var the_price_of_apples
another_var cost_of_goods_sold")
example_dat$description <- gsub("_", " ", example_dat$description)
       var_nam                    description
1:    some_var            this is some var kg
2:   other_var this is meters for another var
3:   extra_var            the price of apples
4: another_var             cost of goods sold
vector_of_units <- c("kg", "meters", "var")
I first asked how to create a separate column in this data that holds whichever units from a given vector (vector_of_units) appear in the description. One option is this answer by maydin, which returns all matches:
library(tidyverse)
setDT(example_dat)[, unit := unlist(lapply(example_dat$description,function(x)
paste0(vector_of_units[str_detect(x,vector_of_units)],
collapse = ",")))]
       var_nam                    description       unit
1:    some_var            this is some var kg     kg,var
2:   other_var this is meters for another var meters,var
3:   extra_var            the price of apples
4: another_var             cost of goods sold
I also found this answer by langtang, which returns only the first match (which is actually preferable in my situation):
example_dat[, unit:=stringr::str_extract(description, paste0(vector_of_units,collapse = "|"))]
       var_nam                    description   unit
1:    some_var            this is some var kg    var
2:   other_var this is meters for another var meters
3:   extra_var            the price of apples   <NA>
4: another_var             cost of goods sold   <NA>
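Note that "first" here means the leftmost match within each description, not the first entry of vector_of_units, which is why row 1 yields var rather than kg. The following is a minimal sketch of the difference, assuming stringr is installed; the names desc, leftmost and first_by_priority are only for illustration.

```r
library(stringr)

vector_of_units <- c("kg", "meters", "var")
desc <- c("this is some var kg", "the price of apples")

# One alternation pattern: "kg|meters|var"
pattern <- paste0(vector_of_units, collapse = "|")

# str_extract() returns the leftmost match in each string, or NA if none.
leftmost <- str_extract(desc, pattern)

# To follow the order of vector_of_units instead, test every unit against
# one description at a time and keep the first unit that is detected.
first_by_priority <- vapply(desc, function(x) {
  hits <- vector_of_units[str_detect(x, vector_of_units)]
  if (length(hits)) hits[1] else NA_character_
}, character(1), USE.NAMES = FALSE)
```

With these inputs, leftmost holds "var" for the first description, while first_by_priority holds "kg"; both are NA for the second.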
Based on vector of strings extract string from data.table column into new column
I would however like to have a little more flexibility.
Firstly, I would like to supply a vector of patterns to match and, separately, a vector of values to paste, so that I can change the hits into something else:
vector_of_units_in <- c("kg", "meters", "var")
vector_of_units_out <- c("kilogram", "meters", "variable")
vector_of_units_euro <- c("cost", "price")
vector_of_units_euro_out <- "euro"
Secondly, I would like to be able to choose what happens when there is no hit. For example, when applying the solution by langtang a second time, I do not want it to overwrite the values set by the first pass with NA.
I have been trying to mess around with langtang's solution:
setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_in)), paste0(vector_of_units_out, collapse = "|"), NA)]
# NA has been replaced by unit, so that it is not overwritten in case of no match
setDT(example_dat)[, unit := ifelse(!is.na(stringr::str_extract(description, vector_of_units_euro)), paste0(vector_of_units_euro_out, collapse = "|"), unit)]
But I end up with this:
       var_nam                    description                     unit
1:    some_var            this is some var kg kilogram|meters|variable
2:   other_var this is meters for another var kilogram|meters|variable
3:   extra_var            the price of apples                     <NA>
4: another_var             cost of goods sold                     <NA>
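Part of the reason for this result can be shown in isolation: paste0(..., collapse = "|") always produces one single string, so ifelse() has nothing else to return for a hit. The sketch below uses a hand-written logical vector hit (an assumption standing in for the str_extract() test) to show just that step. Separately, passing a vector of patterns to str_extract() pairs patterns with descriptions element by element rather than testing every unit against every row; depending on the stringr version, mismatched lengths may be recycled or rejected.

```r
vector_of_units_out <- c("kilogram", "meters", "variable")

# collapse = "|" glues all replacement labels into one string,
# independent of which unit actually matched.
label <- paste0(vector_of_units_out, collapse = "|")

# Stand-in for the per-row match test (assumed values for illustration).
hit <- c(TRUE, TRUE, FALSE, FALSE)

# ifelse() recycles the single label across every row that had a hit.
result <- ifelse(hit, label, NA)
```

Here result is the full "kilogram|meters|variable" string for the first two rows and NA for the rest, which mirrors the table above.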
How should I write this syntax? Desired output:
       var_nam                    description     unit
1:    some_var            this is some var kg kilogram
2:   other_var this is meters for another var   meters
3:   extra_var            the price of apples     euro
4: another_var             cost of goods sold     euro
You could use a named units vector and pass a vectorized grepl (via Vectorize) to outer. To handle the case where no match is found, we return NA.
# Names are the output labels; values are the patterns to search for.
units <- c(kilogram="kg", meters="meters", euro="cost", euro="price", variable="var")
# outer() builds a units-by-rows logical matrix; apply() then collects, per row,
# the names of all matching units, or NA if nothing matched.
dat[, unit:=apply(outer(units, description, Vectorize(grepl)), 2, \(x)
  if (any(x)) names(which(x)) else NA)]
dat
#        var_nam                    description              unit
# 1:    some_var            this is some var kg kilogram,variable
# 2:    some_var               this is some var                NA
# 3:   other_var this is meters for another var   meters,variable
# 4:   extra_var            the price of apples              euro
# 5: another_var             cost of goods sold              euro
Data:
library(data.table)
dat <- data.table(var_nam = c("some_var", "some_var", "other_var",
"extra_var", "another_var"), description = c("this is some var kg",
"this is some var", "this is meters for another var", "the price of apples",
"cost of goods sold"))
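The approach above returns every matching label per row. If only one label per row is wanted, chosen by the order of the input vector, and values from an earlier pass should survive a pass with no hit, one possible sketch is the following. The helper relabel_by_priority is hypothetical (not part of data.table or stringr), and it assumes plain substring matching via grepl(fixed = TRUE), so "kg" would also match inside a longer word.

```r
library(data.table)

example_dat <- data.table(
  var_nam = c("some_var", "other_var", "extra_var", "another_var"),
  description = c("this is some var kg", "this is meters for another var",
                  "the price of apples", "cost of goods sold")
)

# Hypothetical helper: label each description with the output value of the
# first unit (in the order of units_in) that occurs in it, and keep the
# value in `current` wherever no unit matches.
relabel_by_priority <- function(description, units_in, units_out, current) {
  # Recycle a single output label (e.g. "euro") across all input patterns.
  units_out <- rep_len(units_out, length(units_in))
  new <- rep(NA_character_, length(description))
  for (i in seq_along(units_in)) {
    # Only fill rows that no earlier, higher-priority unit has claimed.
    hit <- is.na(new) & grepl(units_in[i], description, fixed = TRUE)
    new[hit] <- units_out[i]
  }
  fifelse(is.na(new), current, new)
}

example_dat[, unit := NA_character_]
example_dat[, unit := relabel_by_priority(description,
  c("kg", "meters", "var"), c("kilogram", "meters", "variable"), unit)]
example_dat[, unit := relabel_by_priority(description,
  c("cost", "price"), "euro", unit)]
```

On the four example rows this gives kilogram, meters, euro and euro. The second call leaves the first two rows alone because neither "cost" nor "price" occurs in them.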