I would like to extract various parts of a string using regex patterns and capturing groups. I am able to filter the string using str_match_all
, but I would like to have the possibility to explicitely select one of the capturing groups, defined in the regex. The problem is that using that inside of a data.table does not yield the desired results.
dt.test <- data.table(file_names = c("20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv"
, "20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv" ))
I am able to extract the various defined capturing groups using the following command:
stringi::stri_match_all(dt.test[1,file_names],regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)")
[[1]]
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210" "20200131" "20210228" "PROD_TEST_MF_delta" "20210228" "20210107" "20210210"
However, when accessing the resulting list and trying that command inside of a data.table
it assigns the value of the first value for all rows of the data.table
, which is kind of unexpected.
dt.test[,.(
file_names
, Extract.1 = unlist(stringi::stri_match_all(file_names,regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)"))[3]
)]
Output:
file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20210228
Expected Output:
file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20210531
Basically, I think I am looking for a way to define which capture group of the regex to extract. A not so elegant workaround is split the string into columns using stringi::stri_match_first
and then select the relevant column afterwards.
dt.test[,.(
file_names
, Extract.1 = stringi::stri_match_first(file_names,regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)")
)][,.(file_names,Extract.1.V2)]
file_names Extract.1.V2
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20200131
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20200531
You could use the builtin function sub
as follows:
dt.test[, Extract.1 := sub(".*delta_(\\d+)_.*", "\\1", file_names)]
file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20210531
you could can use the function tstrsplit
from data.table to split the variables into multiple groups and select the elements that are needed by specifying the argument keep
:
dt.test[, Extract.1 := tstrsplit(file_names, "_", keep=7)]
file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv 20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv 20210531