rregexdata.tablestringrstringi

Select and extract different capture groups from string using regex


I would like to extract various parts of a string using regex patterns and capturing groups. I am able to filter the string using str_match_all, but I would like to have the possibility to explicitely select one of the capturing groups, defined in the regex. The problem is that using that inside of a data.table does not yield the desired results.

dt.test <- data.table(file_names = c("20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv"
                                   , "20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv" ))

I am able to extract the various defined capturing groups using the following command:

stringi::stri_match_all(dt.test[1,file_names],regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)")
[[1]]
      [,1]                                                              [,2]       [,3]       [,4]                 [,5]       [,6]       [,7]      
 [1,] "20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210" "20200131" "20210228" "PROD_TEST_MF_delta" "20210228" "20210107" "20210210"

However, when accessing the resulting list and trying that command inside of a data.table it assigns the value of the first value for all rows of the data.table, which is kind of unexpected.

dt.test[,.(
file_names
 , Extract.1 = unlist(stringi::stri_match_all(file_names,regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)"))[3]
)]

Output:

                                                            file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv  20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv  20210228

Expected Output:

                                                            file_names Extract.1
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv  20210228
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv  20210531

Basically, I think I am looking for a way to define which capture group of the regex to extract. A not so elegant workaround is split the string into columns using stringi::stri_match_first and then select the relevant column afterwards.

dt.test[,.(
  file_names
  , Extract.1 = stringi::stri_match_first(file_names,regex = "(?i)(\\d*)\\_(\\d*)\\_([A-Z]*\\_[A-Z]*\\_[A-Z]*\\_[A-Z]*)\\_(\\d*)\\_(\\d*)\\_(\\d*)")
)][,.(file_names,Extract.1.V2)] 



                                                            file_names Extract.1.V2
1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv     20200131
2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv     20200531

Solution

  • You could use the builtin function sub as follows:

    dt.test[, Extract.1 := sub(".*delta_(\\d+)_.*", "\\1", file_names)]
    
                                                                file_names Extract.1
    1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv  20210228
    2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv  20210531
    

    you could can use the function tstrsplit from data.table to split the variables into multiple groups and select the elements that are needed by specifying the argument keep:

    dt.test[, Extract.1 := tstrsplit(file_names, "_", keep=7)]
    
                                                                file_names Extract.1
    1: 20200131_20210228_PROD_TEST_MF_delta_20210228_20210107_20210210.csv  20210228
    2: 20200531_20210531_PROD_TEST_MF_delta_20210531_20210523_20210608.csv  20210531