rregexgrepl

can a variable be used to form regex pattern in R


I have a variation on this post where I need to build a comma-separated field of values based on codes across 5 columns. I need to examine each column to determine if any one of several different crops are present in that year. Each crop can be represented by multiple codes, though. I can do a series of grepl statements with each combination of codes, but that will get very long. Can the grepl pattern be built from variables? Or is there another way to do this?

Here is an example limiting the data to 4 years, 3 crops (corn, soybeans, and wheat), and just a couple different codes each.

corn_codes <- 1
soybean_codes <- c(2,446)
wheat_codes <- c(3,427)

df <- 
  crossing(yr2019=as.character(c(1,2,3,446,427)),
           yr2020=as.character(c(1,2,3,446,427)),
           yr2021=as.character(c(1,2,3,446,427)),
           yr2022=as.character(c(1,2,3,446,427))) %>%
  rowwise() %>%
  mutate(new_column = case_when(
    grepl("1,.*2,.*3|1,.*3,.*2|2,.*1,.*3|2,.*3,.*1|3,.*1,.*2|3,.*2,.*1",
          paste(as.vector(distinct(as.data.frame(c(yr2019,yr2020,yr2021,yr2022))))[[1]], 
                collapse = ",")) ~ "1,2,3",
    grepl("1,.*446,.*3|1,.*3,.*446|446,.*1,.*3|446,.*3,.*1|3,.*1,.*446|3,.*446,.*1",
          paste(as.vector(distinct(as.data.frame(c(yr2019,yr2020,yr2021,yr2022))))[[1]], 
                collapse = ",")) ~ "1,2,3",
    grepl("1,.*2,.*427|1,.*427,.*2|2,.*1,.*427|2,.*427,.*1|427,.*1,.*2|427,.*2,.*1",
          paste(as.vector(distinct(as.data.frame(c(yr2019,yr2020,yr2021,yr2022))))[[1]], 
                collapse = ",")) ~ "1,2,3",
    grepl("1,.*446,.*427|1,.*427,.*446|446,.*1,.*427|446,.*427,.*1|427,.*1,.*446|427,.*446,.*1",
          paste(as.vector(distinct(as.data.frame(c(yr2019,yr2020,yr2021,yr2022))))[[1]], 
                collapse = ",")) ~ "1,2,3",
    TRUE ~ "x"
  ) # end case_when
  ) %>% # end mutate
  print(n=20)

First 20/625 rows:

# A tibble: 625 × 5
# Rowwise: 
   yr2019 yr2020 yr2021 yr2022 new_column
   <chr>  <chr>  <chr>  <chr>  <chr>     
 1 1      1      1      1      x         
 2 1      1      1      2      x         
 3 1      1      1      3      x         
 4 1      1      1      427    x         
 5 1      1      1      446    x         
 6 1      1      2      1      x         
 7 1      1      2      2      x         
 8 1      1      2      3      1,2,3     
 9 1      1      2      427    1,2,3     
10 1      1      2      446    x         
11 1      1      3      1      x         
12 1      1      3      2      1,2,3     
13 1      1      3      3      x         
14 1      1      3      427    1,2,3     
15 1      1      3      446    1,2,3     
16 1      1      427    1      x         
17 1      1      427    2      1,2,3     
18 1      1      427    3      x         
19 1      1      427    427    x         
20 1      1      427    446    1,2,3     
# … with 605 more rows

This looks for any of the three crops occurring in the four years, in any order. There are two issues:

  1. The complexity of having a grepl line for every possible combination of the multiple crop codes.
  2. Because they are numbers, there is an issue with incorrect interpretation as this is currently written. "2" is a code for soybeans, but one of the wheat codes ("427") has a 2 in it, and gets picked up as soybean where grepl looks for ".*2", as you can see in row 14. Row 14 should have an "x" because it only actually includes corn and wheat. (Also see row 54 and others, not shown.) If it makes the solution easier, these could be converted to letters, like in the previous post referred to, above.

All suggestions are appreciated.


Solution

  • This just seems the wrong way to approach the analysis. It seems to me that it would be much easier to switch the codes for the crop names, then use a simple ifelse to determine whether all crops are present:

    df %>%
      mutate(across(everything(), ~ case_when(.x %in% corn_codes ~ "corn",
                                              .x %in% soybean_codes ~ "soybean",
                                              .x %in% wheat_codes ~ "wheat",
                                              TRUE ~ "other"))) %>%
      rowwise() %>%
      mutate(new_column = ifelse(all(c("corn", "soybean", "wheat") %in% 
                                       c_across(everything())), "1, 2, 3", "x")) 
    #> # A tibble: 625 x 5
    #> # Rowwise: 
    #>    yr2019 yr2020 yr2021  yr2022  new_column
    #>    <chr>  <chr>  <chr>   <chr>   <chr>     
    #>  1 corn   corn   corn    corn    x         
    #>  2 corn   corn   corn    soybean x         
    #>  3 corn   corn   corn    wheat   x         
    #>  4 corn   corn   corn    wheat   x         
    #>  5 corn   corn   corn    soybean x         
    #>  6 corn   corn   soybean corn    x         
    #>  7 corn   corn   soybean soybean x         
    #>  8 corn   corn   soybean wheat   1, 2, 3   
    #>  9 corn   corn   soybean wheat   1, 2, 3   
    #> 10 corn   corn   soybean soybean x         
    #> 11 corn   corn   wheat   corn    x         
    #> 12 corn   corn   wheat   soybean 1, 2, 3   
    #> 13 corn   corn   wheat   wheat   x         
    #> 14 corn   corn   wheat   wheat   x         
    #> 15 corn   corn   wheat   soybean 1, 2, 3   
    #> 16 corn   corn   wheat   corn    x         
    #> 17 corn   corn   wheat   soybean 1, 2, 3   
    #> 18 corn   corn   wheat   wheat   x         
    #> 19 corn   corn   wheat   wheat   x         
    #> 20 corn   corn   wheat   soybean 1, 2, 3   
    #> # i 605 more rows
    

    Created on 2023-08-23 with reprex v2.0.2