[SOLVED] Grepl with logical operator AND across multiple alphanumeric columns

Grepl with logical operator AND across multiple alphanumeric columns

I have an enormous dataset that contains 25 columns of medical codes. Each row represents one medical visit. I need to create a new column that flags where two codes appear together on each row. In other words, I want to grep across multiple columns and flag where the two codes are present together the medical visit.

I thougt about creating a new variable for each alphanumeric code I want to grep on, then creating a final variables with case_when(), but is there a faster way to do this?

Here's a toy data set:

diag_p <- c('a1', 'a4', 'c5', 'a4', 'b1')
odiag1 <- c('b1', 'b2', 'c3', 'd4', 'e5')
odiag2 <- c('f1', 'g4', 'h4', 'i5', 'a1')
odiag3 <- c('a6', 'b1', 'c8', 'a1', 'e10')
sample_df <- data.frame(diag_p, odiag1, odiag2, odiag3)

This code works well to search across columns with | and two grep statements, and with >1 at the end of the chunk, which will count more than one match across columns, but it doesn't work quite right. I need it to match (a1 or a4) & (b1 or b4).

new_df <- sample_df %>%
    mutate(
        new_col = rowSums(sapply(select(., diag_p, odiag1:odiag3),
            function(x) (grepl("a[14]", x)) | (grepl("b[14]", x)))) > 1
        )

Is there a way to do this without making one new column for each grep statement, then making a final variable with case_when()?

I updated the code to get rid of the case_when() which I see was confusing folks. This is what I want the code to look like, so that (a1 or a4) AND (b1 or b4) produce a match TRUE or FALSE on each row:

structure(list(diag_p = c("a1", "a4", "c5", "a4", "b1"), odiag1 = c("b1", 
"b2", "c3", "d4", "e5"), odiag2 = c("f1", "g4", "h4", "i5", "a1"
), odiag3 = c("a6", "b1", "c8", "a1", "e10"), new_col = c(TRUE, 
TRUE, FALSE, FALSE, TRUE)), row.names = c(NA, -5L), class = "data.frame")

Solution

Using base R:

sample_df$flag <-
  apply(sample_df, 1, \(row) {
    any(grepl("a[14]", row)) & any(grepl("b[14]", row))
  })

sample_df
#>   diag_p odiag1 odiag2 odiag3 new_col
#> 1     a1     b1     f1     a6    TRUE
#> 2     a4     b2     g4     b1    TRUE
#> 3     c5     c3     h4     c8   FALSE
#> 4     a4     d4     i5     a1   FALSE
#> 5     b1     e5     a1    e10    TRUE