rsapplygreplalphanumericicd

Grepl with logical operator AND across multiple alphanumeric columns


I have an enormous dataset that contains 25 columns of medical codes. Each row represents one medical visit. I need to create a new column that flags where two codes appear together on each row. In other words, I want to grep across multiple columns and flag where the two codes are present together the medical visit.

I thougt about creating a new variable for each alphanumeric code I want to grep on, then creating a final variables with case_when(), but is there a faster way to do this?

Here's a toy data set:


    diag_p <- c('a1', 'a4', 'c5', 'a4', 'b1')
    odiag1 <- c('b1', 'b2', 'c3', 'd4', 'e5')
    odiag2 <- c('f1', 'g4', 'h4', 'i5', 'a1')
    odiag3 <- c('a6', 'b1', 'c8', 'a1', 'e10')
    sample_df <- data.frame(diag_p, odiag1, odiag2, odiag3)

This code works well to search across columns with | and two grep statements, and with '>1' at the end of the chunk, which will count more than one match across columns, but it doesn't work quite right. I need it to match (a1 or a4) & (b1 or b4).


    new_df <- sample_df %>% 
      mutate(
              new_col = rowSums(sapply(select(., diag_p, odiag1:odiag3), 
                  function(x) (grepl("a[14]", x)) | (grepl("b[14]", x))                         )) > 1
             )

Is there a way to do this without making one new column for each grep statement, then making a final variable with case_when()?

edit: I updated the code to get rid of the case_when() which I see was confusing folks. This is what I want the code to look like, so that (a1 or a4) AND (b1 or b4) produce a match 'TRUE' or 'FALSE' on each row:

    structure(list(diag_p = c("a1", "a4", "c5", "a4", "b1"), odiag1 = c("b1", 
"b2", "c3", "d4", "e5"), odiag2 = c("f1", "g4", "h4", "i5", "a1"
), odiag3 = c("a6", "b1", "c8", "a1", "e10"), new_col = c(TRUE, 
TRUE, FALSE, FALSE, TRUE)), row.names = c(NA, -5L), class = "data.frame")

Solution

  • Using base R:

    sample_df$flag <-
      apply(sample_df, 1, \(row) {
        any(grepl("a[14]", row)) & any(grepl("b[14]", row))
      })
    
    sample_df
    #>   diag_p odiag1 odiag2 odiag3 new_col
    #> 1     a1     b1     f1     a6    TRUE
    #> 2     a4     b2     g4     b1    TRUE
    #> 3     c5     c3     h4     c8   FALSE
    #> 4     a4     d4     i5     a1   FALSE
    #> 5     b1     e5     a1    e10    TRUE