I have an enormous dataset that contains 25 columns of medical codes. Each row represents one medical visit. I need to create a new column that flags where two codes appear together on each row. In other words, I want to grep across multiple columns and flag where the two codes are present together the medical visit.
I thougt about creating a new variable for each alphanumeric code I want to grep on, then creating a final variables with case_when(), but is there a faster way to do this?
Here's a toy data set:
diag_p <- c('a1', 'a4', 'c5', 'a4', 'b1')
odiag1 <- c('b1', 'b2', 'c3', 'd4', 'e5')
odiag2 <- c('f1', 'g4', 'h4', 'i5', 'a1')
odiag3 <- c('a6', 'b1', 'c8', 'a1', 'e10')
sample_df <- data.frame(diag_p, odiag1, odiag2, odiag3)
This code works well to search across columns with | and two grep statements, and with '>1' at the end of the chunk, which will count more than one match across columns, but it doesn't work quite right. I need it to match (a1 or a4) & (b1 or b4).
new_df <- sample_df %>%
mutate(
new_col = rowSums(sapply(select(., diag_p, odiag1:odiag3),
function(x) (grepl("a[14]", x)) | (grepl("b[14]", x)) )) > 1
)
Is there a way to do this without making one new column for each grep statement, then making a final variable with case_when()?
edit: I updated the code to get rid of the case_when() which I see was confusing folks. This is what I want the code to look like, so that (a1 or a4) AND (b1 or b4) produce a match 'TRUE' or 'FALSE' on each row:
structure(list(diag_p = c("a1", "a4", "c5", "a4", "b1"), odiag1 = c("b1",
"b2", "c3", "d4", "e5"), odiag2 = c("f1", "g4", "h4", "i5", "a1"
), odiag3 = c("a6", "b1", "c8", "a1", "e10"), new_col = c(TRUE,
TRUE, FALSE, FALSE, TRUE)), row.names = c(NA, -5L), class = "data.frame")
Using base R:
sample_df$flag <-
apply(sample_df, 1, \(row) {
any(grepl("a[14]", row)) & any(grepl("b[14]", row))
})
sample_df
#> diag_p odiag1 odiag2 odiag3 new_col
#> 1 a1 b1 f1 a6 TRUE
#> 2 a4 b2 g4 b1 TRUE
#> 3 c5 c3 h4 c8 FALSE
#> 4 a4 d4 i5 a1 FALSE
#> 5 b1 e5 a1 e10 TRUE