rfunctiongrepl

Function to search multiple strings across multiple columns and create binary output variables


I am working with a clinical data table that contains hospital episodes for a few hundred thousand people across an approx 15 year period. I have cleaned my data table to now be formatted as one row per episode, each of which contains the date of the episode and the associated primary and secondary diagnoses (up to 20 secondary diagnoses). The diagnoses columns contain strings of ICD-10 disease codes.

I am interested in extracting data on approximately 60 conditions from this table, each of which is made up of a variable number of clinical codes (e.g. asthma has five associated ICD-10 codes, anxiety 3 codes, lyme disease 3 codes..). I'd like to write a function that searches across all of the primary and secondary diagnosis columns, checks for any of the strings associated with the 60 conditions, and creates a binary yes/no output which is coded 1 if any of the associated ICD codes are present, or otherwise 0.

I've worked out how to do this on the primary diagnoses column only ('diag_icd10') using grepl within ifelse within mutate, but having to write across many many lines of code to get what I want for all 60 conditions. Shortened example below...

anxiety_codes <- c("F400", "F401", "F402")
asthma_codes <- c("J450", "J451", "J458", "J459", "J46X")
chronic_lyme_codes <- c("A692", "G630", "M012")

# Checking primary diagnosis column 'diag_icd10'
data <- data |> 
  mutate(
    anxiety_prim = ifelse(grepl(paste(anxiety_codes, collapse = '|'), diag_icd10), 1, 0),
    asthma_prim = ifelse(grepl(paste(asthma_codes, collapse = '|'), diag_icd10), 1, 0),
    lymedisease_prim = ifelse(grepl(paste(chronic_lyme_codes, collapse = '|'), diag_icd10), 1, 0)) 

My problems are:

  1. I'd like to make this a function that checks for all 60 conditions and creates the binary condition_prim variables without having to write 60+ lines of code.
  2. I'd also like to create a similar function that checks for the same string patterns but across multiple secondary diagnosis columns (21 columns total... 'sec_diag_0' : 'sec_diag_20') and returns to a "condition"_sec variable whether the conditions were present or not.
  3. To amalgamate the two, checking across both the primary and secondary columns and returning to a single condition variable whether the conditions were present or not.

Solution

  • Here is one approach using example data. You can make a named list of your diagnosis vectors:

    anxiety_codes <- c("F400", "F401", "F402")
    asthma_codes <- c("J450", "J451", "J458", "J459", "J46X")
    chronic_lyme_codes <- c("A692", "G630", "M012")
    
    dx_list <- mget(ls(pattern = "(\\w+)_codes"))
    names(dx_list) <- sub("_codes$", "", names(dx_list))
    
    dx_list
    

    Which looks like this:

    $anxiety
    [1] "F400" "F401" "F402"
    
    $asthma
    [1] "J450" "J451" "J458" "J459" "J46X"
    
    $chronic_lyme
    [1] "A692" "G630" "M012"
    

    I made an example data.frame with an id, the primary diagnosis code, and a couple of secondary diagnostic codes.

    df <- data.frame(
      id = 1:3,
      diag_icd10 = c("F400", "J458", "M012"),
      sec_diag_0 = c("J450", NA, "J46X"),
      sec_diag_1 = c(NA, NA, "F401")
    )
    

    Here is the data.frame:

      id diag_icd10 sec_diag_0 sec_diag_1
    1  1       F400       J450       <NA>
    2  2       J458       <NA>       <NA>
    3  3       M012       J46X       F401
    

    You can use sapply through your diagnosis list, and first check the primary diagnosis column and change column name to include "prim" for primary. For secondary diagnoses, use Reduce and lapply through multiple columns, then label with suffix of "sec" for secondary.

    df_result <- cbind(
      df[1],
      setNames(
        as.data.frame(sapply(dx_list, \(x) +(df[,2] %in% x))),
        paste(names(dx_list), "prim", sep = "_")
      ),
      setNames(
        as.data.frame(sapply(dx_list, \(x) +Reduce(`|`, lapply(df[,3:4], `%in%`, x)))),
        paste(names(dx_list), "sec", sep = "_")
      )
    )
    

    The result looks like this:

      id anxiety_prim asthma_prim chronic_lyme_prim anxiety_sec asthma_sec chronic_lyme_sec
    1  1            1           0                 0           0          1                0
    2  2            0           1                 0           0          0                0
    3  3            0           0                 1           1          1                0
    

    You can combine like columns and add additional summary (primary or secondary diagnosis):

    sapply(split.default(df_result[-1], sub("_(prim|sec)$", "", names(df_result[-1]))),
           \(x) +(rowSums(x) > 0))
    

    Which has this output (can be added to df_result if desired):

         anxiety asthma chronic_lyme
    [1,]       1      1            0
    [2,]       0      1            0
    [3,]       1      1            1
    

    Or, if you want to just do all columns at once, try:

    cbind(
      df[1],
      setNames(
        as.data.frame(sapply(dx_list, \(x) +(df[,2] %in% x))),
        paste(names(dx_list), "prim", sep = "_")
      ),
      setNames(
        as.data.frame(sapply(dx_list, \(x) +Reduce(`|`, lapply(df[,3:4], `%in%`, x)))),
        paste(names(dx_list), "sec", sep = "_")
      ),
      sapply(dx_list, \(x) +Reduce(`|`, lapply(df[,2:4], `%in%`, x)))
    )