I am working with a clinical data table that contains hospital episodes for a few hundred thousand people across an approx 15 year period. I have cleaned my data table to now be formatted as one row per episode, each of which contains the date of the episode and the associated primary and secondary diagnoses (up to 20 secondary diagnoses). The diagnoses columns contain strings of ICD-10 disease codes.
I am interested in extracting data on approximately 60 conditions from this table, each of which is made up of a variable number of clinical codes (e.g. asthma has five associated ICD-10 codes, anxiety 3 codes, lyme disease 3 codes..). I'd like to write a function that searches across all of the primary and secondary diagnosis columns, checks for any of the strings associated with the 60 conditions, and creates a binary yes/no output which is coded 1 if any of the associated ICD codes are present, or otherwise 0.
I've worked out how to do this on the primary diagnoses column only ('diag_icd10') using grepl within ifelse within mutate, but having to write across many many lines of code to get what I want for all 60 conditions. Shortened example below...
anxiety_codes <- c("F400", "F401", "F402")
asthma_codes <- c("J450", "J451", "J458", "J459", "J46X")
chronic_lyme_codes <- c("A692", "G630", "M012")
# Checking primary diagnosis column 'diag_icd10'
data <- data |>
mutate(
anxiety_prim = ifelse(grepl(paste(anxiety_codes, collapse = '|'), diag_icd10), 1, 0),
asthma_prim = ifelse(grepl(paste(asthma_codes, collapse = '|'), diag_icd10), 1, 0),
lymedisease_prim = ifelse(grepl(paste(chronic_lyme_codes, collapse = '|'), diag_icd10), 1, 0))
My problems are:
condition_prim
variables without having to write 60+ lines of code."condition"_sec
variable whether the conditions were present or not.condition
variable whether the conditions were present or not.Here is one approach using example data. You can make a named list of your diagnosis vectors:
anxiety_codes <- c("F400", "F401", "F402")
asthma_codes <- c("J450", "J451", "J458", "J459", "J46X")
chronic_lyme_codes <- c("A692", "G630", "M012")
dx_list <- mget(ls(pattern = "(\\w+)_codes"))
names(dx_list) <- sub("_codes$", "", names(dx_list))
dx_list
Which looks like this:
$anxiety
[1] "F400" "F401" "F402"
$asthma
[1] "J450" "J451" "J458" "J459" "J46X"
$chronic_lyme
[1] "A692" "G630" "M012"
I made an example data.frame with an id
, the primary diagnosis code, and a couple of secondary diagnostic codes.
df <- data.frame(
id = 1:3,
diag_icd10 = c("F400", "J458", "M012"),
sec_diag_0 = c("J450", NA, "J46X"),
sec_diag_1 = c(NA, NA, "F401")
)
Here is the data.frame:
id diag_icd10 sec_diag_0 sec_diag_1
1 1 F400 J450 <NA>
2 2 J458 <NA> <NA>
3 3 M012 J46X F401
You can use sapply
through your diagnosis list, and first check the primary diagnosis column and change column name to include "prim" for primary. For secondary diagnoses, use Reduce
and lapply
through multiple columns, then label with suffix of "sec" for secondary.
df_result <- cbind(
df[1],
setNames(
as.data.frame(sapply(dx_list, \(x) +(df[,2] %in% x))),
paste(names(dx_list), "prim", sep = "_")
),
setNames(
as.data.frame(sapply(dx_list, \(x) +Reduce(`|`, lapply(df[,3:4], `%in%`, x)))),
paste(names(dx_list), "sec", sep = "_")
)
)
The result looks like this:
id anxiety_prim asthma_prim chronic_lyme_prim anxiety_sec asthma_sec chronic_lyme_sec
1 1 1 0 0 0 1 0
2 2 0 1 0 0 0 0
3 3 0 0 1 1 1 0
You can combine like columns and add additional summary (primary or secondary diagnosis):
sapply(split.default(df_result[-1], sub("_(prim|sec)$", "", names(df_result[-1]))),
\(x) +(rowSums(x) > 0))
Which has this output (can be added to df_result
if desired):
anxiety asthma chronic_lyme
[1,] 1 1 0
[2,] 0 1 0
[3,] 1 1 1
Or, if you want to just do all columns at once, try:
cbind(
df[1],
setNames(
as.data.frame(sapply(dx_list, \(x) +(df[,2] %in% x))),
paste(names(dx_list), "prim", sep = "_")
),
setNames(
as.data.frame(sapply(dx_list, \(x) +Reduce(`|`, lapply(df[,3:4], `%in%`, x)))),
paste(names(dx_list), "sec", sep = "_")
),
sapply(dx_list, \(x) +Reduce(`|`, lapply(df[,2:4], `%in%`, x)))
)