I have a rather large set of delimited data listing characteristics of individuals in rows similar to these synthetic data:
id; ICD_Codes
1; F10.10; F11.21; F31.81
2; G89.29; M54.5; F31.4; F11.21; F17.200; F43.10; Z72.0
3; F14.10; F17.200; F31.81; F31.32; F10.21
(And yes, the number of "columns" varies for each row)
I would like to extract from it a series of dummy codes, one for each ICD_Codes
value, and populate those dummy variables with whether a given individual presented with that value, i.e.:
id | F10.10 | F10.21 | F11.21 | F14.10 | F17.200 | F31.32 | F31.4 | F31.81 | F43.10 | G89.29 | M54.5 | Z72.0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
Or, as an R dataframe:
df <- data.frame(id = c(1, 2, 3),
F10.10 = c(1, 0, 0),
F10.21 = c(0, 0, 1),
F11.21 = c(1, 1, 0),
F14.10 = c(0, 0, 1),
F17.200 = c(0, 1, 1),
F31.32 = c(0, 0, 1),
F31.4 = c(0, 1, 0),
F31.81 = c(1, 0, 1),
F43.10 = c(0, 1, 0),
G89.29 = c(0, 1, 0),
M54.5 = c(0, 1, 0),
Z72.0 = c(0, 1, 0)
)
I can think of no easy way to do this (anywhere, including in R), but surely there is way!
I would appreciate any help doing so.
Assume you have not imported the data into R, we can use read.table
to read the data as a single column (choosing a random sep
symbol that you sure will not appear in your data).
Then split apart id
and ICD_Codes
into two columns, and use fastDummies::dummy_cols()
to create the dummy variables. Finally rename
to remove the column prefix to fit your desired output.
library(dplyr)
library(tidyr)
library(fastDummies)
df <- read.table(text = "id; ICD_Codes
1; F10.10; F11.21; F31.81
2; G89.29; M54.5; F31.4; F11.21; F17.200; F43.10; Z72.0
3; F14.10; F17.200; F31.81; F31.32; F10.21", sep = "@", header = T)
df %>%
separate_wider_delim("id..ICD_Codes", names = c("id", "ICD_Codes"), delim = "; ", too_many = "merge") %>%
dummy_cols("ICD_Codes", remove_selected_columns = T, split = ";") %>%
rename_with(~sub("ICD_Codes_", "", .x))
# A tibble: 3 × 13
id F10.10 F11.21 F31.81 F14.10 F17.200 F31.32 F10.21 G89.29 M54.5 F31.4 F43.10 Z72.0
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 1 0 0 0 0 0 0 0 0 0
2 2 0 1 0 0 1 0 0 1 1 1 1 1
3 3 0 0 1 1 1 1 1 0 0 0 0 0