rdataframedummy-variable

Extracting Dummy Variables from Rows of Values


I have a rather large set of delimited data listing characteristics of individuals in rows similar to these synthetic data:

id; ICD_Codes
1; F10.10; F11.21; F31.81
2; G89.29; M54.5; F31.4; F11.21; F17.200; F43.10; Z72.0
3; F14.10; F17.200; F31.81; F31.32; F10.21

(And yes, the number of "columns" varies for each row)

I would like to extract from it a series of dummy codes, one for each ICD_Codes value, and populate those dummy variables with whether a given individual presented with that value, i.e.:

id F10.10 F10.21 F11.21 F14.10 F17.200 F31.32 F31.4 F31.81 F43.10 G89.29 M54.5 Z72.0
1 1 0 1 0 0 0 0 1 0 0 0 0
2 0 0 1 0 1 0 1 0 1 1 1 1
3 0 1 0 1 1 1 0 1 0 0 0 0

Or, as an R dataframe:

df <- data.frame(id = c(1, 2, 3),
                 F10.10 = c(1, 0, 0),
                 F10.21 = c(0, 0, 1),
                 F11.21 = c(1, 1, 0),
                 F14.10 = c(0, 0, 1),
                 F17.200 = c(0, 1, 1),
                 F31.32 = c(0, 0, 1),
                 F31.4 = c(0, 1, 0),
                 F31.81 = c(1, 0, 1),
                 F43.10 = c(0, 1, 0),
                 G89.29 = c(0, 1, 0),
                 M54.5 = c(0, 1, 0),
                 Z72.0 = c(0, 1, 0)
                 )

I can think of no easy way to do this (anywhere, including in R), but surely there is way!

I would appreciate any help doing so.


Solution

  • Assume you have not imported the data into R, we can use read.table to read the data as a single column (choosing a random sep symbol that you sure will not appear in your data).

    Then split apart id and ICD_Codes into two columns, and use fastDummies::dummy_cols() to create the dummy variables. Finally rename to remove the column prefix to fit your desired output.

    library(dplyr)
    library(tidyr)
    library(fastDummies)
    
    df <- read.table(text = "id; ICD_Codes
    1; F10.10; F11.21; F31.81
    2; G89.29; M54.5; F31.4; F11.21; F17.200; F43.10; Z72.0
    3; F14.10; F17.200; F31.81; F31.32; F10.21", sep = "@", header = T)
    
    df %>% 
      separate_wider_delim("id..ICD_Codes", names = c("id", "ICD_Codes"), delim = "; ", too_many = "merge") %>% 
      dummy_cols("ICD_Codes", remove_selected_columns = T, split = ";") %>% 
      rename_with(~sub("ICD_Codes_", "", .x))
    
    # A tibble: 3 × 13
      id    F10.10 F11.21 F31.81 F14.10 F17.200 F31.32 F10.21 G89.29 M54.5 F31.4 F43.10 Z72.0
      <chr>  <int>  <int>  <int>  <int>   <int>  <int>  <int>  <int> <int> <int>  <int> <int>
    1 1          1      1      1      0       0      0      0      0     0     0      0     0
    2 2          0      1      0      0       1      0      0      1     1     1      1     1
    3 3          0      0      1      1       1      1      1      0     0     0      0     0