rdplyrcategorization

Using dplyr to categorize object in two columns of dataframe


Hi I have a sample data frame shown below:

   Policy_Holder_ID Insured_ID
   <chr>            <chr>     
 1 ID27343          ID215664  
 2 ID27310          ID27310   
 3 ID27343          ID205729  
 4 ID27343          ID205728  
 5 ID27348          ID205734  
 6 ID27348          ID205735  
 7 ID27315          ID205719  
 8 ID27315          ID27315   
 9 ID27345          ID205731  
10 ID27345          ID205733  
11 ID27345          ID27345   
12 ID2731           ID2731    
13 ID27310          ID205714  
14 ID27310          ID205715 

Sorry if it's not in dput form. I tried to use this function but didn't get a right result for it

What I want to is to categorize this data frame into 3 different categories as listed below:

  1. Group 1: those policy holders who only insured themselves. In other words, Policy_Holder_ID and Insured_ID are the same (example : ID2731)
  2. Group 2: those policy holders who only bought insurance for other people. In other words, they are listed in Policy_Holder_ID but not in Insured_ID and have 1 or more Insured_ID (example: ID27343)
  3. Group 3: those policy holders who purchased insurance for themselves and other people (example: ID27310)

So the output should look like this:

   Policy_Holder_ID Insured_ID    group
   <chr>            <chr>     
 1 ID27343          ID215664         2
 2 ID27310          ID27310          3
 3 ID27343          ID205729         2
 4 ID27343          ID205728         2
 5 ID27348          ID205734         2
 6 ID27348          ID205735         2
 7 ID27315          ID205719         3
 8 ID27315          ID27315          3  
 9 ID27345          ID205731         3
10 ID27345          ID205733         3
11 ID27345          ID27345          3  
12 ID2731           ID2731           1  
13 ID27310          ID205714         3 
14 ID27310          ID205715         3

I was hoping that you can provide a time-saving solution other than using the for loop for the data. My original data has more than 400000 rows so for loop doesn't help me.


Solution

  • We can use case_when after grouping by 'Policy_Holder_ID'. Based on the description, if we have all elements of 'Insured_ID' matches the 'Policy_Holder_ID' , then return 1, if none of them matches (!= -> again with all), then return 2, and the default option should return 3.

    library(dplyr)
    df1 %>% 
      group_by(Policy_Holder_ID) %>%
      mutate(group = case_when(all(Insured_ID == Policy_Holder_ID) ~ 1, 
              all(Insured_ID != Policy_Holder_ID)~ 2, 
           TRUE ~ 3)) %>%
      ungroup
    

    -output

    # A tibble: 14 x 3
    #   Policy_Holder_ID Insured_ID group
    #   <chr>            <chr>      <dbl>
    # 1 ID27343          ID215664       2
    # 2 ID27310          ID27310        3
    # 3 ID27343          ID205729       2
    # 4 ID27343          ID205728       2
    # 5 ID27348          ID205734       2
    # 6 ID27348          ID205735       2
    # 7 ID27315          ID205719       3
    # 8 ID27315          ID27315        3
    # 9 ID27345          ID205731       3
    #10 ID27345          ID205733       3
    #11 ID27345          ID27345        3
    #12 ID2731           ID2731         1
    #13 ID27310          ID205714       3
    #14 ID27310          ID205715       3
    

    data

    df1 <- structure(list(Policy_Holder_ID = c("ID27343", "ID27310", "ID27343", 
    "ID27343", "ID27348", "ID27348", "ID27315", "ID27315", "ID27345", 
    "ID27345", "ID27345", "ID2731", "ID27310", "ID27310"), Insured_ID = c("ID215664", 
    "ID27310", "ID205729", "ID205728", "ID205734", "ID205735", "ID205719", 
    "ID27315", "ID205731", "ID205733", "ID27345", "ID2731", "ID205714", 
    "ID205715")), class = "data.frame", row.names = c("1", "2", "3", 
    "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14"))