rdplyrduplicatesdata-transform

r randomly assign 1 or 0 based on conditons


For a dataset like this

    MainID    SubID     DOB            BMI
    1234      1234_A    Feb-19-2024    10.1

    1235      1235_A    Jan-11-2023    17.23
    1235      1235_B    Jan-11-2023    19.11

    5136      5136_A    May-17-2021    21.87
    5136      5136_B    May-17-2021    14.18
    5136      5136_C    May-17-2021    18.11

    3357      3357-A    Oct-06-2023    24.10   

    9124      9124-B    July-01-2021   12.09
    9124      9124-B    July-01-2021   15.06

I am trying to randomly assign a value 0 or 1 only if the values in MainID and DOB are same SubID is different. Expecting a dataset like this

    MainID    SubID     DOB             BMI   Col1
    1234      1234_A    Feb-19-2024     10.1  0

    1235      1235_A    Jan-11-2023     17.23 0
    1235      1235_B    Jan-11-2023     19.11 1

    3357      3357-A    Oct-06-2023     24.10 0

    5136      5136_A    May-17-2021     21.87 0  
    5136      5136_B    May-17-2021     14.18 0
    5136      5136_C    May-17-2021     18.11 1

    9124      9124-B    July-01-2021    12.09 0
    9124      9124-B    July-01-2021    15.06 0

Here only rows with ID 1235 and 5136 are assigned 0 or 1 because the repeated rows have same MainID, DOB and different SubID.

I tried options with ifelse and duplicated(df[c("MainID", "DOB")]) but this did not work. Any suggestion is much appreciated. Thanks in advance.


Solution

  • It's not clear if you want 0:1 sampled with replacement. I'm assuming always 0:1 and with-replacement only when the number of rows in a group is more than 2.

    Up front, both dplyr and base-R code have a snippet similar to this.

    sample(c(0:1, sample(0:1, size = n()-2, replace = (n() > 4))))
    

    dplyr

    library(dplyr)
    set.seed(42)
    quux |>
      mutate(
        Col1 = if (n_distinct(SubID) > 1) {
            sample(c(0:1, sample(0:1, size = n()-2, replace = (n() > 4))))
          } else 0,
        .by = c(MainID, DOB)
      )
    #   MainID  SubID          DOB   BMI Col1
    # 1   1234 1234_A  Feb-19-2024 10.10    0
    # 2   1235 1235_A  Jan-11-2023 17.23    0
    # 3   1235 1235_B  Jan-11-2023 19.11    1
    # 4   5136 5136_A  May-17-2021 21.87    0
    # 5   5136 5136_B  May-17-2021 14.18    1
    # 6   5136 5136_C  May-17-2021 18.11    0
    # 7   3357 3357-A  Oct-06-2023 24.10    0
    # 8   9124 9124-B July-01-2021 12.09    0
    # 9   9124 9124-B July-01-2021 15.06    0
    

    base R

    ave(seq_len(nrow(quux)), quux[, c("MainID", "DOB")],
        FUN = function(ind) {
          nr <- length(ind)
          nsub <- length(unique(quux$SubID[ind]))
          if (nsub > 1) sample(c(0:1, sample(0:1, size = nr-2, replace = (nr>4)))) else 0
        })
    # [1] 0 0 1 0 1 0 0 0 0
    

    Data

    quux <- structure(list(MainID = c(1234L, 1235L, 1235L, 5136L, 5136L, 5136L, 3357L, 9124L, 9124L), SubID = c("1234_A", "1235_A", "1235_B", "5136_A", "5136_B", "5136_C", "3357-A", "9124-B", "9124-B"), DOB = c("Feb-19-2024", "Jan-11-2023", "Jan-11-2023", "May-17-2021", "May-17-2021", "May-17-2021", "Oct-06-2023", "July-01-2021", "July-01-2021"), BMI = c(10.1, 17.23, 19.11, 21.87, 14.18, 18.11, 24.1, 12.09, 15.06)), class = "data.frame", row.names = c(NA, -9L))