For a dataset like this
MainID SubID DOB BMI
1234 1234_A Feb-19-2024 10.1
1235 1235_A Jan-11-2023 17.23
1235 1235_B Jan-11-2023 19.11
5136 5136_A May-17-2021 21.87
5136 5136_B May-17-2021 14.18
5136 5136_C May-17-2021 18.11
3357 3357-A Oct-06-2023 24.10
9124 9124-B July-01-2021 12.09
9124 9124-B July-01-2021 15.06
I am trying to randomly assign a value 0 or 1 only if the values in MainID and DOB are same SubID is different. Expecting a dataset like this
MainID SubID DOB BMI Col1
1234 1234_A Feb-19-2024 10.1 0
1235 1235_A Jan-11-2023 17.23 0
1235 1235_B Jan-11-2023 19.11 1
3357 3357-A Oct-06-2023 24.10 0
5136 5136_A May-17-2021 21.87 0
5136 5136_B May-17-2021 14.18 0
5136 5136_C May-17-2021 18.11 1
9124 9124-B July-01-2021 12.09 0
9124 9124-B July-01-2021 15.06 0
Here only rows with ID 1235 and 5136 are assigned 0 or 1 because the repeated rows have same MainID, DOB and different SubID.
I tried options with ifelse
and duplicated(df[c("MainID", "DOB")])
but this did not work. Any suggestion is much appreciated. Thanks in advance.
It's not clear if you want 0:1
sampled with replacement. I'm assuming always 0:1
and with-replacement only when the number of rows in a group is more than 2.
Up front, both dplyr and base-R code have a snippet similar to this.
sample(c(0:1, sample(0:1, size = n()-2, replace = (n() > 4))))
0
and one 1
in the bunch; if we did sample(0:1, size=n(), replace=TRUE)
, it is feasible (though improbable) that we can have all 0
s or all 1
s, and I'm inferring that we want/need at least one of each;c(0:1, ..)
, forcing the presence of each number;sample
to produce a size of n() - 2
(2 fewer than there are rows); if there are only four rows, then the outer 0:1
and the inner 0:1
will suffice, so we don't technically need replacement, ergo replace=(n() > 4)
0:1
and may have zero or more 0
s and 1
s based on the number of rows in the group, we need to randomly reorder those (because we don't want the first two to be always 0:1
library(dplyr)
set.seed(42)
quux |>
mutate(
Col1 = if (n_distinct(SubID) > 1) {
sample(c(0:1, sample(0:1, size = n()-2, replace = (n() > 4))))
} else 0,
.by = c(MainID, DOB)
)
# MainID SubID DOB BMI Col1
# 1 1234 1234_A Feb-19-2024 10.10 0
# 2 1235 1235_A Jan-11-2023 17.23 0
# 3 1235 1235_B Jan-11-2023 19.11 1
# 4 5136 5136_A May-17-2021 21.87 0
# 5 5136 5136_B May-17-2021 14.18 1
# 6 5136 5136_C May-17-2021 18.11 0
# 7 3357 3357-A Oct-06-2023 24.10 0
# 8 9124 9124-B July-01-2021 12.09 0
# 9 9124 9124-B July-01-2021 15.06 0
ave(seq_len(nrow(quux)), quux[, c("MainID", "DOB")],
FUN = function(ind) {
nr <- length(ind)
nsub <- length(unique(quux$SubID[ind]))
if (nsub > 1) sample(c(0:1, sample(0:1, size = nr-2, replace = (nr>4)))) else 0
})
# [1] 0 0 1 0 1 0 0 0 0
Data
quux <- structure(list(MainID = c(1234L, 1235L, 1235L, 5136L, 5136L, 5136L, 3357L, 9124L, 9124L), SubID = c("1234_A", "1235_A", "1235_B", "5136_A", "5136_B", "5136_C", "3357-A", "9124-B", "9124-B"), DOB = c("Feb-19-2024", "Jan-11-2023", "Jan-11-2023", "May-17-2021", "May-17-2021", "May-17-2021", "Oct-06-2023", "July-01-2021", "July-01-2021"), BMI = c(10.1, 17.23, 19.11, 21.87, 14.18, 18.11, 24.1, 12.09, 15.06)), class = "data.frame", row.names = c(NA, -9L))