I'm trying to sample a dummy based on probabilities that are part of my data.table. If my data.table only has two rows, this works:
library(data.table)
playdata <- data.table(id = c("a","b"), probabilities = c(0.2, 0.3))
playdata[, sampled_dummy := sample(c(0,1),1, prob = probabilities)]
If it has three or more rows, it does not:
library(data.table)
playdata <- data.table(id = c("a","b","c"), probabilities = c(0.2, 0.3, 0.4))
playdata[, sampled_dummy := sample(c(0,1),1, prob = probabilities)]
Error in sample.int(length(x), size, replace, prob) :
incorrect number of probabilities
Can someone explain this? I know I can force any function to run row by row, but why does sample break the standard data.table syntax? Shouldn't it do everything row by row anyway?
Edit: the usual workaround throws the same error:
playdata[, sampled_dummy := sample(c(0,1),1, prob = probabilities), by = seq_len(nrow(playdata))]
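I think you need to do the sampling row-wise, since sample() expects prob to hold one weight per element of x (two here, for 0 and 1) rather than one weight per row. I'll demo with sapply: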
set.seed(42)
# one draw per row: 0 with probability prob, 1 with probability 1 - prob
playdata[, sampled_dummy := sapply(probabilities, function(prob) sample(0:1, size = 1, prob = c(prob, 1 - prob)))]
#        id probabilities sampled_dummy
#    <char>         <num>         <int>
# 1:      a           0.2             0
# 2:      b           0.3             0
# 3:      c           0.4             1
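As for why the original call fails, here is a minimal sketch of what I believe is going on, using the same playdata as in the question (two_rows and err are just illustrative names). As far as I understand it, j is evaluated once with whole columns rather than once per row, so sample() receives every probability in the table at once.

```r
library(data.table)

playdata <- data.table(id = c("a", "b", "c"), probabilities = c(0.2, 0.3, 0.4))

# j is evaluated once, with `probabilities` bound to the entire column:
# three weights reach sample(), but c(0, 1) only has two elements.
playdata[, length(probabilities)]

# With two rows the lengths happen to match, so there is no error, but
# sample() still returns a single value, which := recycles to every row.
two_rows <- data.table(id = c("a", "b"), probabilities = c(0.2, 0.3))
two_rows[, sampled_dummy := sample(c(0, 1), 1, prob = probabilities)]
two_rows[, uniqueN(sampled_dummy)]  # both rows share the same draw

# Grouping by row passes one weight for two candidate values,
# so the lengths still disagree and sample() raises an error again.
err <- tryCatch(
  playdata[, sample(c(0, 1), 1, prob = probabilities), by = seq_len(nrow(playdata))],
  error = function(e) conditionMessage(e)
)
err
```

So the two-row version only appears to work: it treats the two probabilities as weights for 0 and 1 and makes one draw for the whole table.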
Though I suspect that it might be easier for you to use runif(.N) >= probabilities?
playdata[, sampled_dummy := +(runif(.N) >= probabilities)]
#        id probabilities sampled_dummy
#    <char>         <num>         <int>
# 1:      a           0.2             1
# 2:      b           0.3             1
# 3:      c           0.4             0
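Another vectorised option, offered as a sketch rather than the one right way, is rbinom(), which accepts a vector of probabilities and makes one Bernoulli draw per row. The check table and the seed below are made up for illustration. I'm assuming the same convention as the two snippets above, where probabilities is the chance of getting a 0; if it is meant to be the chance of getting a 1, pass prob = probabilities instead.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the run is reproducible

# Hypothetical larger table so the empirical frequencies are meaningful
check <- data.table(probabilities = rep(c(0.2, 0.3, 0.4), each = 1e4))

# One Bernoulli draw per row; rbinom() is vectorised over `prob`.
# 1 - probabilities keeps the convention above: P(dummy == 0) = probabilities
check[, sampled_dummy := rbinom(.N, size = 1, prob = 1 - probabilities)]

# Share of zeros per probability value; each should land close to `probabilities`
res <- check[, .(share_zero = mean(sampled_dummy == 0)), by = probabilities]
res
```

Comparing share_zero with probabilities on a large table like this is a quick way to confirm which direction the dummy is coded in.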