r data.table sample

sample with row specific probability in a data.table


I'm trying to sample a dummy based on probabilities that are part of my data.table. If my data.table only has two rows, this works:

library(data.table)
playdata <- data.table(id = c("a","b"), probabilities = c(0.2, 0.3))
playdata[, sampled_dummy := sample(c(0,1),1, prob = probabilities)]

If it has three or more rows, it does not:

library(data.table)
playdata <- data.table(id = c("a","b","c"), probabilities = c(0.2, 0.3, 0.4))
playdata[, sampled_dummy := sample(c(0,1),1, prob = probabilities)]

Error in sample.int(length(x), size, replace, prob) : 
  incorrect number of probabilities

Can someone explain this? I know I can force any function to run row by row, but why does sample break the standard data.table syntax? Shouldn't it do everything row by row anyway?

Edit: a usual workaround throws the same error:

playdata[, sampled_dummy := sample(c(0,1),1, prob = probabilities), by = seq_len(nrow(playdata))]

Solution

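  • I think you need to do the sampling row-wise, so I'll demo with sapply. Note that sample() needs one weight per outcome, hence prob=c(prob,1-prob); as written, 0 is drawn with probability prob and 1 with probability 1-prob: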

    set.seed(42)
    playdata[, sampled_dummy := sapply(probabilities, function(prob) sample(0:1, size=1, prob=c(prob,1-prob)))]
    #        id probabilities sampled_dummy
    #    <char>         <num>         <int>
    # 1:      a           0.2             0
    # 2:      b           0.3             0
    # 3:      c           0.4             1
    

    Though I suspect that it might be easier for you to use runif(.N) >= probabilities?

    playdata[, sampled_dummy := +(runif(.N) >= probabilities)]
    #        id probabilities sampled_dummy
    #    <char>         <num>         <int>
    # 1:      a           0.2             1
    # 2:      b           0.3             1
    # 3:      c           0.4             0
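
To address the "why" in the question, here is a minimal sketch of what seems to be going on. data.table evaluates j once per group (once for the whole table when there is no by), so probabilities is the entire column, while sample() expects prob to hold one weight per element of x. With c(0,1) that means exactly two weights, which is why the two-row table ran without an error. Even then it made a single draw with weights 0.2 and 0.3 and recycled it to both rows. With by = seq_len(nrow(playdata)), prob has length one, which is again not two. The rbinom() line at the end is an alternative, assuming probabilities is meant to be the chance of drawing a 1 (the snippets above draw a 1 with probability 1 - prob). The seed is arbitrary.

```r
library(data.table)

playdata <- data.table(id = c("a", "b", "c"), probabilities = c(0.2, 0.3, 0.4))

# j sees whole columns: with no `by`, `probabilities` has length 3 here.
n_weights <- playdata[, length(probabilities)]
n_weights  # 3

# sample() wants one weight per element of x; c(0, 1) has two elements,
# so three weights raise the error quoted in the question.
err <- tryCatch(
  sample(c(0, 1), 1, prob = c(0.2, 0.3, 0.4)),
  error = function(e) conditionMessage(e)
)
err

# Vectorised alternative: one Bernoulli draw per row, P(1) = probabilities.
set.seed(42)  # arbitrary seed, only for reproducibility
playdata[, sampled_dummy := rbinom(.N, size = 1, prob = probabilities)]
playdata
```

rbinom() accepts a vector for prob, so no sapply() or by is needed, and the column comes back as integer 0/1.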