I have an R data frame, and I need to perform a random binomial draw for each row. The n =
argument of each draw is taken from a column of that row. Further, the operation should sit inside a case_when()
so that the probability used depends on a condition in the data.
Note: rowwise()
from the tidyverse
is much too slow: the data frame is large, and the operation runs at every timestep of a simulation model. Is there a quick and efficient way to do this?
Example:
library(tidyverse)
df = data.frame(condition = c("A","B","A","B","C"),
                number = c(1000,1000,1000,1000,1))
prob1 = 0.000517143
prob2 = 0.000213472
set.seed(1)
df = df %>%
  mutate(output = case_when(condition == "A" ~ sum(rbinom(n = number,
                                                          size = 1,
                                                          prob = prob1)),
                            condition == "B" ~ sum(rbinom(n = number,
                                                          size = 1,
                                                          prob = prob2)),
                            TRUE ~ 0))
print(df)
#>   condition number output
#> 1         A   1000      0
#> 2         B   1000      0
#> 3         A   1000      0
#> 4         B   1000      0
#> 5         C      1      0
Here, it looks like the random binomial draws are being reused and returning all zeros.
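As a check, here it is sampled repeatedly. If each row were drawn independently, sum(df$output)
should average roughly 1.5 per iteration (2 × 1000 × prob1 + 2 × 1000 × prob2 ≈ 1.46), so ten zeros in a row would be very unlikely.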
for(i in 1:10){
  df = df %>%
    mutate(output = case_when(condition == "A" ~ sum(rbinom(n = number,
                                                            size = 1,
                                                            prob = prob1)),
                              condition == "B" ~ sum(rbinom(n = number,
                                                            size = 1,
                                                            prob = prob2)),
                              TRUE ~ 0))
  print(sum(df$output))
}
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
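For reference, the total I would expect under the intended per-row model can be worked out directly. This is only a small sketch: the `prob` vector below is just prob1 and prob2 laid out per row (with 0 for condition C), and `expected_total` and `p_all_zero` are names introduced here for illustration.

```r
number <- c(1000, 1000, 1000, 1000, 1)
prob   <- c(0.000517143, 0.000213472, 0.000517143, 0.000213472, 0)

# Expected value of a binomial is size * prob; sum over rows for the total
expected_total <- sum(number * prob)
expected_total   # about 1.46

# Probability that every row yields zero successes in one timestep
p_all_zero <- prod((1 - prob)^number)
p_all_zero       # roughly 0.23
```

So a single all-zero timestep would not be surprising, but ten in a row should be extremely rare if each row were really getting its own draws.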
Unsure of the way forward.
Why are you summing draws of size 1? Refer to Wikipedia:
In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p).
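Thus, you can sample once per row and don't need to sum: the sum of number draws with size = 1 has the same distribution as a single draw with size = number. Since rbinom
is vectorized over size and prob, you don't need a loop.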
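A likely reason the original code returns the same value on every row: when `n` has length greater than 1, `rbinom()` takes the number of draws to be `length(n)` rather than the values in `n`, and `sum()` then collapses those draws to a single number that `mutate()` recycles down the column. The sketch below illustrates this with the question's numbers; the variable names `draws`, `total`, and `per_row` are only for illustration.

```r
number <- c(1000, 1000, 1000, 1000, 1)
prob1  <- 0.000517143

set.seed(1)
# n is a length-5 vector, so rbinom() returns length(n) = 5 Bernoulli draws,
# not 1000 draws per row
draws <- rbinom(n = number, size = 1, prob = prob1)
length(draws)    # 5

# sum() reduces them to one number, which mutate() would recycle to every row
total <- sum(draws)
length(total)    # 1

# One binomial draw per row instead: n = number of rows, size = trials per row
per_row <- rbinom(n = length(number), size = number, prob = prob1)
length(per_row)  # 5
```

With five Bernoulli draws at a probability near 0.0005, the sum is almost always 0, which matches the all-zero output in the question.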
df <- merge(df, data.frame(condition = c("A", "B"),
                           prob = c(0.000517143, 0.000213472)),
            by = "condition", all.x = TRUE)
df[is.na(df$prob), "prob"] <- 0
set.seed(1)
df$output <- with(df, rbinom(length(number), size = number, prob = prob))
#  condition number        prob output
#1         A   1000 0.000517143      0
#2         A   1000 0.000517143      0
#3         B   1000 0.000213472      0
#4         B   1000 0.000213472      1
#5         C      1 0.000000000      0
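If the draw should stay inside a dplyr pipeline with case_when(), one possible sketch is to let case_when() pick the per-row probability and then make a single vectorized rbinom() call. This assumes dplyr is installed and reuses the question's `prob1` and `prob2`; the `prob` helper column is introduced here only for illustration. Unlike merge(), which sorts the result by the key column by default, this keeps the original row order.

```r
library(dplyr)

df <- data.frame(condition = c("A", "B", "A", "B", "C"),
                 number = c(1000, 1000, 1000, 1000, 1))
prob1 <- 0.000517143
prob2 <- 0.000213472

set.seed(1)
df <- df %>%
  mutate(
    # case_when() only selects the probability for each row
    prob = case_when(condition == "A" ~ prob1,
                     condition == "B" ~ prob2,
                     TRUE ~ 0),
    # one binomial draw per row: n() rows, size and prob taken row by row
    output = rbinom(n = n(), size = number, prob = prob)
  )

print(df)
```

Because no rowwise() or loop is involved, each timestep costs a single vectorized call, and rows with prob = 0 always return 0.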