Suppose I have a series of n probabilities for success of independent Bernoulli trials, p1 to pn such that p1 != p2 != ... != pn. Give each trial a unique name.
p <- c(0.5, 0.12, 0.7, 0.8, .02)
a <- c("A","B","C","D","E")
I know from searching stack exchange (e.g., here and here) that I can find the cdf, pmf, etc. using the Poisson Binomial distribution function.
What I'm interested in is the exact probability of every possible combination of success and failures. (E.g. If I drew a probability tree, I want to know the probability at the end of each branch.)
all <- prod(p)
all
[1] 0.000672
o1 <- (0.5 * (1-0.12) * 0.7 * 0.8 * .02)
o1
[1] 0.004928
o2 <- (0.5 * 0.12 * (1-0.7) * 0.8 * .02)
o2
[1] 0.000288
...for all 2^5 possible combinations of success/failure.
What's an efficient way to go about this in R?
In the case of my actual data set, the number of trials is 19, so we're talking about 2^19 total paths on the probability tree.
The key to making this computation fast is to do it in log-probability space so that the product for each branch of the tree is a sum that can be computed as the inner sum of a matrix multiply. In this manner, all the branches can be computed together in vectorized fashion.
First, we construct an enumeration of all branches. For this, we use the intToBin
function from R.utils
package:
library(R.utils)
enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))
where n
is the number of Bernoulli variables. For your example, n=5
:
matrix(enum.branches, nrow=n)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
##[1,] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1"
##[2,] "0" "0" "0" "0" "0" "0" "0" "0" "1" "1" "1" "1" "1" "1" "1" "1" "0"
##[3,] "0" "0" "0" "0" "1" "1" "1" "1" "0" "0" "0" "0" "1" "1" "1" "1" "0"
##[4,] "0" "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1" "0"
##[5,] "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0"
## [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
##[1,] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##[2,] "0" "0" "0" "0" "0" "0" "0" "1" "1" "1" "1" "1" "1" "1" "1"
##[3,] "0" "0" "0" "1" "1" "1" "1" "0" "0" "0" "0" "1" "1" "1" "1"
##[4,] "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1"
##[5,] "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1"
results in a matrix where each column is the outcomes from a branch of the probability tree.
Now, use that to construct a matrix of log probabilities of the same size as enum.branches
where the value is log(p)
if enum.branches=="1"
and log(1-p)
otherwise. For your data, with p <- c(0.5, 0.12, 0.7, 0.8, .02)
, this is:
logp <- matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n)
Then, sum the log-probabilities and take the exponential to get the product of the probabilities:
result <- exp(rep(1,n) %*% logp)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##[1,] 0.025872 0.000528 0.103488 0.002112 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05
[,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
##[1,] 0.014112 0.000288 0.008232 0.000168 0.032928 0.000672 0.025872 0.000528 0.103488 0.002112
[,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30]
##[1,] 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05 0.014112 0.000288 0.008232 0.000168
[,31] [,32]
##[1,] 0.032928 0.000672
The result
will be in the same order as the numeration of branches in enum.branches
.
We can encapsulate the computation into a function:
enum.prob.product <- function(n, p) {
enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))
exp(rep(1,n) %*% matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n))
}
Timing this with 19
independent Bernoulli variables:
n <- 19
p <- runif(n)
system.time(enum.prob.product(n,p))
## user system elapsed
## 24.064 1.470 26.082
This is on my 2 GHz MacBook (circa 2009). It should be noted that the computation itself is quite fast; it is the enumeration of the branches of the probability tree (I would guess the unlist
within that) that is taking the bulk of the time. Any suggestions from the community on another approach to do that will be appreciated.