I want to sample from the binomial distribution B(n,p), but with the additional constraint that the sampled value lies in the range [a,b] (instead of the usual 0 to n range). In other words, I have to sample a value from the binomial distribution given that it lies in [a,b]. Mathematically, I can write the pmf of this distribution, f(x), in terms of the binomial pmf bin(x) = nCx * p^x * (1-p)^(n-x), as:
sum = 0
for i in range(a, b+1):
    sum += bin(i)
f(x) = bin(x) / sum
One way of sampling from this distribution is to draw a uniformly distributed number and apply the inverse of the CDF (obtained from the pmf). However, I don't think this is a good idea, as the pmf calculation would easily get very time-consuming.
The values of n, x, a, b are quite large in my case, and computing the pmf this way and then using a uniform random variable to generate the sample seems extremely inefficient due to the factorial terms in nCx.
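For concreteness, this is roughly the brute-force inverse-CDF sampler I would like to avoid (just a sketch with made-up helper names; bin_pmf is the pmf defined above):

import random
from math import comb

def bin_pmf(n, p, x):
    # binomial pmf: nCx * p^x * (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

def naive_truncated_sample(n, p, a, b):
    # normalizing constant: total probability mass on [a, b]
    total = sum(bin_pmf(n, p, i) for i in range(a, b + 1))
    # inverse CDF: walk the cumulative sum until it exceeds a uniform draw
    u = random.uniform(0, total)
    acc = 0.0
    for x in range(a, b + 1):
        acc += bin_pmf(n, p, x)
        if acc >= u:
            return x
    return b  # guard against floating-point round-off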
What's a nice/efficient way to achieve this?
Here is a way to compute all the values of bin in a pretty short time:
from scipy.special import comb
import numpy as np

def distribution(n, p=0.5):
    # binomial pmf evaluated at every x in 0..n
    x = np.arange(n + 1)
    return comb(n, x, exact=False) * p ** x * (1 - p) ** (n - x)
It can be done in a quarter of a microsecond for n=1000.
Sample run:
>>> distribution(4)
array([0.0625, 0.25 , 0.375 , 0.25 , 0.0625])
You can sum specific parts of this array like so (note that the end of the slice is exclusive, so [2:4] covers indices 2 and 3):
>>> np.sum(distribution(4)[2:4])
0.625
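Building on that (a sketch, not part of the original code: it reuses the distribution function above together with np.random.choice), you can normalize such a slice to get the constrained pmf f(x) from the question and draw samples from it directly:

import numpy as np

def truncated_sample(n, p, a, b, size=1):
    # pmf restricted to [a, b] (slice end is exclusive, hence b + 1)
    pmf = distribution(n, p)[a:b + 1]
    # normalize so the probabilities sum to 1, i.e. f(x) = bin(x) / sum
    pmf = pmf / pmf.sum()
    return np.random.choice(np.arange(a, b + 1), size=size, p=pmf)

For example, truncated_sample(4, 0.5, 2, 3, size=5) only ever returns 2s and 3s. This keeps everything vectorized, but it still builds the full length-(n+1) pmf array, so it is best suited to moderate n.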
Remark: For n > 1000, the middle values of this distribution require multiplying extremely large numbers, so a RuntimeWarning is raised.
You can use scipy.stats.binom equivalently:
import numpy as np
from scipy.stats import binom

def distribution(n, p):
    # binomial pmf evaluated at every x in 0..n
    return binom.pmf(np.arange(n + 1), n, p)
This does the same as the above-mentioned method quite efficiently (n=1000000 in about a third of a second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p), which is the cumulative sum of binom.pmf. Then the difference between the b-th and the (a-1)-th items of this array gives the total probability mass in [a, b], which is exactly the normalizing constant you need.
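To turn that into an actual sampler (a sketch, not spelled out above: it combines binom.cdf with binom.ppf, scipy's inverse CDF), draw a uniform number between the cdf values at a-1 and b and map it back through the inverse CDF, which avoids building the full pmf array altogether:

import numpy as np
from scipy.stats import binom

def sample_truncated_binomial(n, p, a, b, size=1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    lo = binom.cdf(a - 1, n, p)   # P(X <= a-1); this is 0 when a == 0
    hi = binom.cdf(b, n, p)       # P(X <= b)
    # uniform draws restricted to [lo, hi), mapped back through the inverse CDF
    u = rng.uniform(lo, hi, size=size)
    samples = binom.ppf(u, n, p)
    # clip guards against the rare floating-point edge case u == lo
    return np.clip(samples, a, b).astype(np.int64)

Since binom.ppf accepts arrays, many samples can be drawn in a single call without ever materializing the full length-(n+1) pmf.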