pythonnumpyrandombinomial-cdf

Efficient sampling from a 'partial' binomial distribution


I am want to sample from the binomial distribution B(n,p) but with an additional constraint that the sampled value belongs in the range [a,b] (instead of the normal 0 to n range). In other words, I have to sample a value from binomial distribution given that it lies in the range [a,b]. Mathematically, I can write the pmf of this distribution (f(x)) in terms of the pmf of binomial distribution bin(x) = [(nCx)*(p)^x*(1-p)^(n-x)] as

sum = 0
for i in range(a,b+1):
    sum += bin(i)

f(x) = bin(x)/sum

One way of sampling from this distribution is to sample a uniformly distributed number and apply the inverse of the CDF(obtained using the pmf). However, I don't think this is a good idea as the pmf calculation would easily get very time-consuming.

The values of n,x,a,b are quite large in my case and this way of computing pmf and then using a uniform random variable to generate the sample seems extremely inefficient due to the factorial terms in nCx.

What's a nice/efficient way to achieve this?


Solution

  • This is a way to collect all the values of bin in a pretty short time:

    from scipy.special import comb
    import numpy as np
    def distribution(n, p=0.5):
        x = np.arange(n+1)
        return comb(n, x, exact=False) * p ** x * (1 - p) ** (n - x)
    

    It can be done in a quarter of microsecond for n=1000.

    Sample run:

    >>> distribution(4):
    array([0.0625, 0.25  , 0.375 , 0.25  , 0.0625])
    

    You can sum specific parts of this array like so:

    >>> np.sum(distribution(4)[2:4])
    0.625
    

    Remark: For n>1000 middle values of this distribution requires to use extremely large numbers in multiplication therefore RuntimeWarning is raised.

    Bugfix

    You can use scipy.stats.binom equivalently:

    from scipy.stats import binom
    def distribution(n, p):
        return binom.pmf(np.arange(n+1), n, p)
    

    This does the same as above mentioned method quite efficiently (n=1000000 in a third of second). Alternatively, you can use binom.cdf(np.arange(n+1), n, p) which calculate cumulative sum of binom.pmf. Then subtraction of bth and ath items of this array gives an output which is very close to what you expect.