Tags: algorithm, math, probability, binomial-cdf

How can I efficiently calculate the binomial cumulative distribution function?


Let's say that I know the probability of a "success" is P. I run the test N times, and I see S successes. The test is akin to tossing an unevenly weighted coin (perhaps heads is a success, tails is a failure).

I want to know the approximate probability of seeing S successes, or a number of successes even less likely than S (i.e., the tail probability).

So for example, if P is 0.3, N is 100, and I get 20 successes, I'm looking for the probability of getting 20 or fewer successes.

If, on the other hand, P is 0.3, N is 100, and I get 40 successes, I'm looking for the probability of getting 40 or more successes.
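
To make the two cases concrete: in pseudo-code, the tail selection I have in mind looks something like this, where binomial_cdf is the hypothetical lower-tail function I'm asking for:

    def tail_prob(s, p, n):
        # binomial_cdf(s, p, n) = P(successes <= s) is the function
        # being asked for; only the case selection is sketched here.
        if s <= n * p:                            # fewer successes than expected
            return binomial_cdf(s, p, n)          # P(X <= s)
        else:                                     # more successes than expected
            return 1 - binomial_cdf(s - 1, p, n)  # P(X >= s)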

I'm aware that this problem relates to finding the area under a binomial curve, however:

  1. My math-fu is not up to the task of translating this knowledge into efficient code
  2. While I understand a binomial curve would give an exact result, I get the impression that it would be inherently inefficient. A fast method to calculate an approximate result would suffice.

I should stress that this computation has to be fast, and should ideally be doable with standard 64- or 128-bit floating-point arithmetic.

I'm looking for a function that takes P, S, and N - and returns a probability. As I'm more familiar with code than mathematical notation, I'd prefer that any answers employ pseudo-code or code.


Solution

  • Exact Binomial Distribution

    from functools import reduce

    def factorial(n):
        if n < 2: return 1
        return reduce(lambda x, y: x*y, range(2, int(n)+1))

    def prob(s, p, n):
        # Exact lower tail P(X <= s) for X ~ Binomial(n, p), computed as
        # the matching upper tail of the failure count: summing over
        # j = n-s .. n failures is the same as summing over k = 0 .. s
        # successes with k = n - j.
        x = 1.0 - p

        a = n - s
        b = s + 1

        c = a + b - 1    # c == n

        prob = 0.0

        for j in range(a, c + 1):
            # integer division keeps the binomial coefficient exact
            prob += factorial(c) // (factorial(j)*factorial(c-j)) \
                    * x**j * (1 - x)**(c-j)

        return prob
    
    >>> prob(20, 0.3, 100)
    0.016462853241869437
    
    >>> 1-prob(40-1, 0.3, 100)
    0.020988576003924564
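
    The loop above sums over failure counts; the same exact tail can also be written directly over success counts. A minimal equivalent sketch, assuming Python 3.8+ for math.comb:

    import math

    def binom_cdf(s, p, n):
        # Exact P(X <= s), summed directly over success counts 0..s.
        return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(s + 1))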
    

  • Normal Estimate: good for large n

    import math

    def erf(z):
        # Numerical Recipes rational approximation to the error function;
        # fractional error is below 1.2e-7 everywhere.
        t = 1.0 / (1.0 + 0.5 * abs(z))
        # evaluate the polynomial in t with Horner's method
        ans = 1 - t * math.exp(-z*z - 1.26551223 +
                      t * ( 1.00002368 +
                      t * ( 0.37409196 +
                      t * ( 0.09678418 +
                      t * (-0.18628806 +
                      t * ( 0.27886807 +
                      t * (-1.13520398 +
                      t * ( 1.48851587 +
                      t * (-0.82215223 +
                      t * ( 0.17087277))))))))))
        return ans if z >= 0.0 else -ans

    def normal_estimate(s, p, n):
        # Approximate P(X <= s) by treating X as N(np, np(1-p)).
        u = n * p                    # mean
        o = (u * (1 - p)) ** 0.5     # standard deviation

        return 0.5 * (1 + erf((s - u) / (o * 2**0.5)))
    
    >>> normal_estimate(20, 0.3, 100)
    0.014548164531920815
    
    >>> 1-normal_estimate(40-1, 0.3, 100)
    0.024767304545069813
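
    Python 2.7 / 3.2 and later ship math.erf, so the hand-rolled approximation above is optional. A continuity correction (evaluating the normal CDF at s + 0.5) usually tightens the estimate for discrete counts; a sketch of that variant:

    import math

    def normal_estimate_cc(s, p, n):
        # Normal approximation with continuity correction: P(X <= s)
        # is approximated by the normal CDF evaluated at s + 0.5.
        u = n * p
        o = (u * (1 - p)) ** 0.5
        return 0.5 * (1 + math.erf((s + 0.5 - u) / (o * 2**0.5)))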
    

  • Poisson Estimate: good for large n and small p

    import math

    def poisson(s, p, n):
        # Approximate P(X <= s) with a Poisson tail of rate L = n*p.
        L = n*p

        total = 0.0
        for i in range(0, s+1):
            total += L**i / math.factorial(i)

        return total * math.e**(-L)
    
    >>> poisson(20, 0.3, 100)
    0.013411150012837811
    >>> 1-poisson(40-1, 0.3, 100)
    0.046253037645840323
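
    For large s, L**i and factorial(i) both blow up even though each term stays small, which can overflow the float intermediate. If that matters, each term can be built from the previous one instead; a sketch under that assumption:

    import math

    def poisson_stable(s, p, n):
        # Same Poisson tail, but each term is derived from the previous
        # one via term_i = term_{i-1} * L / i, avoiding huge intermediates.
        L = n * p
        term = 1.0          # the i = 0 term: L**0 / 0!
        total = term
        for i in range(1, s + 1):
            term *= L / i
            total += term
        return total * math.exp(-L)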