How can I efficiently calculate the binomial cumulative distribution function?
Let's say that I know the probability of a "success" is P. I run the test N times, and I see S successes. The test is akin to tossing an unevenly weighted coin (perhaps heads is a success, tails is a failure).

I want to know the approximate probability of seeing either S successes, or a number of successes less likely than S successes.

So for example, if P is 0.3, N is 100, and I get 20 successes, I'm looking for the probability of getting 20 or fewer successes.

If, on the other hand, P is 0.3, N is 100, and I get 40 successes, I'm looking for the probability of getting 40 or more successes.

I'm aware that this problem relates to finding the area under a binomial curve, however:

  1. My math-fu is not up to the task of translating this knowledge into efficient code
  2. While I understand a binomial curve would give an exact result, I get the impression that it would be inherently inefficient. A fast method to calculate an approximate result would suffice.

I should stress that this computation has to be fast, and should ideally be determinable with standard 64 or 128 bit floating point computation.

I'm looking for a function that takes P, S, and N - and returns a probability. As I'm more familiar with code than mathematical notation, I'd prefer that any answers employ pseudo-code or code.

Triphammer answered 8/7, 2009 at 1:13 Comment(5)
Why are big numbers a problem? If you're using a binomial distribution for large values, it must mean that you can't accept the error that results from approximating the binomial distribution with a continuous distribution, so you should be willing to sacrifice speed for accuracy by using a bignum library like GMP.Foretopmast
In that case use the Gaussian error function (en.wikipedia.org/wiki/Error_function), or rather an approximation to it.Misplace
Or there are some bounds given here: en.wikipedia.org/wiki/Q-functionMisplace
When did 128 bit floating point computation become standard?Aerostatics
It's too bad this question settles for an approximation. It would be interesting to see an efficient exact solution. I have my own exact solution which I think is fast, but I want to know if there is a better method.Aerostatics

Exact Binomial Distribution

from functools import reduce

def factorial(n):
    if n < 2: return 1
    return reduce(lambda x, y: x*y, range(2, int(n)+1))

def prob(s, p, n):
    x = 1.0 - p

    a = n - s
    b = s + 1

    c = a + b - 1

    prob = 0.0

    for j in range(a, c + 1):
        prob += factorial(c) / (factorial(j)*factorial(c-j)) \
                * x**j * (1 - x)**(c-j)

    return prob

>>> prob(20, 0.3, 100)
0.016462853241869437

>>> 1-prob(40-1, 0.3, 100)
0.020988576003924564

Normal Estimate, good for large n

import math
def erf(z):
    t = 1.0 / (1.0 + 0.5 * abs(z))
    # use Horner's method
    ans = 1 - t * math.exp(-z*z - 1.26551223 +
                t * ( 1.00002368 +
                t * ( 0.37409196 +
                t * ( 0.09678418 +
                t * (-0.18628806 +
                t * ( 0.27886807 +
                t * (-1.13520398 +
                t * ( 1.48851587 +
                t * (-0.82215223 +
                t * ( 0.17087277))))))))))
    if z >= 0.0:
        return ans
    else:
        return -ans

def normal_estimate(s, p, n):
    u = n * p
    o = (u * (1-p)) ** 0.5

    return 0.5 * (1 + erf((s-u)/(o*2**0.5)))

>>> normal_estimate(20, 0.3, 100)
0.014548164531920815

>>> 1-normal_estimate(40-1, 0.3, 100)
0.024767304545069813

Poisson Estimate: Good for large n and small p

import math

def poisson(s, p, n):
    L = n * p

    total = 0.0
    for i in range(0, s + 1):
        total += L**i / math.factorial(i)

    return total * math.exp(-L)

>>> poisson(20, 0.3, 100)
0.0352846184542287
>>> 1-poisson(40-1, 0.3, 100)
0.046253037645840323
Selenaselenate answered 11/7, 2009 at 19:42 Comment(2)
The normal approximation gets better a lot faster if you use the continuity correction, and interpret that an integer valued variable <= n corresponds to a continuous variable <n+0.5. So, use s+0.5 instead of s in normal_estimate.Formalism
I got this: >>> poisson(20, 0.3, 100) 0.0352846184542287 ... how can your result look like that?Illampu
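Following the continuity-correction suggestion in the comment above, a sketch of the adjusted estimate (using Python's built-in math.erf rather than the hand-rolled approximation; the function name is mine):

```python
import math

def normal_estimate_cc(s, p, n):
    # Normal approximation to P(X <= s) with the continuity correction:
    # treat the integer count s as the continuous value s + 0.5
    u = n * p                   # mean
    o = math.sqrt(u * (1 - p))  # standard deviation
    return 0.5 * (1 + math.erf((s + 0.5 - u) / (o * math.sqrt(2))))
```

For the question's examples this lands noticeably closer to the exact values than the uncorrected estimate.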

I was on a project where we needed to be able to calculate the binomial CDF in an environment that didn't have a factorial or gamma function defined. It took me a few weeks, but I ended up coming up with the following algorithm which calculates the CDF exactly (i.e. no approximation necessary). Python is basically as good as pseudocode, right?

import numpy as np

def binomial_cdf(x,n,p):
    cdf = 0
    b = 0
    for k in range(x+1):
        if k > 0:
            b += np.log(n-k+1) - np.log(k)
        log_pmf_k = b + k * np.log(p) + (n-k) * np.log(1-p)
        cdf += np.exp(log_pmf_k)
    return cdf

Performance scales with x. For small values of x, this solution is about an order of magnitude faster than scipy.stats.binom.cdf, with similar performance at around x=10,000.

I won't go into a full derivation of this algorithm because stackoverflow doesn't support MathJax, but the thrust of it is first identifying the following equivalence:

  • For all k > 0, sp.misc.comb(n,k) == np.prod([(n-i+1)/i for i in range(1,k+1)])

Which we can rewrite as:

  • sp.misc.comb(n,k) == sp.misc.comb(n,k-1) * (n-k+1)/k

or in log space:

  • np.log( sp.misc.comb(n,k) ) == np.log(sp.misc.comb(n,k-1)) + np.log(n-k+1) - np.log(k)

Because the CDF is a summation of PMFs, we can use this formulation to calculate the binomial coefficient (the log of which is b in the function above) for PMF_{x=i} from the coefficient we calculated for PMF_{x=i-1}. This means we can do everything inside a single loop using accumulators, and we don't need to calculate any factorials!

The reason most of the calculations are done in log space is to improve the numerical stability of the polynomial terms: p^x and (1-p)^(n-x) have the potential to be extremely large or extremely small, which can cause computational errors.

EDIT: Is this a novel algorithm? I've been poking around on and off since before I posted this, and I'm increasingly wondering if I should write this up more formally and submit it to a journal.

Buddhism answered 24/8, 2017 at 19:6 Comment(2)
niiiiiiiiiiice!Wey
this came in quite useful on a project of mine. thank you!Thaumaturgy

I think you want to evaluate the incomplete beta function.

There's a nice implementation using a continued fraction representation in "Numerical Recipes In C", chapter 6: 'Special Functions'.
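To make that concrete, here is a minimal sketch of the continued-fraction approach (modified Lentz's method, modeled on the Numerical Recipes betacf/betai routines; the function names and the use of math.lgamma for the log-gamma prefactor are my choices, not the book's exact code):

```python
import math

def betacf(a, b, x, eps=3e-12, max_iter=200):
    # Continued-fraction evaluation of the incomplete beta function
    # (modified Lentz's method, as in Numerical Recipes 6.4)
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny: d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # even step
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny: d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny: c = tiny
        d = 1.0 / d
        h *= d * c
        # odd step
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny: d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny: c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def betai(a, b, x):
    # Regularized incomplete beta function I_x(a, b)
    if x <= 0.0: return 0.0
    if x >= 1.0: return 1.0
    ln_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(ln_front)
    # use whichever tail of the continued fraction converges fast
    if x < (a + 1.0) / (a + b + 2.0):
        return front * betacf(a, b, x) / a
    return 1.0 - front * betacf(b, a, 1.0 - x) / b

def binom_cdf(s, n, p):
    # P(X <= s) for X ~ Binomial(n, p), via P(X <= s) = I_{1-p}(n-s, s+1)
    if s >= n: return 1.0
    return betai(n - s, s + 1.0, 1.0 - p)
```

For the question's example, binom_cdf(20, 100, 0.3) agrees with the exact summation to roughly machine precision, with a fixed small number of iterations rather than a sum over s terms.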

Madras answered 8/7, 2009 at 1:38 Comment(4)
Can you explain how the incomplete beta function solves this problem?Triphammer
The probability P of an event occurring with probability p, k or more times in n trials, is the cumulative binomial probability. It's related to the incomplete beta function. Have a look at equation 6.4.12 on page 229 of "Numerical Recipes in C".Madras
see also en.wikipedia.org/wiki/…Diversification
As a note to others who might become confused like I was: Wolfram Alpha refers to this function as the "regularized incomplete beta function" with the function name BetaRegularized. So with a probability of p, trials t and successful trials s, you want to evaluate: BetaRegularized(p, s, t - s + 1)Hass

I can't totally vouch for the efficiency, but Scipy has a module for this

from scipy.stats.distributions import binom
binom.cdf(successes, attempts, chance_of_success_per_attempt)
Ur answered 8/2, 2012 at 23:40 Comment(0)

An efficient and, more importantly, numerically stable algorithm exists in the domain of Bezier curves used in Computer Aided Design. It is called de Casteljau's algorithm, and it is used to evaluate the Bernstein polynomials that define Bezier curves.

I believe that I am only allowed one link per answer, so start with Wikipedia - Bernstein Polynomials.

Notice the very close relationship between the Binomial Distribution and the Bernstein Polynomials. Then click through to the link on de Casteljau's algorithm.

Let's say I know the probability of throwing heads with a particular coin is P. What is the probability of throwing the coin T times and getting at least S heads?

  • Set n = T
  • Set beta[i] = 0 for i = 0, ... S - 1
  • Set beta[i] = 1 for i = S, ... T
  • Set t = p
  • Evaluate B(t) using de Casteljau

or at most S heads?

  • Set n = T
  • Set beta[i] = 1 for i = 0, ... S
  • Set beta[i] = 0 for i = S + 1, ... T
  • Set t = p
  • Evaluate B(t) using de Casteljau
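A rough Python sketch of those two recipes (the helper names are mine; de_casteljau works on a copy of the coefficient list):

```python
def de_casteljau(beta, t):
    # Evaluate the Bernstein-form polynomial sum_i beta[i] * B_{i,n}(t)
    # by repeated linear interpolation (numerically stable, O(n^2))
    b = list(beta)
    for r in range(1, len(b)):
        for i in range(len(b) - r):
            b[i] = (1 - t) * b[i] + t * b[i + 1]
    return b[0]

def at_least(s, t_flips, p):
    # P(at least s heads in t_flips tosses): coefficients 0 below s, 1 from s up
    beta = [0.0] * s + [1.0] * (t_flips - s + 1)
    return de_casteljau(beta, p)

def at_most(s, t_flips, p):
    # P(at most s heads in t_flips tosses): coefficients 1 up to s, 0 above
    beta = [1.0] * (s + 1) + [0.0] * (t_flips - s)
    return de_casteljau(beta, p)
```

The coefficient lists pick out exactly the binomial terms you want to sum, and de Casteljau never forms the potentially huge binomial coefficients explicitly.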

Open source code probably exists already. NURBS Curves (Non-Uniform Rational B-spline Curves) are a generalization of Bezier Curves and are widely used in CAD. Try openNurbs (the license is very liberal) or failing that Open CASCADE (a somewhat less liberal and opaque license). Both toolkits are in C++, though, IIRC, .NET bindings exist.

Filibertofilibuster answered 8/7, 2009 at 5:28 Comment(0)

If you are using Python, no need to code it yourself. Scipy got you covered:

from scipy.stats import binom
# probability that you get 20 or less successes out of 100, when p=0.3
binom.cdf(20, 100, 0.3)
>>> 0.016462853241869434

# probability that you get exactly 20 successes out of 100, when p=0.3
binom.pmf(20, 100, 0.3)
>>> 0.0075756449257260777
Diaconicum answered 8/4, 2016 at 2:7 Comment(0)

From the portion of your question "getting at least S heads" you want the cumulative binomial distribution function. See http://en.wikipedia.org/wiki/Binomial_distribution for the equation, which is described as being in terms of the "regularized incomplete beta function" (as already answered). If you just want to calculate the answer without having to implement the entire solution yourself, the GNU Scientific Library provides the functions gsl_cdf_binomial_P and gsl_cdf_binomial_Q.

Foreshank answered 8/7, 2009 at 4:59 Comment(0)

The DCDFLIB Project has C# functions (wrappers around C code) to evaluate many CDF functions, including the binomial distribution. You can find the original C and FORTRAN code here. This code is well tested and accurate.

If you want to write your own code to avoid being dependent on an external library, you could use the normal approximation to the binomial mentioned in other answers. Here are some notes on how good the approximation is under various circumstances. If you go that route and need code to compute the normal CDF, here's Python code for doing that. It's only about a dozen lines of code and could easily be ported to any other language. But if you want high accuracy and efficient code, you're better off using third party code like DCDFLIB. Several man-years went into producing that library.
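For illustration, a normal-CDF routine along those lines is indeed tiny in modern Python, which ships math.erf in the standard library (the function name normal_cdf is mine, not necessarily the linked code's):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # CDF of the normal distribution, expressed via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
```

Feed it n*p as the mean and sqrt(n*p*(1-p)) as the standard deviation to get the normal approximation to the binomial CDF discussed above.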

Aharon answered 12/7, 2009 at 2:2 Comment(0)

Try this one, used in GMP. Another reference is this.

Gabrielson answered 8/7, 2009 at 1:21 Comment(2)
My bad for not explaining it better, but the title did a poor job of explaining what I needed. Please refer to the coin tossing problem I mention in the body of the question.Triphammer
Hm... someone just upvoted the 2nd google hit for "binomial coefficient algorithm"...Bradway
import numpy as np
np.random.seed(1)
# 10000 simulated runs of 20 coin flips each, with probability 0.6 of heads
x = np.random.binomial(20, 0.6, 10000)
sum(x > 12) / len(x)

The output is about 0.41: in roughly 41% of the simulated runs we got more than 12 heads.
Xerarch answered 23/8, 2018 at 9:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.