possible bug in `rbinom()` for large numbers of trials

Asked 7/10, 2014 at 16:49 Answered 15/8, 2024 at 21:24

I've been writing some code that iteratively performs binomial draws (using rbinom) and for some callee arguments I can end up with the size being large, which causes R (3.1.1, both official or homebrew builds tested—so unlikely to be compiler related) to return an unexpected NA. For example:

rbinom(1,2^32,0.95)

is what I'd expect to work, but gives NA back. However, running with size=2^31 or prob≤0.5 works.

The fine manual mentions inversion being used when size < .Machine$integer.max is false, could this be the issue?

Robinia answered 7/10, 2014 at 16:49 Comment(4)

It also occurs to me that there's a symmetry in binomial chances, i.e. "win = N- lose" ,so just do K - rbinom(1,K,.05) – Pellerin 7/10, 2014 at 17:34

I think @CarlWitthoft figured it out, at least in part. As you pointed out, R can't handle integers bigger than 2^32, and it isn't smart enough to know that rbinom(1,2^32,0.95) == 1 - rbinom(1,2^32,0.05). – Galarza 8/10, 2014 at 14:3

@ssdecontrol R handles large ints by casting them to numeric aka "double." However, somewhere in the rbinom code, as Roland pointed out, something either got forced to int and blew up, or some other dumb thing happened. – Pellerin 8/10, 2014 at 15:33

I think the problem is in src/library/stats/src/random.c, where rbinom is defined via the DEFRAND2_INT macro, which coerces the result to integer when returning from C to R. – Gambado 8/10, 2014 at 17:8

Looking at the source rbinom does the equivalent (in C code) of the following for such large sizes:

qbinom(runif(n), size, prob, FALSE)

And indeed:

set.seed(42)
rbinom(1,2^31,0.95)
#[1] 2040095619
set.seed(42)
qbinom(runif(1), 2^31, 0.95, F)
#[1] 2040095619

However:

set.seed(42)
rbinom(1,2^32,0.95)
#[1] NA
set.seed(42)
qbinom(runif(1), 2^32, 0.95, F)
#[1] 4080199349

As @BenBolker points out rbinom returns an integer and if the return value is larger than .Machine$integer.max, e.g., larger than 2147483647 on my machine, NA gets returned. In contrast qbinom returns a double. I don't know why and it doesn't seem to be documented.

So, it seems like there is at least undocumented behavior and you should probably report it.

Igenia answered 7/10, 2014 at 17:31 Comment(4)

Thanks, it looked like a bug to me as well. Will do something like @Ben suggests until a fix gets committed. – Robinia 8/10, 2014 at 13:23

On second thought, I think the only "bug" is that the results of rbinom() are coerced to integer, which becomes NA when greater than .Machine$integer.max. You could just use qbinom(runif(n),size,prob), which would certainly be more compact than my answer. – Gambado 8/10, 2014 at 14:36

@BenBolker That makes me wonder why rbinom returns an integer, but qbinom a numeric. – Igenia 8/10, 2014 at 16:39

@Igenia good point. At the very least, then, it's an inconsistency. Sam, have you reported this? It might be worth bringing it up on [email protected] first ... – Gambado 8/10, 2014 at 17:1

I agree that (in the absence of documentation saying this is a problem) that this is a bug. A reasonable workaround would be using the Normal approximation, which should be very very good indeed (and faster) for such large values. (I originally meant this to be short and simple but it ended up getting a little bit out of hand.)

rbinom_safe <- function(n,size,prob,max.size=2^31) {
    maxlen <- max(length(size),length(prob),n)
    prob <- rep(prob,length.out=maxlen)
    size <- rep(size,length.out=maxlen)
    res <- numeric(n)
    bigvals <- size>max.size
    if (nbig <- sum(bigvals>0)) {
        m <- (size*prob)[bigvals]
        sd <- sqrt(size*prob*(1-prob))[bigvals]
        res[bigvals] <- round(rnorm(nbig,mean=m,sd=sd))
    }
    if (nbig<n) {
        res[!bigvals] <- rbinom(n-nbig,size[!bigvals],prob[!bigvals])
    }
    return(res)
}

set.seed(101)
size <- c(1,5,10,2^31,2^32)
rbinom_safe(5,size,prob=0.95)
rbinom_safe(5,3,prob=0.95)
rbinom_safe(5,2^32,prob=0.95)

The Normal approximation should work reasonably well whenever the mean is many standard deviations away from 0 or 1 (whichever is closer). For large N this should be OK unless p is very extreme. For example:

n <- 2^31
p <- 0.95
m <- n*p
sd <- sqrt(n*p*(1-p))
set.seed(101)
rr <- rbinom_safe(10000,n,prob=p)
hist(rr,freq=FALSE,col="gray",breaks=50)
curve(dnorm(x,mean=m,sd=sd),col=2,add=TRUE)
dd <- round(seq(m-5*sd,m+5*sd,length.out=101))
midpts <- (dd[-1]+dd[-length(dd)])/2
lines(midpts,c(diff(sapply(dd,pbinom,size=n,prob=p))/diff(dd)[1]),
      col="blue",lty=2)

enter image description here

Gambado answered 7/10, 2014 at 18:34 Comment(2)

Isn't the normal approximation only recommended when prob is near 0.5? I'd expect values between 0.7 and 0.975 to be used commonly. I guess I could test to see if it works well enough! – Robinia 8/10, 2014 at 13:21

see edits. (also see Wikipedia for more information on the Normal approx). – Gambado 8/10, 2014 at 14:0

This is the intended behaviour, but there are two issues: 1) The NA induced by coercion should raise a warning 2) The fact that discrete random variables have storage mode integer should be documented.

I have fixed 1) and will modify the documentation to fix 2) when I have a little more time.

Divided answered 9/10, 2014 at 14:43 Comment(0)

It appears that, despite Martyn Plummer saying in his answer (from 2014) that returning an integer value (and NA-with-a-warning when the value overflowed) was the intended behaviour, this is no longer true in at least the development version of R (4.5.0/SVN r86945).

set.seed(101); str(rbinom(1,1000,0.95))
##  int 942
set.seed(101); str(rbinom(1,2^50,0.95))
##  num 1.07e+15

Gambado answered 15/8, 2024 at 21:24 Comment(0)

Recommended topics

Hot tags