I agree that, in the absence of documentation saying this is a problem, this is a bug. A reasonable workaround is to use the Normal approximation, which should be very good (and faster) for such large values. (I originally meant this to be short and simple, but it got a little out of hand.)
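
    ## Binomial sampler that falls back to a rounded Normal draw when size > max.size
    rbinom_safe <- function(n, size, prob, max.size = 2^31) {
        ## recycle size and prob to a common length
        maxlen <- max(length(size), length(prob), n)
        prob <- rep(prob, length.out = maxlen)
        size <- rep(size, length.out = maxlen)
        res <- numeric(n)
        bigvals <- size > max.size
        nbig <- sum(bigvals)
        if (nbig > 0) {
            ## Normal approximation: mean = size*prob, sd = sqrt(size*prob*(1-prob))
            m <- (size * prob)[bigvals]
            sd <- sqrt(size * prob * (1 - prob))[bigvals]
            res[bigvals] <- round(rnorm(nbig, mean = m, sd = sd))
        }
        if (nbig < n) {
            ## ordinary binomial draws for the remaining elements
            res[!bigvals] <- rbinom(n - nbig, size[!bigvals], prob[!bigvals])
        }
        return(res)
    }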

    set.seed(101)
    size <- c(1, 5, 10, 2^31, 2^32)
    rbinom_safe(5, size, prob = 0.95)
    rbinom_safe(5, 3, prob = 0.95)
    rbinom_safe(5, 2^32, prob = 0.95)

The Normal approximation should work reasonably well whenever the mean is many standard deviations away from 0 or N (whichever is closer). For large N this should be OK unless p is very extreme (very close to 0 or 1). For example:

    n <- 2^31
    p <- 0.95
    m <- n * p
    sd <- sqrt(n * p * (1 - p))
    set.seed(101)
    rr <- rbinom_safe(10000, n, prob = p)
    hist(rr, freq = FALSE, col = "gray", breaks = 50)
    curve(dnorm(x, mean = m, sd = sd), col = 2, add = TRUE)
    dd <- round(seq(m - 5 * sd, m + 5 * sd, length.out = 101))
    midpts <- (dd[-1] + dd[-length(dd)]) / 2
    lines(midpts, c(diff(sapply(dd, pbinom, size = n, prob = p)) / diff(dd)[1]),
          col = "blue", lty = 2)

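The "many standard deviations from 0 or N" criterion can also be checked numerically. This is only a rough sketch: `norm_approx_margin` is a hypothetical helper name (not part of base R), and how large a margin counts as "many" is a judgment call.

```r
## Distance, in standard deviations, from the binomial mean to the
## nearer boundary (0 or size). Hypothetical helper, for illustration only.
norm_approx_margin <- function(size, prob) {
    m <- size * prob
    s <- sqrt(size * prob * (1 - prob))
    pmin(m, size - m) / s
}

norm_approx_margin(2^31, 0.95)   # on the order of 1e4: approximation should be fine
norm_approx_margin(2^31, 1e-9)   # about 1.5: p is too extreme for this size
```

A small margin means a noticeable share of the Normal draws would fall outside the range 0 to N, so the rounded values would no longer look like binomial counts.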
`K - rbinom(1,K,.05)` – Pellerin

`rbinom(1,2^32,0.95) == 1 - rbinom(1,2^32,0.05)`. – Galarza

R handles large ints by casting them to `numeric`, aka "double." However, somewhere in the `rbinom` code, as Roland pointed out, something either got forced to `int` and blew up, or some other dumb thing happened. – Pellerin

`rbinom` is defined via the `DEFRAND2_INT` macro, which coerces the result to integer when returning from C to R. – Gambado
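
The integer-range issue raised in these comments can be illustrated from the R side. This is a sketch that assumes the coercion behaves like `as.integer()`; what `rbinom` itself returns for such inputs may depend on the R version. It also shows the complement trick from the first comment, where the failures (probability 0.05) are sampled instead so that the raw draw stays within integer range.

```r
K <- 2^32
.Machine$integer.max                           # 2147483647, i.e. 2^31 - 1
K * 0.95 > .Machine$integer.max                # TRUE: the expected count does not fit
is.na(suppressWarnings(as.integer(K * 0.95)))  # TRUE: out-of-range coercion gives NA

## Complement trick: draw the failures and subtract from K.
## The draw is near K * 0.05, which fits in an integer.
set.seed(101)
x <- K - rbinom(1, K, 0.05)
is.na(x)                                       # FALSE
```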