What is integer overflow in R and how can it happen?

I have some calculation going on and get the following warning (i.e. not an error):

Warning messages:
1: In sum(myvar, na.rm = T) :
Integer overflow - use sum(as.numeric(.))

In this thread people state that integer overflows simply don't happen. Either R isn't overly modern or they are not right. However, what am I supposed to do here? If I use as.numeric as the warning suggests, I might not account for the fact that information was lost well before that point. myvar is read from a .csv file, so shouldn't R figure out that a bigger field is needed? Does it already cut something off?

What's the max length of integer or numeric? Would you suggest any other field type / mode?

EDIT: I run:

R version 2.13.2 (2011-09-30) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) within R Studio

Vexed answered 10/1, 2012 at 14:24 Comment(0)
D
45

You can answer many of your questions by reading the help page ?integer. It says:

R uses 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9.

Expanding to larger integers is under consideration by R Core but it's not going to happen in the near future.

If you want a "bignum" capacity, install Martin Maechler's Rmpfr package [PDF]. I recommend Rmpfr because of its author's reputation: Martin Maechler is also heavily involved in developing the Matrix package and is a member of R Core. Alternatives include the arithmetic packages 'gmp', 'Brobdingnag' and 'Ryacas' (the last also offers a symbolic math interface).

Next, to respond to the critical comments in the answer you linked to, and how to assess the relevance to your work, consider this: If there were the same statistical functionality available in one of those "modern" languages as there is in R, you would probably see a user migration in that direction. But I would say that migration, and certainly growth, is in the R direction at the moment. R was built by statisticians for statistics.

There was at one time a Lisp variant with a statistics package, Xlisp-Stat, but its main developer and proponent is now a member of R-Core. On the other hand one of the earliest R developers, Ross Ihaka, suggests working toward development in a Lisp-like language [PDF]. There is a compiled language called Clojure (pronounced as English speakers would say "closure") with an experimental interface, Rincanter.

Update:

Newer versions of R (3.0.0 and later) have 53-bit integers of a sort, using the mantissa of a double. When an element of an "integer" vector is assigned a value in excess of .Machine$integer.max, the entire vector is coerced to "numeric", a.k.a. "double". The maximum value for the integer type itself remains as it was, and arithmetic carried out purely on integers can still overflow to NA with a warning. Each dimension of a matrix or array is still limited to .Machine$integer.max, although since R 3.0.0 vectors and lists themselves can be longer than that on 64-bit platforms ("long vectors").
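
A minimal sketch of the coercion described above, using only base R. The variable names are made up for illustration; the point is that assigning a double larger than .Machine$integer.max promotes the whole vector, whereas arithmetic between two integers is not promoted.

```r
x <- 1:3                       # integer vector
type_before <- typeof(x)       # "integer"

x[2] <- 2^31                   # 2^31 is a double just above .Machine$integer.max
type_after <- typeof(x)        # "double": the whole vector was promoted

# Pure integer arithmetic is not promoted; it overflows to NA with a warning
still_na <- suppressWarnings(.Machine$integer.max + 1L)
```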

When reading in large values from files, it is probably safer to use character-class as the target and then manipulate. If there is coercion to NA values, there will be a warning.
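
As a rough illustration of the character-first approach, the sketch below reads a small in-memory CSV (the column names and values are invented) with colClasses = "character" and converts afterwards. For comparison it also reads the same text with the defaults, where read.csv already stores a column as double when a value does not fit in an integer.

```r
csv_text <- "id,amount\n1,2147483647\n2,3000000000\n"   # invented sample data

# Read everything as character, then convert explicitly
as_text <- read.csv(text = csv_text, colClasses = "character")
amount <- as.numeric(as_text$amount)

# Default type conversion: "id" fits in an integer, "amount" does not
auto <- read.csv(text = csv_text)
auto_classes <- sapply(auto, class)
print(auto_classes)
```

Reading as character first mainly helps when you want to inspect or validate the raw strings before deciding on a numeric type.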

Drennen answered 10/1, 2012 at 14:37 Comment(3)
The gmp package may also be of interest. – Nikolia
I'm doing a DT[,sapply(.SD,sum,na.rm=T)] with a data.table filled with 0, 1 and NA, with 2 million rows. I get the overflow message, but the maximum number generated should be less than 2 million. What could be happening? – Palladic
I think you should post more information. Offhand I would guess that creating a matrix (as sapply would attempt to do when the default for 'simplify' is unchanged) would require multiplying the number of rows by the number of columns to get the length of the argument supplied to sum. That might be more than you expected. – Drennen
T
26

In short, integer is an exact type with limited range, and numeric is a floating-point type that can represent a much wider range of values but is inexact. See the help pages (?integer and ?numeric) for further details.

As to the overflow, here is an explanation by Brian D. Ripley:

It means that you are taking the mean [in your case, the sum -- @aix] of some very large integers, and the calculation is overflowing. It is just a warning.

This will not happen in the next release of R.

You can specify that a number is an integer by giving it the suffix L, for example, 1L is the integer one, as opposed to 1 which is a floating point one, with class "numeric".
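
A short base-R check of which literals and constructors give integers may help here. It is only a sketch (the names in the vector are labels invented for this example), but it shows how integer vectors can arise without an explicit L.

```r
types <- c(
  suffix_L = typeof(1L),           # literal with the L suffix
  plain    = typeof(1),            # literal without a suffix
  colon    = typeof(1:10),         # sequence built with the colon operator
  combine  = typeof(c(1, 2, 3)),   # vector of plain literals
  mixed    = typeof(1L + 1)        # integer plus double
)
print(types)

same <- identical(as.integer(1), 1L)   # as.integer(1) and 1L are the same value
```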

The largest integer that you can create on your machine is given by .Machine$integer.max.

> .Machine$integer.max
[1] 2147483647
> class(.Machine$integer.max)
[1] "integer"

Adding a positive integer to this causes an overflow, returning NA.

> .Machine$integer.max + 1L
[1] NA
Warning message:
In .Machine$integer.max + 1L : NAs produced by integer overflow
> class(.Machine$integer.max + 1L)
[1] "integer"

You can get round this limit by adding floating point values instead.

> .Machine$integer.max + 1
[1] 2147483648
> class(.Machine$integer.max + 1)
[1] "numeric"

Since in your case the warning is issued by sum, this indicates that the overflow happens when the numbers are added together. The suggested workaround sum(as.numeric(.)) should do the trick.
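
To make the workaround concrete, here is a small sketch in base R with made-up values. It also touches on exactness: a double holds every whole number up to 2^53 exactly, so sum(as.numeric(.)) should lose nothing as long as the values and their running total stay below that bound.

```r
x <- c(.Machine$integer.max, 1L)       # made-up data, both elements are integers

int_sum <- suppressWarnings(sum(x))    # integer overflow: NA
dbl_sum <- sum(as.numeric(x))          # 2147483648, stored as a double

# Whole numbers are exact in a double only up to 2^53
exact_limit <- 2^53
beyond_is_lost <- (exact_limit + 1) == exact_limit   # TRUE: the +1 is rounded away
```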

Tsai answered 10/1, 2012 at 14:35 Comment(4)
ok, what if I want to have an exact calculation and have big numbers? Exactly, overflows are created when numbers are added. Can I have an exact result anyway? – Vexed
I've fixed the description of what happens when you add numbers to the largest integer. – Doer
... but try this: class(sum(c(.Machine$integer.max, as.integer(1)))) for me I get an integer overflow (using 2.14). – Villeneuve
@Dason: Yup, as.integer(1) is the same as 1L so you don't get conversion to floating point. – Doer
D
5

What's the max length of integer or numeric?

Vectors are currently indexed with an integer, so the max length is given by .Machine$integer.max. As DWin noted, all versions of R currently use 32-bit integers, so this will be 2^31 - 1, or a little over 2 billion.

Unless you are packing some serious hardware (or you are reading this in the future; hello from 2012) you won't have enough memory to allocate vectors that long.

I remember a discussion where R-core (Brian Ripley, I think) suggested that the next step could be to index vectors with the mantissa of doubles, or something clever like that, effectively giving 48-bits of index. Sadly, I can't find that discussion.


In addition to the Rmpfr package, if you are suffering integer overflow, you might want to try the int64 package.

Doer answered 10/1, 2012 at 17:53 Comment(0)
S
1

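If c = a - b overflows because a and b are integers, convert the operands to double before subtracting (wrapping the result, as in as.double(a - b), is too late because the integer subtraction has already overflowed):

c = as.double(a) - as.double(b)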
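
One caveat worth checking: the conversion has to be applied to each operand before subtracting, since an integer subtraction that overflows has already become NA by the time a surrounding as.double() sees it. A sketch with made-up values:

```r
a <- -.Machine$integer.max   # made-up operands, both integers
b <- 2L

late  <- suppressWarnings(as.double(a - b))   # NA: a - b overflowed as integers first
early <- as.double(a) - as.double(b)          # -2147483649
```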
Simmonds answered 14/11, 2019 at 9:2 Comment(0)
