Is there a certain R-gotcha that had you really surprised one day? I think we'd all gain from sharing these.
Here's mine: in list indexing, my.list[[1]] is not my.list[1]. Learned this in the early days of R.
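For anyone hitting this for the first time, here is a minimal sketch of the difference (my.list is just a made-up example list):
my.list <- list(a = 1:3, b = "hello")
my.list[1]           # a list of length 1 containing the first element
my.list[[1]]         # the first element itself: the integer vector 1:3
class(my.list[1])    # "list"
class(my.list[[1]])  # "integer"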
Removing rows from a data.frame can later cause rows with non-unique names to be added, which then errors out:
> a<-data.frame(c(1,2,3,4),c(4,3,2,1))
> a<-a[-3,]
> a
c.1..2..3..4. c.4..3..2..1.
1 1 4
2 2 3
4 4 1
> a[4,1]<-1
> a
Error in data.frame(c.1..2..3..4. = c("1", "2", "4", "1"), c.4..3..2..1. = c(" 4", :
duplicate row.names: 4
So what is going on here is:
A four row data.frame is created, so the rownames are c(1,2,3,4)
The third row is deleted, so the rownames are c(1,2,4)
A fourth row is added, and R automatically sets the row name equal to the index i.e. 4, so the row names are c(1,2,4,4). This is illegal because row names should be unique. I don't see why this type of behavior should be allowed by R. It seems to me that R should provide a unique row name.
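One workaround (just a sketch, not the only option) is to reset the row names after deleting rows, so that rows added later get fresh, unique names:
a <- data.frame(c(1,2,3,4), c(4,3,2,1))
a <- a[-3,]
rownames(a) <- NULL  # row names become 1, 2, 3 again
a[4,1] <- 1          # adds row "4"; no duplicate row.names error this time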
[Hadley pointed this out in a comment.]
When using a sequence as an index for iteration, it's better to use the seq_along() function rather than something like 1:length(x).
Here I create a vector and both approaches return the same thing:
> x <- 1:10
> 1:length(x)
[1] 1 2 3 4 5 6 7 8 9 10
> seq_along(x)
[1] 1 2 3 4 5 6 7 8 9 10
Now make the vector NULL:
> x <- NULL
> seq_along(x) # returns an empty integer; good behavior
integer(0)
> 1:length(x) # length(NULL) is 0, so this gives c(1, 0); bad
[1] 1 0
This can cause some confusion in a loop:
> for(i in 1:length(x)) print(i)
[1] 1
[1] 0
> for(i in seq_along(x)) print(i)
>
The automatic creation of factors when you load data. You unthinkingly treat a column in a data frame as character, and this works well until you do something like trying to change a value to one that isn't a level. This will generate a warning but leave your data frame with NAs in it...
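A minimal sketch of the trap (note that since R 4.0.0 the default is stringsAsFactors = FALSE, so you have to opt in to reproduce it):
df <- data.frame(x = c("a", "b", "c"), stringsAsFactors = TRUE)
df$x[1] <- "z"  # "z" is not one of the factor levels
# Warning message:
# In `[<-.factor`(`*tmp*`, 1, value = "z") : invalid factor level, NA generated
df$x
# [1] <NA> b    c
# Levels: a b c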
When something goes unexpectedly wrong in your R script, check that factors aren't to blame. Use options("stringsAsFactors" = FALSE) in your startup file(s) to change this. – Entrenchment
This also applies to the data.frame constructor. Has bit me from behind many times as well. – Revile
Forgetting the drop = FALSE argument when subsetting matrices down to a single dimension, and thereby dropping the object's class as well:
R> X <- matrix(1:4,2)
R> X
[,1] [,2]
[1,] 1 3
[2,] 2 4
R> class(X)
[1] "matrix"
R> X[,1]
[1] 1 2
R> class(X[,1])
[1] "integer"
R> X[,1, drop=FALSE]
[,1]
[1,] 1
[2,] 2
R> class(X[,1, drop=FALSE])
[1] "matrix"
R>
First, let me say that I understand the fundamental problems of representing numbers in a binary system. Nevertheless, one problem that I think could be easily improved is the representation of numbers when the decimal value is beyond R's typical scope of presentation.
x <- 10.2 * 100
x
1020
as.integer(x)
1019
I don't mind the result being represented as an integer when it really can be represented as one. For example, if the value really was 1020 then printing that for x would be fine. But printing something as simple as 1020.0 in this case would have made it more obvious that the value was not an integer and not representable as one. R should default to some kind of indication when there is an extremely small decimal component that isn't displayed.
1.000000000001 would print as 1.; the other alternative would be to print an explicit L after integers, but that would be ugly. – Fractocumulus
It can be annoying to have to allow for combinations of NA, NaN and Inf. They behave differently, and tests for one won't necessarily work for the others:
> x <- c(NA,NaN,Inf)
> is.na(x)
[1] TRUE TRUE FALSE
> is.nan(x)
[1] FALSE TRUE FALSE
> is.infinite(x)
[1] FALSE FALSE TRUE
However the safest way to test any of these trouble-makers is:
> is.finite(x)
[1] FALSE FALSE FALSE
I think of NA as "I don't know (yet)", but my interpretation does not fit with is.infinite(NA) and is.finite(NA) returning FALSE: I had expected NA. – Puree
Always test what happens when you have an NA!
One thing that I always need to pay careful attention to (after many painful experiences) is NA values. R functions are easy to use, but no manner of programming will overcome issues with your data.
For instance, any arithmetic or aggregating vector operation involving an NA is equal to NA. This is "surprising" on the face of it:
> x <- c(1,1,2,NA)
> 1 + NA
[1] NA
> sum(x)
[1] NA
> mean(x)
[1] NA
This gets extrapolated out into other higher-level functions.
In other words, missing values frequently have as much importance as measured values by default. Many functions have na.rm=TRUE/FALSE defaults; it's worth spending some time deciding how to interpret these default settings.
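A quick illustration of the difference na.rm makes, using the vector from above:
x <- c(1, 1, 2, NA)
sum(x)                 # [1] NA
sum(x, na.rm = TRUE)   # [1] 4
mean(x, na.rm = TRUE)  # [1] 1.333333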
Edit 1: Marek makes a great point. NA values can also cause confusing behavior in indexes. For instance:
> TRUE && NA
[1] NA
> FALSE && NA
[1] FALSE
> TRUE || NA
[1] TRUE
> FALSE || NA
[1] NA
This is also true when you're trying to create a conditional expression (for an if statement):
> any(c(TRUE, NA))
[1] TRUE
> any(c(FALSE, NA))
[1] NA
> all(c(TRUE, NA))
[1] NA
When these NA values end up as your vector indexes, many unexpected things can follow. This is all good behavior for R, because it means that you have to be careful with missing values. But it can cause major headaches at the beginning.
(1:3)[c(TRUE,FALSE,NA)] gives 1, NA. It is easy to fall into this trap when you create a logical vector from a vector that contains NAs, e.g. (1:3)[c(1,2,NA) < 2]. – Sallysallyann
Forgetting that strptime() and friends return POSIXlt, for which length() is always nine -- converting to POSIXct helps:
R> length(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S"))
[1] 9
R> length(as.POSIXct(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S")))
[1] 1
R>
The round function rounds halves to the even number ("round half to even"):
> round(3.5)
[1] 4
> round(4.5)
[1] 4
Math on integers is subtly different from doubles (and sometimes complex is weird too).
UPDATE: They fixed some things in R 2.15:
1^NA # 1
1L^NA # NA
(1+0i)^NA # NA
0L %/% 0L # 0L (NA from R 2.15)
0 %/% 0 # NaN
4L %/% 0L # 0L (NA from R 2.15)
4 %/% 0 # Inf
I'm surprised that no one has mentioned this, but T and F can be overridden; TRUE and FALSE can't.
Example:
x <- sample(c(0,1,NA), 100, T)
T <- 0:10
mean(x, na.rm=T)
# Warning in if (na.rm) x <- x[!is.na(x)] :
# the condition has length > 1 and only the first element will be used
# Calls: mean -> mean.default
# [1] NA
plot(rnorm(7), axes=T)
# Warning in if (axes) { :
# the condition has length > 1 and only the first element will be used
# Calls: plot -> plot.default
# Warning in if (frame.plot) localBox(...) :
# the condition has length > 1 and only the first element will be used
# Calls: plot -> plot.default
[edit] Ctrl+F tricked me; Shane mentioned this in his comment.
T <- FALSE ; F <- TRUE inside someone's ~/.Rprofile... – Rosanarosane
Reading in data can be more problematic than you might think. Today I found that if you use read.csv() and a line in the .csv file is blank, read.csv() automatically skips it. This makes sense for most applications, but if you're automatically extracting data from (for example) row 27 of several thousand files, and some of the preceding rows may or may not be blank, things can go horribly wrong if you're not careful.
I now use
data1 <- read.table(file_name, blank.lines.skip = F, sep = ",")
When you're importing data, check that you're doing what you actually think you're doing again and again and again...
The tricky behaviour of the all.equal() function.
One of my recurring errors is comparing sets of floating point numbers. I have a CSV like:
... mu, tau, ...
... 0.5, 1.7, ...
Reading the file and trying to subset the data sometimes works and sometimes fails - of course, due to falling into the pits of the floating point trap again and again. At first the data contains only integer values, then later on it always transforms into real values, you know the story. Comparing should be done with the all.equal() function instead of the == operator, but of course the code I first wrote used the latter approach.
Yeah, cool, but all.equal() returns TRUE for equal numbers, but a textual error message if it fails:
> all.equal(1,1)
[1] TRUE
> all.equal(1:10, 1:5)
[1] "Numeric: lengths (10, 5) differ"
> all.equal(1:10, c(1:5,1:5))
[1] "Mean relative difference: 0.625"
The solution is using the isTRUE() function:
if (!isTRUE(all.equal(x, y, tolerance=doubleErrorRate))) {
...
}
How many times have I had to read the all.equal() description...
This one hurt so much that I spent hours adding comments to a bug-report. I didn't get my wish, but at least the next version of R will generate an error.
R> nchar(factor(letters))
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Update: As of R 3.2.0 (probably earlier), this example now generates an error message. As mentioned in the comments below, a factor is NOT a vector and nchar() requires a vector.
R> nchar(factor(letters))
Error in nchar(factor(letters)) : 'nchar()' requires a character vector
R> is.vector(factor(letters))
[1] FALSE
factor(letters) may not be a vector, but it can be treated as such; you can see it as a vector of factors. The first comment is close to what is happening here -- internally, factors are integers (see typeof(factor(letters))), so this output is the same as nchar(1:length(letters)): 1 when you have one digit, 2 for two digits. – Cassandracassandre
accidentally listing the source code of a function by forgetting to include empty parentheses: e.g. "ls" versus "ls()"
true & false don't cut it as pre-defined constants, like in Matlab, C++, Java, Python; you must use TRUE & FALSE
invisible return values: e.g. ".packages()" returns nothing, while "(.packages())" returns a character vector of package base names
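A small sketch of the last point, using a toy function that returns its value invisibly:
f <- function() invisible(42)
f()       # prints nothing
(f())     # [1] 42 -- wrapping in parentheses forces printing
out <- f()
out       # [1] 42 -- the value was returned, just not auto-printed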
For instance, the number 3.14 is a numerical constant, but the expressions +3.14 and -3.14 are calls to the functions + and -:
> class(quote(3.14))
[1] "numeric"
> class(quote(+3.14))
[1] "call"
> class(quote(-3.14))
[1] "call"
See Section 13.2 in John Chambers' book Software for Data Analysis: Programming with R.
Partial matching in the $ operator:
This applies to lists, but also to data.frames:
df1 <- data.frame(foo=1:10, foobar=10:1)
df2 <- data.frame(foobar=10:1)
df1$foo # Correctly gets the foo column
df2$foo # Expect NULL, but this returns the foobar column!!!
# So, should use double bracket instead:
df1[["foo"]]
df2[["foo"]]
The [[ operator also has an exact flag, but it is thankfully TRUE by default.
Partial matching also affects attr:
x1 <- structure(1, foo=1:10, foobar=10:1)
x2 <- structure(2, foobar=10:1)
attr(x1, "foo") # Correctly gets the foo attribute
attr(x2, "foo") # Expect NULL, but this returns the foobar attribute!!!
# So, should use exact=TRUE
attr(x1, "foo", exact=TRUE)
attr(x2, "foo", exact=TRUE)
Automatic repeating of vectors ("recycling") used as indices:
R> all.numbers <- c(1:5)
R> all.numbers
[1] 1 2 3 4 5
R> good.idxs <- c(T,F,T)
R> #note unfortunate length mismatch
R> good.numbers <- all.numbers[good.idxs]
R> good.numbers
[1] 1 3 4
R> #wtf?
R> #why would you repeat the vector used as an index
R> #without even a warning?
Zero-length vectors have some quirks:
R> kk=vector(mode="numeric",length=0)
R> kk
numeric(0)
R> sum(kk)
[1] 0
R> var(kk)
[1] NA
prod(numeric(0)) == 1 too. I'm sure this has been discussed before on the R mailing lists, but it's a good point. – Fractocumulus
Working with lists, there are a couple of unintuitive things:
Of course, the difference between [ and [[ takes some getting used to. For lists, [ returns a list of (potentially 1) elements, whereas [[ returns the element inside the list.
List creation:
# When you're used to this:
x <- numeric(5) # A vector of length 5 with zeroes
# ... this might surprise you
x <- list(5) # A list with a SINGLE element: the value 5
# This is what you have to do instead:
x <- vector('list', 5) # A vector of length 5 with NULLS
So, how to insert NULL into a list?
x <- list("foo", 1:3, letters, LETTERS) # A sample list
x[[2]] <- 1:5 # Put 1:5 in the second element
# The obvious way doesn't work:
x[[2]] <- NULL # This DELETES the second element!
# This doesn't work either:
x[2] <- NULL # This DELETES the second element!
# The solution is NOT very intuitive:
x[2] <- list(NULL) # Put NULL in the second element
# Btw, now that we think we know how to delete an element:
x <- 1:10
x[[2]] <- NULL # Nope, gives an ERROR!
x <- x[-2] # This is the only way for atomic vectors (works for lists too)
Finally some advanced stuff like indexing through a nested list:
x <- list(a=1:3, b=list(c=42, d=13, e="HELLO"), f='bar')
x[[c(2,3)]] # HELLO (first selects the second element and then its third element)
x[c(2,3)] # The second and third elements (b and f)
x[[2]] <- 1:5 puts 1:5 in the second element. And to extend your answer, x[1:2] <- 1:2 puts 1 in the first element and 2 in the second; x[1,2] works for nested lists (second element of the first element). – Sallysallyann
I don't find x[[2]] <- 1:5 and x[1:2] <- 1:2 that surprising. x[1,2] should be x[[c(1,2)]], and I updated the answer. Thanks! – Semblance
x[2] <- 1:5 gives me a warning and puts 1 in the second element of x. And I was wrong in my comment: I had in mind the difference between x[c(1,2)] (returns the 1st and 2nd elements) and x[[c(1,2)]] (returns the 2nd element of the 1st element). – Sallysallyann
One of the big confusions in R is that [i, drop = TRUE] does drop factor levels, but [i, j, drop = TRUE] does not!
> df = data.frame(a = c("europe", "asia", "oceania"), b = c(1, 2, 3))
> df$a[1:2, drop = TRUE]
[1] europe asia
Levels: asia europe <---- drops factor levels, works fine
> df[1:2,, drop = TRUE]$a
[1] europe asia
Levels: asia europe oceania <---- does not drop factor levels!
For more info see: drop = TRUE doesn't drop factor levels in data.frame while in vector it does
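The usual fix (a sketch, assuming df$a is a factor as in the example output above) is droplevels(), which drops unused levels from every factor column:
df2 <- droplevels(df[1:2, ])
levels(df2$a)
# [1] "asia"   "europe"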
Coming from compiled languages and Matlab, I've occasionally gotten confused about a fundamental aspect of functions in functional languages: they have to be defined before they're used! It's not enough for them just to be parsed by the R interpreter. This mostly rears its head when you use nested functions.
In Matlab you can do:
function f1()
v1 = 1;
v2 = f2();
fprintf('2 == %d\n', v2);
function r1 = f2()
r1 = v1 + 1 % nested function scope
end
end
If you try to do the same thing in R, you have to put the nested function first, or you get an error! Just because you've defined the function, it's not in the namespace until it's assigned to a variable! On the other hand, the function can refer to a variable that has not been defined yet.
f1 <- function() {
f2 <- function() {
v1 + 1
}
v1 <- 1
v2 = f2()
print(sprintf("2 == %d", v2))
}
If you change v1+1 to f3() in your example, and then define an f3 function before f2 gets called, it still works fine. – Czech
Mine from today: qnorm() takes probabilities and pnorm() takes quantiles.
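In other words, the two are inverses of each other; a quick sketch:
qnorm(0.975)     # [1] 1.959964 -- probability in, quantile out
pnorm(1.959964)  # [1] 0.975    -- quantile in, probability out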
For me it is the counter-intuitive way in which, when you export a data.frame to a text file using write.csv, you then need to add an additional argument when importing it afterwards to get exactly the same data.frame, like this:
write.csv(m, file = 'm.csv')
read.csv('m.csv', row.names = 1) # Note the row.names argument
I also posted this as a question on SO, and @BenBolker suggested it as an answer to this Q.
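An alternative sketch, if you don't actually need the row names, is to skip writing them in the first place:
write.csv(m, file = 'm.csv', row.names = FALSE)
read.csv('m.csv')  # no extra row-name column to deal with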
The apply set of functions does not only work on matrices, but scales up to multi-dimensional arrays. In my research I often have a dataset of, for example, the temperature of the atmosphere. This is stored in a multi-dimensional array with dimensions x,y,level,time, from now on called multi_dim_array. A mockup example would be:
multi_dim_array = array(runif(96 * 48 * 6 * 100, -50, 50),
dim = c(96, 48, 6, 100))
> str(multi_dim_array)
# x y lev time
num [1:96, 1:48, 1:6, 1:100] 42.4 16 32.3 49.5 24.9 ...
Using apply one can easily get:
# temporal mean value
> str(apply(multi_dim_array, 4, mean))
num [1:100] -0.0113 -0.0329 -0.3424 -0.3595 -0.0801 ...
# temporal mean value per gridcell (x,y location)
> str(apply(multi_dim_array, c(1,2), mean))
num [1:96, 1:48] -1.506 0.4553 -1.7951 0.0703 0.2915 ...
# temporal mean value per gridcell and level (x,y location, level)
> str(apply(multi_dim_array, c(1,2,3), mean))
num [1:96, 1:48, 1:6] -3.839 -3.672 0.131 -1.024 -2.143 ...
# Spatial mean per level
> str(apply(multi_dim_array, c(3,4), mean))
num [1:6, 1:100] -0.4436 -0.3026 -0.3158 0.0902 0.2438 ...
This makes the MARGIN argument to apply seem much less counter-intuitive. I first thought, why not use "row" and "col" instead of 1 and 2? But the fact that it also works for arrays with more dimensions makes it clear why using MARGIN like this is preferred.
which.min and which.max work opposite to expectations when used with a comparison operator, and can even give incorrect answers. So, for example, trying to figure out which element in a vector of sorted numbers is the largest number that is less than a threshold (i.e. in a sequence from 100 to 200, which is the largest number that is less than 110):
set.seed(420)
x = seq(100, 200)
which(x < 110)
> [1] 1 2 3 4 5 6 7 8 9 10
which.max(x < 110)
> [1] 1
which.min(x < 110)
> [1] 11
x[11]
> [1] 110
max(which(x < 110))
>[1] 10
x[10]
> [1] 109
The dirtiest gotcha, and one that can be really hard to find! Cutting multi-line expressions like this one:
K <- hyperpar$intcept.sigma2
+ cov.NN.additive(x1$env, x2 = NULL, sigma2_int = hyperpar$env.sigma2_int, sigma2_slope = hyperpar$env.sigma2_slope)
+ hyperpar$env.sigma2 * K.cache$k.env
R will only evaluate the first line, and the other two will just go to waste! And it will not give any warning, nothing! This is pretty nasty treachery on the unsuspecting user. It must actually be written like this:
K <- hyperpar$intcept.sigma2 +
cov.NN.additive(x1$env, x2 = NULL, sigma2_int = hyperpar$env.sigma2_int, sigma2_slope = hyperpar$env.sigma2_slope) +
hyperpar$env.sigma2 * K.cache$k.env
which is not quite a natural way of writing.
This one!
all(c(1,2,3,4) == NULL)
[1] TRUE
I had this check in my code; I really need both tables to have the same column names:
stopifnot(all(names(x$x$env) == names(x$obsx$env)))
But the check passed (evaluated to TRUE) when x$x$env
didn't even exist!
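What is going on under the hood: comparing a vector against NULL yields a zero-length logical vector, and all() of an empty vector is vacuously TRUE. A safer check (one possible sketch) compares the names with identical() instead:
c(1,2,3,4) == NULL  # logical(0)
all(logical(0))     # [1] TRUE
# identical() is FALSE when one side is NULL and the other is not:
stopifnot(identical(names(x$x$env), names(x$obsx$env)))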
You can use options(warn = 2), which, according to the manual:
If warn is two or larger all warnings are turned into errors.
Indeed, the warnings are turned into errors, but, gotcha! The code still continues running after such errors!!!
source("script.R")
# ...
# Loading required package: bayesmeta
# Failed with error: ‘(converted from warning) there is no package called ‘bayesmeta’’
# computing posterior (co)variances ...
# (script continues running)
...
PS: but some other errors converted from warning do stop the script... so I don't know, I am confused. This one did stop the script:
Error in optimise(psiline, c(0, 2), adiff, a, as.matrix(K), y, d0, mn, :
(converted from warning) NA/Inf replaced by maximum positive value
Error: unexpected 'else' in "else" will pop up when you put a newline after the curly brace in an if statement: if { ... } \n else { ... }. – Alevin
The choose function: choose(n, k) isn't the number of k-element subsets of an n-element set. For example, choose(-4, 2) == 10. – Walkout
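The reason, as far as I understand it, is that choose() uses the generalized binomial coefficient, which is also defined for negative n:
choose(-4, 2)  # [1] 10, i.e. (-4 * -5) / 2
choose(4, 2)   # [1] 6, the usual "number of 2-element subsets of a 4-element set"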