Is there a certain R-gotcha that had you really surprised one day? I think we'd all gain from sharing these.
Here's mine: in list indexing, my.list[[1]] is not my.list[1]. Learned this in the early days of R.
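For anyone hitting this for the first time, here is a minimal sketch of the difference (my.list is just a made-up example list):
my.list <- list(a = 1:3, b = "hello")
my.list[1]           # a list of length 1 containing the first element
my.list[[1]]         # the first element itself: the integer vector 1:3
class(my.list[1])    # "list"
class(my.list[[1]])  # "integer"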
Removing rows from a data.frame can later cause rows with non-unique names to be added, which then errors out:
> a<-data.frame(c(1,2,3,4),c(4,3,2,1))
> a<-a[-3,]
> a
c.1..2..3..4. c.4..3..2..1.
1 1 4
2 2 3
4 4 1
> a[4,1]<-1
> a
Error in data.frame(c.1..2..3..4. = c("1", "2", "4", "1"), c.4..3..2..1. = c(" 4", :
duplicate row.names: 4
So what is going on here is:
A four row data.frame is created, so the rownames are c(1,2,3,4)
The third row is deleted, so the rownames are c(1,2,4)
A fourth row is added, and R automatically sets the row name equal to the index i.e. 4, so the row names are c(1,2,4,4). This is illegal because row names should be unique. I don't see why this type of behavior should be allowed by R. It seems to me that R should provide a unique row name.
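One workaround (just a sketch, not the only option) is to reset the row names after deleting rows, so that rows added later get fresh, unique names:
a <- data.frame(c(1,2,3,4), c(4,3,2,1))
a <- a[-3,]
rownames(a) <- NULL  # row names become 1, 2, 3 again
a[4,1] <- 1          # adds row "4"; no duplicate row.names error this time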
[Hadley pointed this out in a comment.]
When using a sequence as an index for iteration, it's better to use the seq_along() function rather than something like 1:length(x).
Here I create a vector and both approaches return the same thing:
> x <- 1:10
> 1:length(x)
[1] 1 2 3 4 5 6 7 8 9 10
> seq_along(x)
[1] 1 2 3 4 5 6 7 8 9 10
Now make the vector NULL:
> x <- NULL
> seq_along(x) # returns an empty integer; good behavior
integer(0)
> 1:length(x) # length(NULL) is 0, so this gives c(1, 0); bad
[1] 1 0
This can cause some confusion in a loop:
> for(i in 1:length(x)) print(i)
[1] 1
[1] 0
> for(i in seq_along(x)) print(i)
>
The automatic creation of factors when you load data. You unthinkingly treat a column in a data frame as character, and this works well until you do something like trying to change a value to one that isn't a level. This will generate a warning but leave your data frame with NAs in it...
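A minimal sketch of the trap (note that since R 4.0.0 the default is stringsAsFactors = FALSE, so you have to opt in to reproduce it):
df <- data.frame(x = c("a", "b", "c"), stringsAsFactors = TRUE)
df$x[1] <- "z"  # "z" is not one of the factor levels
# Warning message:
# In `[<-.factor`(`*tmp*`, 1, value = "z") : invalid factor level, NA generated
df$x
# [1] <NA> b    c
# Levels: a b c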
When something goes unexpectedly wrong in your R script, check that factors aren't to blame. Use options("stringsAsFactors" = FALSE) in your startup file(s) to change this. – Entrenchment
This also applies to the data.frame constructor. Has bit me from behind many times as well. – Revile
Forgetting the drop = FALSE argument when subsetting matrices down to a single dimension, and thereby dropping the object's class as well:
R> X <- matrix(1:4,2)
R> X
[,1] [,2]
[1,] 1 3
[2,] 2 4
R> class(X)
[1] "matrix"
R> X[,1]
[1] 1 2
R> class(X[,1])
[1] "integer"
R> X[,1, drop=FALSE]
[,1]
[1,] 1
[2,] 2
R> class(X[,1, drop=FALSE])
[1] "matrix"
R>
First, let me say that I understand the fundamental problems of representing numbers in a binary system. Nevertheless, one problem that I think could be easily improved is the representation of numbers when the decimal value is beyond R's typical scope of presentation.
x <- 10.2 * 100
x
1020
as.integer(x)
1019
I don't mind the result being represented as an integer when it really can be represented as one. For example, if the value really was 1020 then printing that for x would be fine. But printing something as simple as 1020.0 in this case would have made it more obvious that the value was not an integer and not representable as one. R should default to some kind of indication when there is an extremely small decimal component that isn't displayed.
1.000000000001 would print as 1.; the other alternative would be to print an explicit L after integers, but that would be ugly. – Fractocumulus
It can be annoying to have to allow for combinations of NA, NaN and Inf. They behave differently, and tests for one won't necessarily work for the others:
> x <- c(NA,NaN,Inf)
> is.na(x)
[1] TRUE TRUE FALSE
> is.nan(x)
[1] FALSE TRUE FALSE
> is.infinite(x)
[1] FALSE FALSE TRUE
However the safest way to test any of these trouble-makers is:
> is.finite(x)
[1] FALSE FALSE FALSE
I think of NA as "I don't know (yet)", but my interpretation does not fit with is.infinite(NA) and is.finite(NA) returning FALSE: I had expected NA. – Puree
Always test what happens when you have an NA!
One thing that I always need to pay careful attention to (after many painful experiences) is NA values. R functions are easy to use, but no manner of programming will overcome issues with your data.
For instance, any arithmetic or aggregating vector operation involving an NA is equal to NA. This is "surprising" on the face of it:
> x <- c(1,1,2,NA)
> 1 + NA
[1] NA
> sum(x)
[1] NA
> mean(x)
[1] NA
This gets extrapolated out into other higher-level functions.
In other words, missing values frequently have as much importance as measured values by default. Many functions have na.rm=TRUE/FALSE defaults; it's worth spending some time deciding how to interpret these default settings.
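A quick illustration of the difference na.rm makes, using the vector from above:
x <- c(1, 1, 2, NA)
sum(x)                 # [1] NA
sum(x, na.rm = TRUE)   # [1] 4
mean(x, na.rm = TRUE)  # [1] 1.333333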
Edit 1: Marek makes a great point. NA values can also cause confusing behavior in indexes. For instance:
> TRUE && NA
[1] NA
> FALSE && NA
[1] FALSE
> TRUE || NA
[1] TRUE
> FALSE || NA
[1] NA
This is also true when you're trying to create a conditional expression (for an if statement):
> any(c(TRUE, NA))
[1] TRUE
> any(c(FALSE, NA))
[1] NA
> all(c(TRUE, NA))
[1] NA
When these NA values end up as your vector indexes, many unexpected things can follow. This is all good behavior for R, because it means that you have to be careful with missing values. But it can cause major headaches at the beginning.
(1:3)[c(TRUE,FALSE,NA)] gives 1, NA. It is easy to fall into this trap when you create a logical vector from a vector that contains NAs, e.g. (1:3)[c(1,2,NA) < 2]. – Sallysallyann
Forgetting that strptime() and friends return POSIXlt, for which length() is always nine -- converting to POSIXct helps:
R> length(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S"))
[1] 9
R> length(as.POSIXct(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S")))
[1] 1
R>
The round function rounds halves to the even number ("round half to even"):
> round(3.5)
[1] 4
> round(4.5)
[1] 4
Math on integers is subtly different from doubles (and sometimes complex is weird too).
UPDATE: They fixed some things in R 2.15:
1^NA # 1
1L^NA # NA
(1+0i)^NA # NA
0L %/% 0L # 0L (NA from R 2.15)
0 %/% 0 # NaN
4L %/% 0L # 0L (NA from R 2.15)
4 %/% 0 # Inf
I'm surprised that no one has mentioned this, but T and F can be overridden; TRUE and FALSE can't.
Example:
x <- sample(c(0,1,NA), 100, T)
T <- 0:10
mean(x, na.rm=T)
# Warning in if (na.rm) x <- x[!is.na(x)] :
# the condition has length > 1 and only the first element will be used
# Calls: mean -> mean.default
# [1] NA
plot(rnorm(7), axes=T)
# Warning in if (axes) { :
# the condition has length > 1 and only the first element will be used
# Calls: plot -> plot.default
# Warning in if (frame.plot) localBox(...) :
# the condition has length > 1 and only the first element will be used
# Calls: plot -> plot.default
[edit] Ctrl+F tricked me; Shane mentioned this in his comment.
T <- FALSE ; F <- TRUE inside someone's ~/.Rprofile... – Rosanarosane
Reading in data can be more problematic than you might think. Today I found that if you use read.csv() and a line in the .csv file is blank, read.csv() automatically skips it. This makes sense for most applications, but if you're automatically extracting data from (for example) row 27 of several thousand files, and some of the preceding rows may or may not be blank, things can go horribly wrong if you're not careful.
I now use
data1 <- read.table(file_name, blank.lines.skip = F, sep = ",")
When you're importing data, check that you're doing what you actually think you're doing again and again and again...
The tricky behaviour of the all.equal() function.
One of my recurring errors is comparing sets of floating point numbers. I have a CSV like:
... mu, tau, ...
... 0.5, 1.7, ...
Reading the file and trying to subset the data sometimes works and sometimes fails - of course, due to falling into the pits of the floating point trap again and again. At first the data contains only integer values, then later on it always transforms into real values, you know the story. Comparing should be done with the all.equal() function instead of the == operator, but of course the code I first wrote used the latter approach.
Yeah, cool, but all.equal() returns TRUE for equal numbers, but a textual error message if it fails:
> all.equal(1,1)
[1] TRUE
> all.equal(1:10, 1:5)
[1] "Numeric: lengths (10, 5) differ"
> all.equal(1:10, c(1:5,1:5))
[1] "Mean relative difference: 0.625"
The solution is using the isTRUE() function:
if (!isTRUE(all.equal(x, y, tolerance=doubleErrorRate))) {
...
}
How many times have I had to read the all.equal() description...
This one hurt so much that I spent hours adding comments to a bug-report. I didn't get my wish, but at least the next version of R will generate an error.
R> nchar(factor(letters))
[1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Update: As of R 3.2.0 (probably earlier), this example now generates an error message. As mentioned in the comments below, a factor is NOT a vector and nchar() requires a vector.
R> nchar(factor(letters))
Error in nchar(factor(letters)) : 'nchar()' requires a character vector
R> is.vector(factor(letters))
[1] FALSE
factor(letters) may not be a vector, but it can be treated as such; you can see it as a vector of factors. The first comment is close to what is happening here -- internally, factors are integers (see typeof(factor(letters))), so this output is the same as nchar(1:length(letters)): 1 when you have one digit, 2 for two digits. – Cassandracassandre
accidentally listing the source code of a function by forgetting to include empty parentheses: e.g. "ls" versus "ls()"
true & false don't cut it as pre-defined constants, like in Matlab, C++, Java, Python; you must use TRUE & FALSE
invisible return values: e.g. ".packages()" returns nothing, while "(.packages())" returns a character vector of package base names
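A small sketch of the last point, using a toy function that returns its value invisibly:
f <- function() invisible(42)
f()       # prints nothing
(f())     # [1] 42 -- wrapping in parentheses forces printing
out <- f()
out       # [1] 42 -- the value was returned, just not auto-printed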
For instance, the number 3.14 is a numerical constant, but the expressions +3.14 and -3.14 are calls to the functions + and -:
> class(quote(3.14))
[1] "numeric"
> class(quote(+3.14))
[1] "call"
> class(quote(-3.14))
[1] "call"
See Section 13.2 in John Chambers' book Software for Data Analysis: Programming with R.
Partial matching in the $ operator:
This applies to lists, but also to data.frames:
df1 <- data.frame(foo=1:10, foobar=10:1)
df2 <- data.frame(foobar=10:1)
df1$foo # Correctly gets the foo column
df2$foo # Expect NULL, but this returns the foobar column!!!
# So, should use double bracket instead:
df1[["foo"]]
df2[["foo"]]
The [[ operator also has an exact flag, but it is thankfully TRUE by default.
Partial matching also affects attr:
x1 <- structure(1, foo=1:10, foobar=10:1)
x2 <- structure(2, foobar=10:1)
attr(x1, "foo") # Correctly gets the foo attribute
attr(x2, "foo") # Expect NULL, but this returns the foobar attribute!!!
# So, should use exact=TRUE
attr(x1, "foo", exact=TRUE)
attr(x2, "foo", exact=TRUE)
Automatic repeating of vectors ("recycling") used as indices:
R> all.numbers <- c(1:5)
R> all.numbers
[1] 1 2 3 4 5
R> good.idxs <- c(T,F,T)
R> #note unfortunate length mismatch
R> good.numbers <- all.numbers[good.idxs]
R> good.numbers
[1] 1 3 4
R> #wtf?
R> #why would you repeat the vector used as an index
R> #without even a warning?
Zero-length vectors have some quirks:
R> kk=vector(mode="numeric",length=0)
R> kk
numeric(0)
R> sum(kk)
[1] 0
R> var(kk)
[1] NA
prod(numeric(0)) == 1 too. I'm sure this has been discussed before on the R mailing lists, but it's a good point. – Fractocumulus
Working with lists, there are a couple of unintuitive things:
Of course, the difference between [ and [[ takes some getting used to. For lists, [ returns a list of (potentially 1) elements, whereas [[ returns the element inside the list.
List creation:
# When you're used to this:
x <- numeric(5) # A vector of length 5 with zeroes
# ... this might surprise you
x <- list(5) # A list with a SINGLE element: the value 5
# This is what you have to do instead:
x <- vector('list', 5) # A vector of length 5 with NULLS
So, how to insert NULL into a list?
x <- list("foo", 1:3, letters, LETTERS) # A sample list
x[[2]] <- 1:5 # Put 1:5 in the second element
# The obvious way doesn't work:
x[[2]] <- NULL # This DELETES the second element!
# This doesn't work either:
x[2] <- NULL # This DELETES the second element!
# The solution is NOT very intuitive:
x[2] <- list(NULL) # Put NULL in the second element
# Btw, now that we think we know how to delete an element:
x <- 1:10
x[[2]] <- NULL # Nope, gives an ERROR!
x <- x[-2] # This is the only way for atomic vectors (works for lists too)
Finally some advanced stuff like indexing through a nested list:
x <- list(a=1:3, b=list(c=42, d=13, e="HELLO"), f='bar')
x[[c(2,3)]] # HELLO (first selects the second element and then its third element)
x[c(2,3)] # The second and third elements (b and f)
x[[2]] <- 1:5 puts 1:5 in the second element. And to extend your answer, x[1:2] <- 1:2 puts 1 in the first element and 2 in the second; x[1,2] works for nested lists (second element of the first element). – Sallysallyann
I don't find x[[2]] <- 1:5 and x[1:2] <- 1:2 that surprising. x[1,2] should be x[[c(1,2)]], and I updated the answer. Thanks! – Semblance
x[2] <- 1:5 gives me a warning and puts 1 in the second element of x. And I was wrong in my comment: I had in mind the difference between x[c(1,2)] (returns the 1st and 2nd elements) and x[[c(1,2)]] (returns the 2nd element of the 1st element). – Sallysallyann
One of the big confusions in R is that [i, drop = TRUE] does drop factor levels, but [i, j, drop = TRUE] does not!
> df = data.frame(a = c("europe", "asia", "oceania"), b = c(1, 2, 3))
> df$a[1:2, drop = TRUE]
[1] europe asia
Levels: asia europe <---- drops factor levels, works fine
> df[1:2,, drop = TRUE]$a
[1] europe asia
Levels: asia europe oceania <---- does not drop factor levels!
For more info see: drop = TRUE doesn't drop factor levels in data.frame while in vector it does
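The usual fix (a sketch, assuming df$a is a factor as in the example output above) is droplevels(), which drops unused levels from every factor column:
df2 <- droplevels(df[1:2, ])
levels(df2$a)
# [1] "asia"   "europe"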
Coming from compiled languages and Matlab, I've occasionally gotten confused about a fundamental aspect of functions in functional languages: they have to be defined before they're used! It's not enough for them just to be parsed by the R interpreter. This mostly rears its head when you use nested functions.
In Matlab you can do:
function f1()
v1 = 1;
v2 = f2();
fprintf('2 == %d\n', v2);
function r1 = f2()
r1 = v1 + 1 % nested function scope
end
end
If you try to do the same thing in R, you have to put the nested function first, or you get an error! Just because you've defined the function, it's not in the namespace until it's assigned to a variable! On the other hand, the function can refer to a variable that has not been defined yet.
f1 <- function() {
f2 <- function() {
v1 + 1
}
v1 <- 1
v2 = f2()
print(sprintf("2 == %d", v2))
}
If you change v1+1 to f3() in your example, and then define an f3 function before f2 gets called, it still works fine. – Czech
Mine from today: qnorm() takes probabilities and pnorm() takes quantiles.
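In other words, the two are inverses of each other; a quick sketch:
qnorm(0.975)     # [1] 1.959964 -- probability in, quantile out
pnorm(1.959964)  # [1] 0.975    -- quantile in, probability out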
For me it is the counter-intuitive way in which, when you export a data.frame to a text file using write.csv, you then need to add an additional argument when importing it afterwards to get exactly the same data.frame, like this:
write.csv(m, file = 'm.csv')
read.csv('m.csv', row.names = 1) # Note the row.names argument
I also posted this as a question on SO, and @BenBolker suggested it as an answer to this Q.
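An alternative sketch, if you don't actually need the row names, is to skip writing them in the first place:
write.csv(m, file = 'm.csv', row.names = FALSE)
read.csv('m.csv')  # no extra row-name column to deal with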
The apply set of functions does not only work on matrices, but scales up to multi-dimensional arrays. In my research I often have a dataset of, for example, the temperature of the atmosphere. This is stored in a multi-dimensional array with dimensions x,y,level,time, from now on called multi_dim_array. A mockup example would be:
multi_dim_array = array(runif(96 * 48 * 6 * 100, -50, 50),
dim = c(96, 48, 6, 100))
> str(multi_dim_array)
# x y lev time
num [1:96, 1:48, 1:6, 1:100] 42.4 16 32.3 49.5 24.9 ...
Using apply one can easily get:
# temporal mean value
> str(apply(multi_dim_array, 4, mean))
num [1:100] -0.0113 -0.0329 -0.3424 -0.3595 -0.0801 ...
# temporal mean value per gridcell (x,y location)
> str(apply(multi_dim_array, c(1,2), mean))
num [1:96, 1:48] -1.506 0.4553 -1.7951 0.0703 0.2915 ...
# temporal mean value per gridcell and level (x,y location, level)
> str(apply(multi_dim_array, c(1,2,3), mean))
num [1:96, 1:48, 1:6] -3.839 -3.672 0.131 -1.024 -2.143 ...
# Spatial mean per level
> str(apply(multi_dim_array, c(3,4), mean))
num [1:6, 1:100] -0.4436 -0.3026 -0.3158 0.0902 0.2438 ...
This makes the MARGIN argument to apply seem much less counter-intuitive. I first thought, why not use "row" and "col" instead of 1 and 2? But the fact that it also works for arrays with more dimensions makes it clear why using MARGIN like this is preferred.
which.min and which.max work opposite to expectations when used with a comparison operator, and can even give incorrect answers. So, for example, trying to figure out which element in a vector of sorted numbers is the largest number that is less than a threshold (i.e. in a sequence from 100 to 200, which is the largest number that is less than 110):
set.seed(420)
x = seq(100, 200)
which(x < 110)
> [1] 1 2 3 4 5 6 7 8 9 10
which.max(x < 110)
> [1] 1
which.min(x < 110)
> [1] 11
x[11]
> [1] 110
max(which(x < 110))
>[1] 10
x[10]
> [1] 109
The dirtiest gotcha, and one that can be really hard to find! Cutting multi-line expressions like this one:
K <- hyperpar$intcept.sigma2
+ cov.NN.additive(x1$env, x2 = NULL, sigma2_int = hyperpar$env.sigma2_int, sigma2_slope = hyperpar$env.sigma2_slope)
+ hyperpar$env.sigma2 * K.cache$k.env
R will only evaluate the first line, and the other two will just go to waste! And it will not give any warning, nothing! This is pretty nasty treachery on the unsuspecting user. It must actually be written like this:
K <- hyperpar$intcept.sigma2 +
cov.NN.additive(x1$env, x2 = NULL, sigma2_int = hyperpar$env.sigma2_int, sigma2_slope = hyperpar$env.sigma2_slope) +
hyperpar$env.sigma2 * K.cache$k.env
which is not quite a natural way of writing.
This one!
all(c(1,2,3,4) == NULL)
[1] TRUE
I had this check in my code; I really need both tables to have the same column names:
stopifnot(all(names(x$x$env) == names(x$obsx$env)))
But the check passed (evaluated to TRUE) when x$x$env
didn't even exist!
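What is going on under the hood: comparing a vector against NULL yields a zero-length logical vector, and all() of an empty vector is vacuously TRUE. A safer check (one possible sketch) compares the names with identical() instead:
c(1,2,3,4) == NULL  # logical(0)
all(logical(0))     # [1] TRUE
# identical() is FALSE when one side is NULL and the other is not:
stopifnot(identical(names(x$x$env), names(x$obsx$env)))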
You can use options(warn = 2), which, according to the manual:
If warn is two or larger all warnings are turned into errors.
Indeed, the warnings are turned into errors, but, gotcha! The code still continues running after such errors!!!
source("script.R")
# ...
# Loading required package: bayesmeta
# Failed with error: ‘(converted from warning) there is no package called ‘bayesmeta’’
# computing posterior (co)variances ...
# (script continues running)
...
PS: but some other errors converted from warning do stop the script... so I don't know, I am confused. This one did stop the script:
Error in optimise(psiline, c(0, 2), adiff, a, as.matrix(K), y, d0, mn, :
(converted from warning) NA/Inf replaced by maximum positive value
Error: unexpected 'else' in "else" will pop up when you put a newline after the curly brace in an if statement: if { ... } \n else { ... }. – Alevin
The choose function: choose(n, k) isn't the number of k-element subsets of an n-element set. For example, choose(-4, 2) == 10. – Walkout
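The reason, as far as I understand it, is that choose() uses the generalized binomial coefficient, which is also defined for negative n:
choose(-4, 2)  # [1] 10, i.e. (-4 * -5) / 2
choose(4, 2)   # [1] 6, the usual "number of 2-element subsets of a 4-element set"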