Subsetting R data frame results in mysterious NA rows
Asked Answered
F

7

83

I've been encountering what I think is a bug. It's not a big deal, but I'm curious if anyone else has seen this. Unfortunately, my data is confidential, so I have to make up an example, and it's not going to be very helpful.

When subsetting my data, I occassionally get mysterious NA rows that aren't in my original data frame. Even the rownames are NA. EG:

example <- data.frame("var1"=c("A", "B", "A"), "var2"=c("X", "Y", "Z"))
example

  var1 var2
1    A    X
2    B    Y
3    A    Z

then I run:

example[example$var1=="A",]

  var1 var2
1    A    X
3    A    Z
NA<NA> <NA>

Of course, the example above does not actually give you this mysterious NA row; I am adding it here to illustrate the problem I'm having with my data.

Maybe it has to do with the fact that I'm importing my original data set using Google's read.xlsx package and then executing wide to long reshape before subsetting.

Thanks

Fusee answered 10/1, 2013 at 15:42 Comment(7)
While it's impossible to be sure without seeing your data, the problem is almost certainly that some of your indices are greater than the number of rows are in the data. For example, try example[c(1, 2, 4),] or example[c(TRUE, TRUE, FALSE, TRUE),] using your data frame above. Check the length (if it's boolean) and the maximum (if it's numeric) of the vector you are using to subset the rows.Ezarras
...and/or some of your indices are NA themselves.Perfidious
As David said, we need to know more... but looking at str(yourdata) and summary(yourdata) will help you out a lot. I have a feeling you have at least one NA in your var column. Test it: example <- data.frame("var1"=c("A", "B", "A", NA), "var2"=c("Q", "X", "Y", "Z")); example[example$var=='A',]Matter
If your code is analogous to this example (of the form d[d$v == x, ], your problem is indeed almost certainly NA`s in your column.Ezarras
Answered! I have NAs in the index column. I can't believe I've never come across this before. It's funny to me that R "censors" the data in other columns with NAs (even the row name!) when you hit an NA in your index column. I'm new to posting on StackOverflow so it will take me a minute to figure out how to designate this question answered.Fusee
Well I can't figure out how to designate this "answered", so maybe I made that part up. Thanks for the helpFusee
@Fusee enter an answer yourself.Midi
J
82

Wrap the condition in which:

df[which(df$number1 < df$number2), ]

How it works:

It returns the row numbers where the condition matches (where the condition is TRUE) and subsets the data frame on those rows accordingly.

Say that:

which(df$number1 < df$number2)

returns row numbers 1, 2, 3, 4 and 5.

As such, writing:

df[which(df$number1 < df$number2), ]

is the same as writing:

df[c(1, 2, 3, 4, 5), ]

Or an even simpler version is:

df[1:5, ]
Juanajuanita answered 20/5, 2016 at 16:17 Comment(0)
F
39

I see this was already answered by the OP, but since his comment is buried deep within the comment section, here's my attempt to fix this issue (at least with my data, which was behaving the same way).

First of all, some sample data:

> df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))
> df
   name number1 number2
1     A       1      10
2     B       2       9
3     C       3       8
4     D       4       7
5     E       5       6
6     F       6       5
7     G       7       4
8     H       8       3
9     I       9      NA
10    J      10      NA

Now for a simple filter:

> df[df$number1 < df$number2, ]
     name number1 number2
1       A       1      10
2       B       2       9
3       C       3       8
4       D       4       7
5       E       5       6
NA   <NA>      NA      NA
NA.1 <NA>      NA      NA

The problem here is that the presence of NAs in the third column causes R to rewrite the whole row as NA. Nonetheless, the data frame dimensions are maintained. Here's my fix, which requires knowledge of which column contains the NAs:

> df[df$number1 < df$number2 & !is.na(df$number2), ]
  name number1 number2
1    A       1      10
2    B       2       9
3    C       3       8
4    D       4       7
5    E       5       6
Footer answered 15/12, 2014 at 18:56 Comment(5)
This is how I've always dealt with this issue, but is there a way to combine the !is.na and < into one command?Alcides
@Nova, I don't think so, since they are two distinct logical tests. I'd love to be proven wrong, though.Footer
Answered above, the which() function may fit that role but it's less than satisfactory. I strongly believe this to be a bug imho and it's unfortunate that this "feature" (NA selection craziness) won't be fixed.Jugate
This is so helpful for understanding why this happens to me all the time. I agree with others that this is a bug. Hopefully someone on the R Core Team also agrees.Stenosis
@colin, I am not so sure it's a bug, nowadays I'd simply call it a result of the design philosophy behind R of not discarding NA values by default. Instead, what R usually does is "oh, there's an NA on this vector, so I'm just going to have the whole thing show as NA because I don't know what the value of NA stands for and how it affects the rest of the vector). Take for instance mean(c(1, 3, NA)). R will print NA because it doesn't know what the third value is, so it can't really tell you what the mean is. If the user wants to drop the NA, they have to explicitly set na.rm=TRUE.Footer
I
14

I get the same problem when using code similar to what you posted. Using the function subset()

subset(example,example$var1=="A")

the NA row instead gets excluded.

Icelandic answered 7/5, 2014 at 13:58 Comment(1)
This is helpful, but please beware of the potential problems of using subset anywhere other than in an interactive R session. From the function's help page: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences."Footer
E
6

Using dplyr:

library(dplyr)
filter(df, number1 < number2)
Ebracteate answered 28/2, 2017 at 22:42 Comment(1)
Indeed that library doesn't suffer from that NA affliction.Jugate
K
5

I find using %in$ instead of == can solve this issue although I am still wondering why. For example, instead of: df[df$num == 1,] use: df[df$num %in% c(1),] will work.

Krakow answered 5/5, 2020 at 7:42 Comment(1)
2020, working in R 3.6.3 and using df[df$col1 %in% c("Whatever"), ] bring me a table with no empty NA indexed rows. Whereas using the equal symbol like this: df[df$col1 == "Whatever", ] bring back the initial problem. A filtered table with empty rows indexed with NA.Ague
B
1
   > example <- data.frame("var1"=c("A", NA, "A"), "var2"=c("X", "Y", "Z"))
    > example
      var1 var2
    1    A    X
    2 <NA>    Y
    3    A    Z
    > example[example$var1=="A",]
       var1 var2
    1     A    X
    NA <NA> <NA>
    3     A    Z

Probably this must be your result u are expecting...Try this try using which condition before condition to avoid NA's

  example[which(example$var1=="A"),]
      var1 var2
    1    A    X
    3    A    Z
Burnard answered 12/10, 2016 at 9:54 Comment(0)
K
0

Another cause may be that you get the condition wrong, such as checking if a factor column is equal to a value that is not among its levels. Troubled me for a while.

Kv answered 20/9, 2016 at 11:10 Comment(1)
Dear downvoters, please explain the reason for downvoting, thanks!Calcaneus

© 2022 - 2024 — McMap. All rights reserved.