Subsetting R data frame results in mysterious NA rows

Asked 10/1, 2013 at 15:42 Answered 5/5, 2020 at 7:42

I've been encountering what I think is a bug. It's not a big deal, but I'm curious if anyone else has seen this. Unfortunately, my data is confidential, so I have to make up an example, and it's not going to be very helpful.

When subsetting my data, I occassionally get mysterious NA rows that aren't in my original data frame. Even the rownames are NA. EG:

example <- data.frame("var1"=c("A", "B", "A"), "var2"=c("X", "Y", "Z"))
example

  var1 var2
1    A    X
2    B    Y
3    A    Z

then I run:

example[example$var1=="A",]

  var1 var2
1    A    X
3    A    Z
NA<NA> <NA>

Of course, the example above does not actually give you this mysterious NA row; I am adding it here to illustrate the problem I'm having with my data.

Maybe it has to do with the fact that I'm importing my original data set using Google's read.xlsx package and then executing wide to long reshape before subsetting.

Thanks

Fusee answered 10/1, 2013 at 15:42 Comment(7)

While it's impossible to be sure without seeing your data, the problem is almost certainly that some of your indices are greater than the number of rows are in the data. For example, try example[c(1, 2, 4),] or example[c(TRUE, TRUE, FALSE, TRUE),] using your data frame above. Check the length (if it's boolean) and the maximum (if it's numeric) of the vector you are using to subset the rows. – Ezarras 10/1, 2013 at 15:47

...and/or some of your indices are NA themselves. – Perfidious 10/1, 2013 at 15:48

As David said, we need to know more... but looking at str(yourdata) and summary(yourdata) will help you out a lot. I have a feeling you have at least one NA in your var column. Test it: example <- data.frame("var1"=c("A", "B", "A", NA), "var2"=c("Q", "X", "Y", "Z")); example[example$var=='A',] – Matter 10/1, 2013 at 15:48

If your code is analogous to this example (of the form d[d$v == x, ], your problem is indeed almost certainly NA`s in your column. – Ezarras 10/1, 2013 at 15:53

Answered! I have NAs in the index column. I can't believe I've never come across this before. It's funny to me that R "censors" the data in other columns with NAs (even the row name!) when you hit an NA in your index column. I'm new to posting on StackOverflow so it will take me a minute to figure out how to designate this question answered. – Fusee 10/1, 2013 at 17:17

Well I can't figure out how to designate this "answered", so maybe I made that part up. Thanks for the help – Fusee 10/1, 2013 at 17:25

@Fusee enter an answer yourself. – Midi 10/1, 2013 at 17:30

Wrap the condition in which:

df[which(df$number1 < df$number2), ]

How it works:

It returns the row numbers where the condition matches (where the condition is TRUE) and subsets the data frame on those rows accordingly.

Say that:

which(df$number1 < df$number2)

returns row numbers 1, 2, 3, 4 and 5.

As such, writing:

df[which(df$number1 < df$number2), ]

is the same as writing:

df[c(1, 2, 3, 4, 5), ]

Or an even simpler version is:

df[1:5, ]

Juanajuanita answered 20/5, 2016 at 16:17 Comment(0)

I see this was already answered by the OP, but since his comment is buried deep within the comment section, here's my attempt to fix this issue (at least with my data, which was behaving the same way).

First of all, some sample data:

> df <- data.frame(name = LETTERS[1:10], number1 = 1:10, number2 = c(10:3, NA, NA))
> df
   name number1 number2
1     A       1      10
2     B       2       9
3     C       3       8
4     D       4       7
5     E       5       6
6     F       6       5
7     G       7       4
8     H       8       3
9     I       9      NA
10    J      10      NA

Now for a simple filter:

> df[df$number1 < df$number2, ]
     name number1 number2
1       A       1      10
2       B       2       9
3       C       3       8
4       D       4       7
5       E       5       6
NA   <NA>      NA      NA
NA.1 <NA>      NA      NA

The problem here is that the presence of NAs in the third column causes R to rewrite the whole row as NA. Nonetheless, the data frame dimensions are maintained. Here's my fix, which requires knowledge of which column contains the NAs:

> df[df$number1 < df$number2 & !is.na(df$number2), ]
  name number1 number2
1    A       1      10
2    B       2       9
3    C       3       8
4    D       4       7
5    E       5       6

Footer answered 15/12, 2014 at 18:56 Comment(5)

This is how I've always dealt with this issue, but is there a way to combine the !is.na and < into one command? – Alcides 4/3, 2016 at 18:46

@Nova, I don't think so, since they are two distinct logical tests. I'd love to be proven wrong, though. – Footer 7/3, 2016 at 9:12

Answered above, the which() function may fit that role but it's less than satisfactory. I strongly believe this to be a bug imho and it's unfortunate that this "feature" (NA selection craziness) won't be fixed. – Jugate 31/8, 2017 at 11:31

This is so helpful for understanding why this happens to me all the time. I agree with others that this is a bug. Hopefully someone on the R Core Team also agrees. – Stenosis 5/11, 2020 at 9:35

@colin, I am not so sure it's a bug, nowadays I'd simply call it a result of the design philosophy behind R of not discarding NA values by default. Instead, what R usually does is "oh, there's an NA on this vector, so I'm just going to have the whole thing show as NA because I don't know what the value of NA stands for and how it affects the rest of the vector). Take for instance mean(c(1, 3, NA)). R will print NA because it doesn't know what the third value is, so it can't really tell you what the mean is. If the user wants to drop the NA, they have to explicitly set na.rm=TRUE. – Footer 5/11, 2020 at 14:4

I get the same problem when using code similar to what you posted. Using the function subset()

subset(example,example$var1=="A")

the NA row instead gets excluded.

Icelandic answered 7/5, 2014 at 13:58 Comment(1)

This is helpful, but please beware of the potential problems of using subset anywhere other than in an interactive R session. From the function's help page: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." – Footer 15/12, 2014 at 18:58

Using dplyr:

library(dplyr)
filter(df, number1 < number2)

Ebracteate answered 28/2, 2017 at 22:42 Comment(1)

Indeed that library doesn't suffer from that NA affliction. – Jugate 31/8, 2017 at 11:24

I find using %in$ instead of == can solve this issue although I am still wondering why. For example, instead of: df[df$num == 1,] use: df[df$num %in% c(1),] will work.

Krakow answered 5/5, 2020 at 7:42 Comment(1)

2020, working in R 3.6.3 and using df[df$col1 %in% c("Whatever"), ] bring me a table with no empty NA indexed rows. Whereas using the equal symbol like this: df[df$col1 == "Whatever", ] bring back the initial problem. A filtered table with empty rows indexed with NA. – Ague 24/11, 2020 at 17:27

   > example <- data.frame("var1"=c("A", NA, "A"), "var2"=c("X", "Y", "Z"))
    > example
      var1 var2
    1    A    X
    2 <NA>    Y
    3    A    Z
    > example[example$var1=="A",]
       var1 var2
    1     A    X
    NA <NA> <NA>
    3     A    Z

Probably this must be your result u are expecting...Try this try using which condition before condition to avoid NA's

  example[which(example$var1=="A"),]
      var1 var2
    1    A    X
    3    A    Z

Burnard answered 12/10, 2016 at 9:54 Comment(0)

Another cause may be that you get the condition wrong, such as checking if a factor column is equal to a value that is not among its levels. Troubled me for a while.

Kv answered 20/9, 2016 at 11:10 Comment(1)

Dear downvoters, please explain the reason for downvoting, thanks! – Calcaneus 8/3, 2017 at 15:46

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags