R - Keep first observation per group identified by multiple variables (Stata equivalent "bys var1 var2 : keep if _n == 1")

M

3

14

So I currently face a problem in R that I exactly know how to deal with in Stata, but have wasted over two hours to accomplish in R.

Using the data.frame below, the result I want is to obtain exactly the first observation per group, while groups are formed by multiple variables and have to be sorted by another variable, i.e. the data.frame mydata obtained by:

id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)

Should be transformed to:

   id day value   
   1   1    10 
   1   2    15 
   1   3    20 
   2   1    40 
   2   2    30 
   3   2    22 
   3   3    24 
   4   1    11 
   4   2    11 
   4   3    12

By keeping only one of the rows with one or multiple duplicate group-identificators (here that is only row[1]: (id,day)=(1,1)), sorting for value first (so that the row with the lowest value is kept).

In Stata, this would simply be:

bys id day (value): keep if _n == 1

I found a piece of code on the web, which properly does that if I first produce a single group identifier :

mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="")  ### the single group identifier

myid.uni <- unique(mydata$id1)
a<-length(myid.uni)

last <- c()

for (i in 1:a) {
  temp<-subset(mydata, id1==myid.uni[i])
  if (dim(temp)[1] > 1) {
    last.temp<-temp[dim(temp)[1],]
  }
  else {
    last.temp<-temp
  }
  last<-rbind(last, last.temp)
}

last

However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.

Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?

Myramyrah answered 4/2, 2014 at 12:58 Comment(4)

I am a Stata person, but I would not reckon to charm R users here by implying that R is lousy, even for one detailed very specific task. Better to imply that there must be a neat R way to do it, just that you have not identified it. – Dualistic 4/2, 2014 at 13:42

@NickCox, I didn't mean to be insulting towards R, if that is what it sounds like. I've been in the process of migrating from Stata for the last two weeks and am still struggling to get the basic functionality right. – Myramyrah 4/2, 2014 at 13:47

For starters with your migration, while it is possible to use for loops effectively in R, if it's your first resort, it might not be the most effective way to go. – Abana 4/2, 2014 at 15:23

@Myramyrah No doubt; I am just advising on how to influence people and encourage them to answer your question. – Dualistic 4/2, 2014 at 15:43

A

7

I would order the data.frame at which point you can look into using by:

mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]

do.call(rbind, by(mydata, list(mydata$id, mydata$day), 
                  FUN=function(x) head(x, 1)))

Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:

library(data.table)

DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
#     id day value
#  1:  1   1    10
#  2:  1   2    15
#  3:  1   3    20
#  4:  2   1    40
#  5:  2   2    30
#  6:  3   2    22
#  7:  3   3    24
#  8:  4   1    11
#  9:  4   2    11
# 10:  4   3    12

Or, starting from scratch, you can use data.table in the following way:

DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]

And, by extension, in base R:

Ranks <- with(mydata, ave(value, id, day, FUN = function(x) 
  rank(x, ties.method="first")))
mydata[Ranks == 1, ]

Abana answered 4/2, 2014 at 13:13 Comment(7)

The alternative using by: does the job, but also takes as long as the piece of code I originally posted. I'll have to have the IT-guys install data.table to look at that, I'll get back to you if that works better later. – Myramyrah 4/2, 2014 at 13:23

@iraserd, I also find it hard to believe that any of these approaches would take 1.5 hours. Can you add more information about your data? – Abana 4/2, 2014 at 13:27

Well, I aborted the for loop approach I originally posted after about 15 minutes and by then it had processed about 12k of 80k observations. – Myramyrah 4/2, 2014 at 13:34

@Anada Mahto, alright, so for a small subsample of my dataset, all 3 of your alternatives work perfectly fine. However, on the full dataset (about 500k observations), by() and DT[] seem to take significantly longer than ave(). Ave() just finished after about 5 min, whereas I aborted by() and DT[] after about 10 minutes each. Ave() thus was the right answer for me, thank you! – Myramyrah 4/2, 2014 at 14:5

@iraserd, I find that surprising. The "data.table" package should be very efficient. How many unique combinations of "id" variables do you have? In my test of a 500k data.frame, I'm getting ~ 1 minute for the "data.table" approach, and two minutes for ave. – Abana 4/2, 2014 at 15:58

my data.frame has ~590000 obs of 75 variables, where each id group should have one observation per day, for 4 days total. But there are about 536000 obs. of ids with 4 days, about 50000 ids with 8 obs on these 4 days, and a few hundred more with up to a maximum of 140 obs per id. With test$idgroup <- tapply(test$id, test$id, function(x) seq(1,length(x),1)) to number and sort the observations per id, I use your ave() block to keep only the first observation (sorting by idgroup) per id and day. Here the Ranks <- ... line takes 1 min 15 to run. – Myramyrah 5/2, 2014 at 14:58

using your first DT[] approach instead, I just aborted after 10 minutes. I cannot use your second DT[] one, because it returns an Error: some columns are not in the data.table: id, time. Same for the by() approach: I aborted after 15 minutes just because it is taking much too long already and the dataset will get much larger over time. – Myramyrah 5/2, 2014 at 14:58

P

15

The package dplyr makes this kind of things easier.

library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)

Note that this command requires more memory in R than in Stata: in R, a new copy of the dataset is created while in Stata, rows are deleted in place.

Prompt answered 22/4, 2015 at 12:49 Comment(2)

What is that %>% ? Results in Error: could not find function "%>%" with library(dplyr) loaded. Is it because I still run 3.0.2? – Myramyrah 3/6, 2015 at 12:17

Probably. It puts the object on the left as the first argument of the function to the right. the first line is equivalent to slice(arrange(group_by(mydata, id, day), value), 1) – Prompt 3/6, 2015 at 13:7

A

7

I would order the data.frame at which point you can look into using by:

mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]

do.call(rbind, by(mydata, list(mydata$id, mydata$day), 
                  FUN=function(x) head(x, 1)))

Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:

library(data.table)

DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
#     id day value
#  1:  1   1    10
#  2:  1   2    15
#  3:  1   3    20
#  4:  2   1    40
#  5:  2   2    30
#  6:  3   2    22
#  7:  3   3    24
#  8:  4   1    11
#  9:  4   2    11
# 10:  4   3    12

Or, starting from scratch, you can use data.table in the following way:

DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]

And, by extension, in base R:

Ranks <- with(mydata, ave(value, id, day, FUN = function(x) 
  rank(x, ties.method="first")))
mydata[Ranks == 1, ]

Abana answered 4/2, 2014 at 13:13 Comment(7)

The alternative using by: does the job, but also takes as long as the piece of code I originally posted. I'll have to have the IT-guys install data.table to look at that, I'll get back to you if that works better later. – Myramyrah 4/2, 2014 at 13:23

@iraserd, I also find it hard to believe that any of these approaches would take 1.5 hours. Can you add more information about your data? – Abana 4/2, 2014 at 13:27

Well, I aborted the for loop approach I originally posted after about 15 minutes and by then it had processed about 12k of 80k observations. – Myramyrah 4/2, 2014 at 13:34

@Anada Mahto, alright, so for a small subsample of my dataset, all 3 of your alternatives work perfectly fine. However, on the full dataset (about 500k observations), by() and DT[] seem to take significantly longer than ave(). Ave() just finished after about 5 min, whereas I aborted by() and DT[] after about 10 minutes each. Ave() thus was the right answer for me, thank you! – Myramyrah 4/2, 2014 at 14:5

@iraserd, I find that surprising. The "data.table" package should be very efficient. How many unique combinations of "id" variables do you have? In my test of a 500k data.frame, I'm getting ~ 1 minute for the "data.table" approach, and two minutes for ave. – Abana 4/2, 2014 at 15:58

my data.frame has ~590000 obs of 75 variables, where each id group should have one observation per day, for 4 days total. But there are about 536000 obs. of ids with 4 days, about 50000 ids with 8 obs on these 4 days, and a few hundred more with up to a maximum of 140 obs per id. With test$idgroup <- tapply(test$id, test$id, function(x) seq(1,length(x),1)) to number and sort the observations per id, I use your ave() block to keep only the first observation (sorting by idgroup) per id and day. Here the Ranks <- ... line takes 1 min 15 to run. – Myramyrah 5/2, 2014 at 14:58

using your first DT[] approach instead, I just aborted after 10 minutes. I cannot use your second DT[] one, because it returns an Error: some columns are not in the data.table: id, time. Same for the by() approach: I aborted after 15 minutes just because it is taking much too long already and the dataset will get much larger over time. – Myramyrah 5/2, 2014 at 14:58

I

4

Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:

library(data.table)
mydata <- data.table(my.data)
mydata <- mydata[, .SD[1], by = .(id, day)]

Using dplyr with magrittr pipes:

library(dplyr)
mydata <- mydata %>%
  group_by(id, day) %>%
  slice(1) %>%
  ungroup()

If you don't add ungroup() to the end dplyr's grouping structure will still be present and might mess up some of your subsequent functions.

Iconography answered 14/6, 2019 at 7:32 Comment(0)

Recommended topics

Hot tags