Select the first and last row by group in a data frame
Asked Answered
F

6

59

How can I select the first and last row for each unique id in the following dataframe?

tmp <- structure(list(id = c(15L, 15L, 15L, 15L, 21L, 21L, 22L, 22L, 
22L, 23L, 23L, 23L, 24L, 24L, 24L, 24L), d = c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), gr = c(2L, 1L, 
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L), mm = c(3.4, 
4.9, 4.4, 5.5, 4, 3.8, 4, 4.9, 4.6, 2.7, 4, 3, 3, 2, 4, 2), area = c(1L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L)), .Names = c("id", 
"d", "gr", "mm", "area"), class = "data.frame", row.names = c(NA, 
-16L))
tmp
#>    id d gr  mm area
#> 1  15 1  2 3.4    1
#> 2  15 1  1 4.9    2
#> 3  15 1  1 4.4    1
#> 4  15 1  1 5.5    2
#> 5  21 1  1 4.0    2
#> 6  21 1  2 3.8    2
#> 7  22 1  1 4.0    2
#> 8  22 1  1 4.9    2
#> 9  22 1  2 4.6    2
#> 10 23 1  1 2.7    2
#> 11 23 1  1 4.0    2
#> 12 23 1  2 3.0    2
#> 13 24 1  1 3.0    2
#> 14 24 1  1 2.0    3
#> 15 24 1  1 4.0    2
#> 16 24 1  2 2.0    3
Flan answered 20/11, 2011 at 18:52 Comment(0)
D
47

A plyr solution (tmp is your data frame):

library("plyr")
ddply(tmp, .(id), function(x) x[c(1, nrow(x)), ])
#    id d gr  mm area
# 1  15 1  2 3.4    1
# 2  15 1  1 5.5    2
# 3  21 1  1 4.0    2
# 4  21 1  2 3.8    2
# 5  22 1  1 4.0    2
# 6  22 1  2 4.6    2
# 7  23 1  1 2.7    2
# 8  23 1  2 3.0    2
# 9  24 1  1 3.0    2
# 10 24 1  2 2.0    3

Or with dplyr (see also here):

library("dplyr")
tmp %>%
group_by(id) %>%
slice(c(1, n())) %>%
ungroup()
# # A tibble: 10 × 5
#       id     d    gr    mm  area
#    <int> <int> <int> <dbl> <int>
# 1     15     1     2   3.4     1
# 2     15     1     1   5.5     2
# 3     21     1     1   4.0     2
# 4     21     1     2   3.8     2
# 5     22     1     1   4.0     2
# 6     22     1     2   4.6     2
# 7     23     1     1   2.7     2
# 8     23     1     2   3.0     2
# 9     24     1     1   3.0     2
# 10    24     1     2   2.0     3
Dedradedric answered 20/11, 2011 at 19:4 Comment(0)
D
124

A fast and short data.table solution :

tmp[, .SD[c(1,.N)], by=id]

where .SD represents each (S)ubset of (D)ata, .N is the number of rows in each group and tmp is a data.table; e.g. as provided by fread() by default or by converting a data.frame using setDT().

Note that if a group only contains one row, that row will appear twice in the output because that row is both the first and last row of that group. To avoid the repetition in that case, thanks to @Thell:

tmp[, .SD[unique(c(1,.N))], by=id]

Alternatively, the following makes the logic explicit for the .N==1 special case :

tmp[, if (.N==1) .SD else .SD[c(1,.N)], by=id]

You don't need .SD[1] in the first part of the if because in that case .N is 1 so .SD must be just one row anyway.

You can wrap j in {} and have a whole page of code inside {} if you like. Just as long as the last expression inside {} returns a list- like object to be stacked (such as a plain list, data.table or data.frame).

tmp[, { ...; if (.N==1) .SD else .SD[c(1,.N)] } , by=id]
Dictaphone answered 21/11, 2011 at 13:42 Comment(7)
.SD[unique(c(1,.N))] for when a group has a single member.E
borrowing from data camp setkey(tmp,id) and tmp[, .SD[c(1,.N)], .EACHI]Diaphoresis
@Diaphoresis I have to admit I didn't know setkey(tmp,id); tmp[, .SD[c(1,.N)], .EACHI] would work without any i present. Where exactly is it on DataCamp? Thanks.Dictaphone
@MattDowle it can be found on Chapter 3 > exercise: Selecting groups or parts of groups: Use by = .EACHI and .SD to select the first and last row of the "b" and "c" groups. and in the exercise DT[c("b","c"), .SD[c(1,.N)], by= .EACHI] # The first row of the "b" and "c" groups great course btwQuadruplet
@Quadruplet But there i is present. Is it on DataCamp course without i present?Dictaphone
Another method would be tmp[ tmp[, .I[c(1, .N)], by = id]$V1 ] , and you can also use .I[unique(c(1, .N))] for duplicatesHeadset
@Headset Using .I for this is getting a bit complicated and for large groups .I would be constructed wastefully only to use the first and last from it.Dictaphone
D
47

A plyr solution (tmp is your data frame):

library("plyr")
ddply(tmp, .(id), function(x) x[c(1, nrow(x)), ])
#    id d gr  mm area
# 1  15 1  2 3.4    1
# 2  15 1  1 5.5    2
# 3  21 1  1 4.0    2
# 4  21 1  2 3.8    2
# 5  22 1  1 4.0    2
# 6  22 1  2 4.6    2
# 7  23 1  1 2.7    2
# 8  23 1  2 3.0    2
# 9  24 1  1 3.0    2
# 10 24 1  2 2.0    3

Or with dplyr (see also here):

library("dplyr")
tmp %>%
group_by(id) %>%
slice(c(1, n())) %>%
ungroup()
# # A tibble: 10 × 5
#       id     d    gr    mm  area
#    <int> <int> <int> <dbl> <int>
# 1     15     1     2   3.4     1
# 2     15     1     1   5.5     2
# 3     21     1     1   4.0     2
# 4     21     1     2   3.8     2
# 5     22     1     1   4.0     2
# 6     22     1     2   4.6     2
# 7     23     1     1   2.7     2
# 8     23     1     2   3.0     2
# 9     24     1     1   3.0     2
# 10    24     1     2   2.0     3
Dedradedric answered 20/11, 2011 at 19:4 Comment(0)
R
7

Here is a solution in base R. If there are multiple groups with the same id this code returns the first and last row for each of those individual groups.

EDIT: January 12, 2017

This solution might be a little more intuitive than my other answer farther below:

lmy.df = read.table(text = '
     id    d    gr     mm  area
     15    1     2   3.40     1
     15    1     1   4.90     2
     15    1     1   4.40     1
     15    1     1   5.50     2
     21    1     1   4.00     2
     21    1     2   3.80     2
     22    1     1   4.00     2
     23    1     1   2.70     2
     23    1     1   4.00     2
     23    1     2   3.00     2
     24    1     1   3.00     2
     24    1     1   2.00     3
     24    1     1   4.00     2
     24    1     2   2.00     3
', header = TRUE)

head <- aggregate(lmy.df, by=list(lmy.df$id), FUN = function(x) { first = head(x,1) } )
tail <- aggregate(lmy.df, by=list(lmy.df$id), FUN = function(x) {  last = tail(x,1) } )
head$order = 'first'
tail$order = 'last'

my.output <- rbind(head, tail)
my.output
#   Group.1 id d gr  mm area order
#1       15 15 1  2 3.4    1 first
#2       21 21 1  1 4.0    2 first
#3       22 22 1  1 4.0    2 first
#4       23 23 1  1 2.7    2 first
#5       24 24 1  1 3.0    2 first
#6       15 15 1  1 5.5    2  last
#7       21 21 1  2 3.8    2  last
#8       22 22 1  1 4.0    2  last
#9       23 23 1  2 3.0    2  last
#10      24 24 1  2 2.0    3  last

EDIT: June 18, 2016

Since posting my original answer I have learned it is better to use lapply than apply. This is because apply does not work if every group has the same number of rows. See here: Error when numbering rows by group

lmy.df = read.table(text = '
     id    d    gr     mm  area
     15    1     2   3.40     1
     15    1     1   4.90     2
     15    1     1   4.40     1
     15    1     1   5.50     2
     21    1     1   4.00     2
     21    1     2   3.80     2
     22    1     1   4.00     2
     23    1     1   2.70     2
     23    1     1   4.00     2
     23    1     2   3.00     2
     24    1     1   3.00     2
     24    1     1   2.00     3
     24    1     1   4.00     2
     24    1     2   2.00     3
', header = TRUE)


lmy.seq <- rle(lmy.df$id)$lengths
lmy.df$first <- unlist(lapply(lmy.seq, function(x) seq(1,x)))
lmy.df$last  <- unlist(lapply(lmy.seq, function(x) seq(x,1,-1)))
lmy.df

lmy.df2 <- lmy.df[lmy.df$first==1 | lmy.df$last == 1,]
lmy.df2

#   id d gr  mm area first last
#1  15 1  2 3.4    1     1    4
#4  15 1  1 5.5    2     4    1
#5  21 1  1 4.0    2     1    2
#6  21 1  2 3.8    2     2    1
#7  22 1  1 4.0    2     1    1
#8  23 1  1 2.7    2     1    3
#10 23 1  2 3.0    2     3    1
#11 24 1  1 3.0    2     1    4
#14 24 1  2 2.0    3     4    1

Here is an example in which each group has two rows:

lmy.df = read.table(text = '
     id    d    gr     mm  area
     15    1     2   3.40     1
     15    1     1   4.90     2
     21    1     1   4.00     2
     21    1     2   3.80     2
     22    1     1   4.00     2
     22    1     1   6.00     2
     23    1     1   2.70     2
     23    1     2   3.00     2
     24    1     1   3.00     2
     24    1     2   2.00     3
', header = TRUE)

lmy.seq <- rle(lmy.df$id)$lengths

lmy.df$first <- unlist(lapply(lmy.seq, function(x) seq(1,x)))
lmy.df$last  <- unlist(lapply(lmy.seq, function(x) seq(x,1,-1)))
lmy.df

lmy.df2 <- lmy.df[lmy.df$first==1 | lmy.df$last == 1,]
lmy.df2

#   id d gr  mm area first last
#1  15 1  2 3.4    1     1    2
#2  15 1  1 4.9    2     2    1
#3  21 1  1 4.0    2     1    2
#4  21 1  2 3.8    2     2    1
#5  22 1  1 4.0    2     1    2
#6  22 1  1 6.0    2     2    1
#7  23 1  1 2.7    2     1    2
#8  23 1  2 3.0    2     2    1
#9  24 1  1 3.0    2     1    2
#10 24 1  2 2.0    3     2    1

Original answer:

my.seq <- data.frame(rle(my.df$id)$lengths)

my.df$first <- unlist(apply(my.seq, 1, function(x) seq(1,x)))
my.df$last  <- unlist(apply(my.seq, 1, function(x) seq(x,1,-1)))

my.df2 <- my.df[my.df$first==1 | my.df$last == 1,]
my.df2

   id d gr  mm area first last
1  15 1  2 3.4    1     1    4
4  15 1  1 5.5    2     4    1
5  21 1  1 4.0    2     1    2
6  21 1  2 3.8    2     2    1
7  22 1  1 4.0    2     1    3
9  22 1  2 4.6    2     3    1
10 23 1  1 2.7    2     1    3
12 23 1  2 3.0    2     3    1
13 24 1  1 3.0    2     1    4
16 24 1  2 2.0    3     4    1
Russelrussell answered 26/8, 2013 at 15:12 Comment(0)
A
2

use slice_head() and slice_tail()

library(tidyverse)

tmp <- structure(list(id = c(15L, 15L, 15L, 15L, 21L, 21L, 22L, 22L, 
22L, 23L, 23L, 23L, 24L, 24L, 24L, 24L), d = c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), gr = c(2L, 1L, 
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L), mm = c(3.4, 
4.9, 4.4, 5.5, 4, 3.8, 4, 4.9, 4.6, 2.7, 4, 3, 3, 2, 4, 2), area = c(1L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L)), class = "data.frame", row.names = c(NA, 
-16L))

tmp %>%
  group_by(id) %>%
  slice_head()
# A tibble: 5 x 5
# Groups:   id [5]
     id     d    gr    mm  area
  <int> <int> <int> <dbl> <int>
1    15     1     2   3.4     1
2    21     1     1   4       2
3    22     1     1   4       2
4    23     1     1   2.7     2
5    24     1     1   3       2
tmp %>%
  group_by(id) %>%
  slice_tail()
# A tibble: 5 x 5
# Groups:   id [5]
     id     d    gr    mm  area
  <int> <int> <int> <dbl> <int>
1    15     1     1   5.5     2
2    21     1     2   3.8     2
3    22     1     2   4.6     2
4    23     1     2   3       2
5    24     1     2   2       3

Note that:

By default, slice_head() and slice_tail() return 1 row, but you can also specify the arguments n and prop with slice a number of rows or a proportion of rows respectively. See ?slice for more details.

Adrea answered 5/11, 2020 at 17:51 Comment(0)
J
1

Another approach utilizing dplyr could be:

tmp %>%
 group_by(id) %>%
 filter(1:n() %in% range(1:n()))

      id     d    gr    mm  area
   <int> <int> <int> <dbl> <int>
 1    15     1     2   3.4     1
 2    15     1     1   5.5     2
 3    21     1     1   4       2
 4    21     1     2   3.8     2
 5    22     1     1   4       2
 6    22     1     2   4.6     2
 7    23     1     1   2.7     2
 8    23     1     2   3       2
 9    24     1     1   3       2
10    24     1     2   2       3

Or the same idea with using row_number():

tmp %>%
 group_by(id) %>%
 filter(row_number() %in% range(row_number()))

Or performing the operation with slice():

tmp %>%
 group_by(id) %>%
 slice(c(which.min(1:n()), which.max(1:n())))
Jardena answered 28/2, 2020 at 7:57 Comment(2)
I think you'd only need which.min() and which.max() if you were looking for the top and bottom row of an unsorted data frame by a specific column. In this case, where it's just first and last row you can use slice(1, n())Ahola
@Gregor Thomas that is certainly true, I've just provided some variations on the existing themes :) For some users it may be more meaningful or straightforward.Jardena
J
0

We can also use ave in base R. For each id we select the first and last row.

tmp[as.logical(with(tmp,ave(d, id, FUN = function(x) 
                    seq_along(x) %in% c(1L, length(x))))), ]

#   id d gr  mm area
#1  15 1  2 3.4    1
#4  15 1  1 5.5    2
#5  21 1  1 4.0    2
#6  21 1  2 3.8    2
#7  22 1  1 4.0    2
#9  22 1  2 4.6    2
#10 23 1  1 2.7    2
#12 23 1  2 3.0    2
#13 24 1  1 3.0    2
#16 24 1  2 2.0    3

A shorter version would be using range, range returns minimum and maximum value from the vector

tmp[as.logical(with(tmp, ave(seq_along(d), id,FUN = function(x) x %in% range(x)))),]

We can also use split + sapply approach with range

tmp[c(sapply(split(seq_len(nrow(tmp)), tmp$id), range)), ]

Using dplyr, although I would prefer the slice approach shown by @rcs but here is one way using filter which is similar to ave solution where we create a logical vector by comparing row_number()

library(dplyr)
tmp %>% group_by(id) %>% filter(row_number() %in% c(1L, n()))

In all the above solution, we can also use match instead of %in% as %in% is just a wrapper around match.

Janice answered 27/8, 2019 at 5:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.