Using group by to get the value corresponding to the max value of another column
Asked Answered
M

1

7

In my dataset respondents are grouped together and there is data available about their age. I want all the people in the same group to have the value of the oldest person in that group.

So my example data looks like this.

      df <- data.frame(groups = c(1,1,1,2,2,2,3,3,3), 
                       age = c(12, 23, 34, 13, 24, 35, 13, 25, 36), 
                       value = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

> df
  groups age value
1      1  12     1
2      1  23     2
3      1  34     3
4      2  13     4
5      2  24     5
6      2  35     6
7      3  13     7
8      3  25     8
9      3  36     9

And I want it to look this this

> df
  groups age value new_value
1      1  12     1         3
2      1  23     2         3
3      1  34     3         3
4      2  13     4         6
5      2  24     5         6
6      2  35     6         6
7      3  13     7         9
8      3  25     8         9
9      3  36     9         9

Any idea how to do this with dplyr?

I have tried something like this, but it doesn't work

df %>% 
        group_by(groups) %>% 
        mutate(new_value = df$value[which.max(df$age)])
Methylal answered 30/11, 2021 at 13:25 Comment(0)
I
11

Up front, "never" (okay, almost never) use df$ within a dplyr pipe. In this case, df$value[which.max(df$age)] is referencing the original data each time, not the grouped data. Inside each group in this dataset, value is length 3 whereas df$value is length 9.

The only times I feel it is appropriate to use df$ (referencing the original value of the current dataset) inside a pipe is when it is required to look at pre-pipeline data, in absence of any grouping, reordering, or new variables created outside of the currently-saved (pre-pipeline) version of df.

dplyr

library(dplyr)
df %>%
  group_by(groups) %>%
  mutate(new_value = value[which.max(age)]) %>%
  ungroup()
# # A tibble: 9 x 4
#   groups   age value new_value
#    <dbl> <dbl> <dbl>     <dbl>
# 1      1    12     1         3
# 2      1    23     2         3
# 3      1    34     3         3
# 4      2    13     4         6
# 5      2    24     5         6
# 6      2    35     6         6
# 7      3    13     7         9
# 8      3    25     8         9
# 9      3    36     9         9

data.table

library(data.table)
DT <- as.data.table(df)
DT[, new_value := value[which.max(age)], by = .(groups)]

base R

df$new_value <- ave(seq_len(nrow(df)), df$groups,
                    FUN = function(i) df$value[i][which.max(df$age[i])])
df
#   groups age value new_value
# 1      1  12     1         3
# 2      1  23     2         3
# 3      1  34     3         3
# 4      2  13     4         6
# 5      2  24     5         6
# 6      2  35     6         6
# 7      3  13     7         9
# 8      3  25     8         9
# 9      3  36     9         9

The base R approach seems to be the least-elegant-looking solution. I believe that ave is the best approach, but it has many limitations, first being that it only works on one value-object (value) in the absence of others (we need to know age).

Incorporating answered 30/11, 2021 at 13:27 Comment(2)
Wow that was quick, thanks! Looks like I was really close with my own solution... The dplyr method works, but is really slow. Data.table is much quicker.Methylal
For many things, you will likely find that data.table is much faster than similar base or dplyr solutions; I tend to prefer data.table for its speed and referential semantics, but its learning curve is non-trivial. See my top-note edit about why your approach was close but flawed. Glad it helped.Incorporating

© 2022 - 2024 — McMap. All rights reserved.