Summarize with dplyr "other than" groups

Asked 6/4, 2016 at 11:48 Answered 6/4, 2016 at 22:35

I need to summarize a grouped data_frame (note: a dplyr solution is very much appreciated but isn't mandatory), computing both something on each group (simple) and the same something on the "other" groups.

minimal example

if(!require(pacman)) install.packages("pacman")
pacman::p_load(dplyr)

df <- data_frame(
    group = c('a', 'a', 'b', 'b', 'c', 'c'),
    value = c(1, 2, 3, 4, 5, 6)
)

res <- df %>%
    group_by(group) %>%
    summarize(
        median        = median(value)
#        median_other  = ... ??? ... # I need the median of all "other"
                                     # groups
#        median_before = ... ??? ... # I need the median of groups (e.g.
                                 #    the "before" in alphabetic order,
                                 #    but clearly every rule which is
                                 #    a "selection function" depending
                                 #    on the actual group is fine)
    )

my expected result is the following

group    median    median_other    median_before
  a        1.5         4.5               NA
  b        3.5         3.5               1.5
  c        5.5         2.5               2.5

I've searched Google for strings similar to "dplyr summarize excluding groups" and "dplyr summarize other then group", and I've searched the dplyr documentation, but I wasn't able to find a solution.

This question (How to summarize value not matching the group using dplyr) does not apply here because its solution works only for sum, i.e. it is "function-specific" (and relies on a simple arithmetic function that does not consider the variability within each group). What about more complex functions (e.g. mean, sd, or a user-defined function)? :-)

Thanks to all

PS: summarize() is just an example; the same question applies to mutate() and other dplyr functions that work on groups.

Client answered 6/4, 2016 at 11:48 Comment(2)

You can't just use library(dplyr) instead of the first two lines? – Interception 6/4, 2016 at 22:50

If dplyr isn't installed on your system, library(dplyr) returns an error, so to be sure that anyone can run the code I had to write two lines anyway. I decided to use pacman instead, which is a very useful package in my opinion, because it loads (and installs if needed) many packages at once with just those two lines. – Client 9/4, 2016 at 13:58

Here's my solution:

res <- df %>%
  group_by(group) %>%
  summarise(med_group = median(value),
            med_other = (median(df$value[df$group != group]))) %>% 
  mutate(med_before = lag(med_group))

> res
Source: local data frame [3 x 4]

      group med_group med_other med_before
  (chr)     (dbl)     (dbl)      (dbl)
1     a       1.5       4.5         NA
2     b       3.5       3.5        1.5
3     c       5.5       2.5        3.5

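I was trying to come up with an all-dplyr solution, but base R subsetting works just fine: median(df$value[df$group != group]) returns the median of all observations that are not in the current group. Note that lag(med_group) only picks up the median of the immediately preceding group, which is why med_before for group c is 3.5 here rather than the 2.5 in the expected output.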
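As a variation on the same idea, here is a minimal sketch that also computes med_before from the raw values of all groups sorting before the current one. It assumes a current dplyr, where tibble() replaces the deprecated data_frame(), and it uses group[1] (the current group's label as a single value) so the comparison does not depend on vector recycling. The `<` comparison on group labels is an assumption about what "before" means (alphabetical order, as in the question).

```r
library(dplyr)

df <- tibble(
  group = c("a", "a", "b", "b", "c", "c"),
  value = c(1, 2, 3, 4, 5, 6)
)

res <- df %>%
  group_by(group) %>%
  summarise(
    med_group  = median(value),
    # group[1] is the label of the current group, so this selects
    # every row of the original data that belongs to another group
    med_other  = median(df$value[df$group != group[1]]),
    # rows of all groups sorting before the current one; median() of
    # an empty vector is NA, which covers the first group
    med_before = median(df$value[df$group < group[1]])
  )

print(res)
```

Because the subsetting happens on the original df, any function of the pooled values (sd, mean, a user-defined function) can be swapped in for median.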

I hope this helps you solve your problem.

Gules answered 6/4, 2016 at 22:35 Comment(3)

Sorry for the late response. This does not really help me much: it takes the median of the other medians, not of the other values, so the issue is the same. – Client 14/4, 2016 at 18:38

Suppose the c group is c(5, 6, 7). Your first med_other computes median(c(median(c(3, 4)), median(c(5, 6, 7)))), which is different from median(c(3, 4, 5, 6, 7)). – Client 14/4, 2016 at 18:39

@Client I adapted the answer to compute the median_other variable from the original dataset excluding the current group – Gules 15/4, 2016 at 14:58

I don't think it is in general possible to perform operations on other groups within summarise() (i.e. I think the other groups are not "visible" when summarising a certain group). You can define your own functions and use them in mutate to apply them to a certain variable. For your updated example you can use

calc_med_other <- function(x) sapply(seq_along(x), function(i) median(x[-i]))
calc_med_before <- function(x) sapply(seq_along(x), function(i) ifelse(i == 1, NA, median(x[seq(i - 1)])))

df %>%
    group_by(group) %>%
    summarize(med = median(value)) %>%
    mutate(
        med_other = calc_med_other(med),
        med_before = calc_med_before(med)
    )
#   group   med med_other med_before
#   (chr) (dbl)     (dbl)      (dbl)
#1     a   1.5       4.5         NA
#2     b   3.5       3.5        1.5
#3     c   5.5       2.5        2.5
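If the statistic has to be computed on the raw values of the other groups rather than on the per-group summaries, one option is to loop over the group labels in base R. The following is only a sketch: summarise_others() is a hypothetical helper (not part of dplyr), it assumes columns named group and value, and the example data uses the c(5, 6, 7) variant of group c mentioned in the comments.

```r
# Hypothetical helper: apply `fun` to each group's own values and to
# the pooled raw values of all other groups.
summarise_others <- function(data, fun) {
  groups <- unique(data$group)
  data.frame(
    group = groups,
    own   = vapply(groups, function(g) fun(data$value[data$group == g]),
                   numeric(1), USE.NAMES = FALSE),
    other = vapply(groups, function(g) fun(data$value[data$group != g]),
                   numeric(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

df2 <- data.frame(
  group = c("a", "a", "b", "b", "c", "c", "c"),
  value = c(1, 2, 3, 4, 5, 6, 7),
  stringsAsFactors = FALSE
)

res_median <- summarise_others(df2, median)
res_sd     <- summarise_others(df2, sd)

print(res_median)
print(res_sd)
```

Passing the function as an argument keeps the helper independent of the statistic, so median, sd, mean, or a user-defined function all go through the same code path.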

Electromagnetic answered 6/4, 2016 at 12:6 Comment(3)

Oh, this is a very nice solution (+1), but it highlights a misunderstanding (or a bad explanation in my example): the max is defined on every single group, and max(max(group1), max(group2)) is equal to max(union(group1, group2)). Replacing max with the mean or the sd can (I hope) give a more precise idea of my question. (I "have to" use all the information of the "other" groups to answer each row.) – Client 6/4, 2016 at 20:16

With the mean it is possible to take the multiplicity into account and recalculate the "others" mean from the mean of each "other" group and the number of elements in that group, so it is another bad example. Maybe the median (or the sd, as I just said) is a good way to pose the problem (if there is such a solution). It has to be a function of the union of the "other" groups that needs (some) information about that set as a "single" set. – Client 6/4, 2016 at 20:37

I've just edited the question, changing max to median. – Client 6/4, 2016 at 20:51