Create combinations by group and sum
Asked Answered
M

3

5

I have data of names within an ID number along with a number of associated values. It looks something like this:

structure(list(id = c("a", "a", "b", "b"), name = c("bob", "jane", 
"mark", "brittney"), number = c(1L, 2L, 1L, 2L), value = c(1L, 
2L, 1L, 2L)), class = "data.frame", row.names = c(NA, -4L))

#   id     name number value
# 1  a      bob      1     1
# 2  a     jane      2     2
# 3  b     mark      1     1
# 4  b brittney      2     2

I would like to create all the combinations of name, regardless of how many there are, and paste them together separated with commas, and sum their number and value within each id. The desired output from the example above is then:

structure(list(id = c("a", "a", "a", "b", "b", "b"), name = c("bob", 
"jane", "bob, jane", "mark", "brittney", "mark, brittney"), number = c(1L, 
2L, 3L, 1L, 2L, 3L), value = c(1L, 2L, 3L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA, -6L))

#   id           name number value
# 1  a            bob      1     1
# 2  a           jane      2     2
# 3  a      bob, jane      3     3
# 4  b           mark      1     1
# 5  b       brittney      2     2
# 6  b mark, brittney      3     3

Thanks all!

Monanthous answered 1/7, 2022 at 1:22 Comment(0)
S
4

You can create pairwise indices using combn() and expand the data frame with these using slice(). Then just group by these row pairs and summarise. I'm assuming you want pairwise combinations but this can be adapted for larger sets if needed. Some code to handle groups < 2 is included but can be removed if these don't exist in your data.

library(dplyr)
library(purrr)

df1 %>%
  group_by(id) %>%
  slice(c(combn(seq(n()), min(n(), 2)))) %>%
  mutate(id2 = (row_number()-1) %/% 2) %>%
  group_by(id, id2) %>%
  summarise(name = toString(name),
            across(where(is.numeric), sum), .groups = "drop") %>%
  select(-id2) %>%
  bind_rows(df1 %>%
              group_by(id) %>%
              filter(n() > 1), .) %>%
  arrange(id) %>%
  ungroup()

# A tibble: 6 × 4
  id    name           number value
  <chr> <chr>           <int> <int>
1 a     bob                 1     1
2 a     jane                2     2
3 a     bob, jane           3     3
4 b     mark                1     1
5 b     brittney            2     2
6 b     mark, brittney      3     3

Edit:

To adapt for all possible combinations you can iterate over the values up to the max group size. Using edited data which has a couple of rows added to the first group:

map_df(seq(max(table(df2$id))), ~
         df2 %>%
         group_by(id) %>%
         slice(c(combn(seq(n()), .x * (.x <= n())))) %>%
         mutate(id2 = (row_number() - 1) %/% .x) %>%
         group_by(id, id2) %>%
         summarise(name = toString(name),
                   across(where(is.numeric), sum), .groups = "drop")
       ) %>%
  select(-id2) %>%
  arrange(id)

# A tibble: 18 × 4
   id    name                      number value
   <chr> <chr>                      <int> <int>
 1 a     bob                            1     1
 2 a     jane                           2     2
 3 a     sophie                         1     1
 4 a     jeremy                         2     2
 5 a     bob, jane                      3     3
 6 a     bob, sophie                    2     2
 7 a     bob, jeremy                    3     3
 8 a     jane, sophie                   3     3
 9 a     jane, jeremy                   4     4
10 a     sophie, jeremy                 3     3
11 a     bob, jane, sophie              4     4
12 a     bob, jane, jeremy              5     5
13 a     bob, sophie, jeremy            4     4
14 a     jane, sophie, jeremy           5     5
15 a     bob, jane, sophie, jeremy      6     6
16 b     mark                           3     5
17 b     brittney                       4     6
18 b     mark, brittney                 7    11

Data for df2:

df2 <- structure(list(id = c("a", "a", "a", "a", "b", "b"), name = c("bob", 
                                                                     "jane", "sophie", "jeremy", "mark", "brittney"), number = c(1L, 
                                                                                                                                 2L, 1L, 2L, 3L, 4L), value = c(1L, 2L, 1L, 2L, 5L, 6L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                              -6L))
Subdue answered 1/7, 2022 at 2:1 Comment(4)
Thank you! This is great, clever. I do need all combinations of names, not just pairwise, within each id (apologies for the too-minimal example). So changing that is lines 3 and 4? Did you have a specific tweak in mind?Monanthous
Can you expand on your expected output? If for example a group contained 4 names, would you want the output to contain 4 (single) + 6 (pair combination) + 4 (triplet combinations) + 1 (quad) rows (i.e. 15 rows for that group)?Subdue
@Monanthous - see edit.Subdue
Thank you! I really appreciate the help. This is super clever and something I will learn a lot from.Monanthous
A
5

You could use group_modify() + add_row():

library(dplyr)

df %>%
  group_by(id) %>%
  group_modify( ~ .x %>%
    summarise(name = toString(name), across(c(number, value), sum)) %>%
    add_row(.x, .)
  ) %>%
  ungroup()

# # A tibble: 6 × 4
#   id    name           number value
#   <chr> <chr>           <int> <int>
# 1 a     bob                 1     1
# 2 a     jane                2     2
# 3 a     bob, jane           3     3
# 4 b     mark                1     1
# 5 b     brittney            2     2
# 6 b     mark, brittney      3     3
Adames answered 1/7, 2022 at 1:55 Comment(3)
I don't think this solves OPs problem - they're asking for combinations of cases. OPs example data and output was perhaps a bit too minimal.Subdue
@RitchieSacramento Yes, maybe. A simplified description of questions leads to a simplified answer. Thanks for the useful comment.Adames
Thanks Darren! Apologies for the too minimal example. This is great though. Yes; I need all combinations of names within each id, so I don't think this will work. Upvoted and appreciated though!Monanthous
S
4

You can create pairwise indices using combn() and expand the data frame with these using slice(). Then just group by these row pairs and summarise. I'm assuming you want pairwise combinations but this can be adapted for larger sets if needed. Some code to handle groups < 2 is included but can be removed if these don't exist in your data.

library(dplyr)
library(purrr)

df1 %>%
  group_by(id) %>%
  slice(c(combn(seq(n()), min(n(), 2)))) %>%
  mutate(id2 = (row_number()-1) %/% 2) %>%
  group_by(id, id2) %>%
  summarise(name = toString(name),
            across(where(is.numeric), sum), .groups = "drop") %>%
  select(-id2) %>%
  bind_rows(df1 %>%
              group_by(id) %>%
              filter(n() > 1), .) %>%
  arrange(id) %>%
  ungroup()

# A tibble: 6 × 4
  id    name           number value
  <chr> <chr>           <int> <int>
1 a     bob                 1     1
2 a     jane                2     2
3 a     bob, jane           3     3
4 b     mark                1     1
5 b     brittney            2     2
6 b     mark, brittney      3     3

Edit:

To adapt for all possible combinations you can iterate over the values up to the max group size. Using edited data which has a couple of rows added to the first group:

map_df(seq(max(table(df2$id))), ~
         df2 %>%
         group_by(id) %>%
         slice(c(combn(seq(n()), .x * (.x <= n())))) %>%
         mutate(id2 = (row_number() - 1) %/% .x) %>%
         group_by(id, id2) %>%
         summarise(name = toString(name),
                   across(where(is.numeric), sum), .groups = "drop")
       ) %>%
  select(-id2) %>%
  arrange(id)

# A tibble: 18 × 4
   id    name                      number value
   <chr> <chr>                      <int> <int>
 1 a     bob                            1     1
 2 a     jane                           2     2
 3 a     sophie                         1     1
 4 a     jeremy                         2     2
 5 a     bob, jane                      3     3
 6 a     bob, sophie                    2     2
 7 a     bob, jeremy                    3     3
 8 a     jane, sophie                   3     3
 9 a     jane, jeremy                   4     4
10 a     sophie, jeremy                 3     3
11 a     bob, jane, sophie              4     4
12 a     bob, jane, jeremy              5     5
13 a     bob, sophie, jeremy            4     4
14 a     jane, sophie, jeremy           5     5
15 a     bob, jane, sophie, jeremy      6     6
16 b     mark                           3     5
17 b     brittney                       4     6
18 b     mark, brittney                 7    11

Data for df2:

df2 <- structure(list(id = c("a", "a", "a", "a", "b", "b"), name = c("bob", 
                                                                     "jane", "sophie", "jeremy", "mark", "brittney"), number = c(1L, 
                                                                                                                                 2L, 1L, 2L, 3L, 4L), value = c(1L, 2L, 1L, 2L, 5L, 6L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                              -6L))
Subdue answered 1/7, 2022 at 2:1 Comment(4)
Thank you! This is great, clever. I do need all combinations of names, not just pairwise, within each id (apologies for the too-minimal example). So changing that is lines 3 and 4? Did you have a specific tweak in mind?Monanthous
Can you expand on your expected output? If for example a group contained 4 names, would you want the output to contain 4 (single) + 6 (pair combination) + 4 (triplet combinations) + 1 (quad) rows (i.e. 15 rows for that group)?Subdue
@Monanthous - see edit.Subdue
Thank you! I really appreciate the help. This is super clever and something I will learn a lot from.Monanthous
B
2

A data.table option

setDT(df)[
  ,
  lapply(
    .SD,
    function(x) {
      unlist(
        lapply(
          seq_along(x),
          combn,
          x = x,
          function(v) {
            ifelse(all(is.character(v)), toString, sum)(v)
          }
        )
      )
    }
  ),
  id
]

gives

   id           name number value
1:  a            bob      1     1
2:  a           jane      2     2
3:  a      bob, jane      3     3
4:  b           mark      1     1
5:  b       brittney      2     2
6:  b mark, brittney      3     3
Backfill answered 1/7, 2022 at 13:16 Comment(1)
This is great! Thank you! Upvoted. Works and generalizes to cases with many values for names. I already accepted the other answer but will learn from this and it is great to have for others' future reference.Monanthous

© 2022 - 2024 — McMap. All rights reserved.