How to lookup and sum multiple columns in R

Asked 18/10, 2019 at 13:8 Answered 2/7 at 20:4

Suppose I have 2 dataframes structured as such:

GROUPS:

P1      P2      P3      P4
123     213     312     231
345     123     213     567

INDIVIDUAL_RESULTS:

ID      SCORE
123     23
213     12
312     11
213     19
345     10
567     22

I want to add a column to the GROUPS which is a sum of each of their individual results:

P1      P2      P3      P4      SCORE
123     213     312     231     65

I've tried using various merge techniques, but have really just created a mess. I feel like there's a simple solution I just don't know about, would really appreciate some guidance!

Poetry answered 18/10, 2019 at 13:8 Comment(0)

d1=read.table(text="
P1      P2      P3      P4
123     213     312     231
345     123     213     567",h=T)

d2=read.table(text="
ID      SCORE
123     23
213     12
312     11
231     19
345     10
567     22",h=T)

I will be using the apply and match functions. Apply will apply the match function to each row of d1, match will find the matching values from the row of d1 and d2$ID (their indices) and then take the values in d2$SCORE at those indices. In the end we sum them up.

d1$SCORE=apply(d1,1,function(x){
  sum(d2$SCORE[match(x,d2$ID)])
})

and the result

   P1  P2  P3  P4 SCORE
1 123 213 312 231    65
2 345 123 213 567    67

Quadrangle answered 18/10, 2019 at 13:16 Comment(3)

Perhaps you could explain the logic behind the function you pass to apply in plain old English. – Titter 18/10, 2019 at 13:27

With apply, is there a way to specify a range of columns to apply the function to rather than the entire row? – Poetry 19/10, 2019 at 13:30

@Poetry You can do it inside the anonymous function apply(d1,1,function(x){some_function(x[2:5])}), which will do something only on the columns from 2 to 5. – Quadrangle 19/10, 2019 at 14:15

I would try a slow but could be an intuitive way for new users. I think the difficulty was created by the format of your data d1. If you do a little bit of tidy up:

library(tidyverse)
d1<-data.frame(t(d1))
colnames(d1) <-c("group1", "group2")
d1$P = row.names(d1)
d1<-d1 %>% 
  pivot_longer(
    cols = group1:group2, 
    names_to = "Group",
    values_to = "ID"
  )  

df <-left_join(d1, d2, by ="ID")
df

# A tibble: 8 x 4
  P     Group     ID SCORE
  <chr> <chr>  <int> <int>
1 P1    group1   123    23
2 P1    group2   345    10
3 P2    group1   213    12
4 P2    group2   123    23
5 P3    group1   312    11
6 P3    group2   213    12
7 P4    group1   231    19
8 P4    group2   567    22

Once you get the data to this more "conventional" format, we can easily work out a tidyverse solution.

df  %>% 
  group_by(Group) %>% 
  summarize(SCORE = sum(SCORE))
# A tibble: 2 x 2
  Group  SCORE
  <chr>  <int>
1 group1    65
2 group2    67

Bony answered 18/10, 2019 at 14:30 Comment(0)

Another possibility is to reformat the first data.frame to contain the group and subgroup Information:

groups <- tidyr::gather(d1,name,number,P1:P4)

These information could be added to the second data.frame and could be further used for different analyses. Such as aggregrations.

d2_groups <- merge(groups, d2, by.x = "number",by.y = "ID")

aggregate(d2_groups$SCORE,  by=list(groups = d2_groups$name), FUN=sum)

Dray answered 18/10, 2019 at 14:29 Comment(0)

Here is a dplyr solution that turns INDIVIDUAL_RESULTS into a named vector, where the names are ID and the values are SCORE. Then going row by row, you can look up those IDs and extract their values to be summed into a new column:

library(dplyr)

GROUPS |>
  rowwise() |>
  mutate(SCORE = sum(pull(INDIVIDUAL_RESULTS, SCORE, ID)[as.character(c_across(P1:P4))])) |>
  ungroup()

Note: this assumes that name-value pairs in INDIVIDUAL_RESULTS are unique (e.g. ID is not repeated).

Quadri answered 2/7 at 20:4 Comment(0)

Recommended topics

Hot tags