Calculate "group characteristics" without ddply and merge

Asked 17/3, 2013 at 22:49 Answered 30/11, 2022 at 7:8

I wonder whether there is a more straightforward way to calculate a certain type of variable than the approach I normally take.

The example below probably explains it best. I have a dataframe with 2 columns (fruit and whether the fruit is rotten or not). I would like to, for each row, add e.g. the percentage of fruit of the same category that is rotten. For example, there are 4 entries for apples, 2 of them are rotten, so each row for apple should read 0.5. The target values (purely as illustration) are included in the "desired outcome" column.

I have previously approached this problem by:
* using the "ddply" command on the fruit variable (with sum/length as the function), creating a new 3*2 dataframe
* using the "merge" command to link these values back into the old dataframe.

This feels like a roundabout way, and I was wondering whether there is a better/faster way of doing this. Ideally it would be a generic approach that is easily adjusted if, instead of the percentage, one needs to determine whether e.g. all fruits are rotten, any fruits are rotten, etc.

Many thanks in advance,

    Fruit Rotten Desired_Outcome_PercRotten
1   Apple      1                        0.5
2   Apple      1                        0.5
3   Apple      0                        0.5
4   Apple      0                        0.5
5    Pear      1                       0.75
6    Pear      1                       0.75
7    Pear      1                       0.75
8    Pear      0                       0.75
9  Cherry      0                          0
10 Cherry      0                          0
11 Cherry      0                          0

# create example data frame; the desired-outcome column is included purely to illustrate the target values
Fruit=c(rep("Apple",4),rep("Pear",4),rep("Cherry",3))
Rotten=c(1,1,0,0,1,1,1,0,0,0,0)
Desired_Outcome_PercRotten=c(0.5,0.5,0.5,0.5,0.75,0.75,0.75,0.75,0,0,0)
df=as.data.frame(cbind(Fruit,Rotten,Desired_Outcome_PercRotten))        
df

Lankester answered 17/3, 2013 at 22:49 Comment(2)

Related discussion on the first part of your question: https://mcmap.net/q/204213/-calculate-the-mean-by-group/636656 . Answers below are nicer because they combine the split-apply-combine operation with the merging in a single step. – Catastrophe 18/3, 2013 at 0:59

user1885116, use df <- data.frame(Fruit, Rotten, Desired_Outcome_PercRotten) to create a data.frame from scratch instead of as.data.frame with cbind. The cbind approach turns the column Rotten into a factor, which is undesirable. – Boutwell 18/3, 2013 at 13:6

You can do this with just ddply and mutate:

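library(plyr)  # provides ddply and mutate

# changed summarise to transform on joran's suggestion
# changed transform to mutate on mnel's suggestion :)
# (Rotten must be numeric here, e.g. df built with data.frame(Fruit, Rotten))
ddply(df, .(Fruit), mutate, Perc = sum(Rotten)/length(Rotten))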

#     Fruit Rotten Perc
# 1   Apple      1 0.50
# 2   Apple      1 0.50
# 3   Apple      0 0.50
# 4   Apple      0 0.50
# 5  Cherry      0 0.00
# 6  Cherry      0 0.00
# 7  Cherry      0 0.00
# 8    Pear      1 0.75
# 9    Pear      1 0.75
# 10   Pear      1 0.75
# 11   Pear      0 0.75

Boutwell answered 17/3, 2013 at 23:14 Comment(1)

I would also suggest mutate (the plyr implementation of transform, which allows you to refer to newly created columns), e.g. ddply(df, .(Fruit), mutate, percR = sum(Rotten) / length(Rotten), pp = Rotten * percR), compared to ddply(df, .(Fruit), transform, percR = sum(Rotten) / length(Rotten), pp = Rotten * percR), where pp cannot refer to percR. – Nilsson 18/3, 2013 at 0:45

data.table is super fast as it updates by reference. What about using it?

library(data.table)

dt=data.table(Fruit,Rotten,Desired_Outcome_PercRotten)

dt[,test:=sum(Rotten)/.N,by="Fruit"]
#dt
#     Fruit Rotten Desired_Outcome_PercRotten test
# 1:  Apple      1                       0.50 0.50
# 2:  Apple      1                       0.50 0.50
# 3:  Apple      0                       0.50 0.50
# 4:  Apple      0                       0.50 0.50
# 5:   Pear      1                       0.75 0.75
# 6:   Pear      1                       0.75 0.75
# 7:   Pear      1                       0.75 0.75
# 8:   Pear      0                       0.75 0.75
# 9: Cherry      0                       0.00 0.00
#10: Cherry      0                       0.00 0.00
#11: Cherry      0                       0.00 0.00
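
The same grouped assignment extends to the "any rotten" / "all rotten" variants mentioned in the question. The following is only a sketch: the table name dt2 and the new column names are made up for illustration, and the data is rebuilt so that Rotten is numeric.

```r
library(data.table)

# Rebuild the example data (same values as in the question);
# dt2 is a hypothetical name used only for this sketch
dt2 <- data.table(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

# Add several per-group columns in one pass, by reference
dt2[, `:=`(
  PercRotten = mean(Rotten),       # share of rotten fruit in the group
  AnyRotten  = any(Rotten == 1),   # TRUE if at least one is rotten
  AllRotten  = all(Rotten == 1)    # TRUE only if every one is rotten
), by = Fruit]
```

Swapping in another summary only means changing the expression on the right-hand side; the by = Fruit grouping stays the same.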

Shy answered 17/3, 2013 at 23:16 Comment(0)

One solution in base R is to use ave.

within(df, {
  ## Because of how you've created your data.frame
  ##   Rotten is actually a factor. So, we need to
  ##   convert it to numeric before we can use mean
  Rotten <- as.numeric(as.character(Rotten))
  NewCol <- ave(Rotten, Fruit)
})
    Fruit Rotten Desired_Outcome_PercRotten NewCol
1   Apple      1                        0.5   0.50
2   Apple      1                        0.5   0.50
3   Apple      0                        0.5   0.50
4   Apple      0                        0.5   0.50
5    Pear      1                       0.75   0.75
6    Pear      1                       0.75   0.75
7    Pear      1                       0.75   0.75
8    Pear      0                       0.75   0.75
9  Cherry      0                          0   0.00
10 Cherry      0                          0   0.00
11 Cherry      0                          0   0.00

or shorter:

transform(df, desired = ave(Rotten == 1, Fruit))

The default function applied with ave is mean, hence I have not included it here. However, you could specify a different function by appending FUN = some-function-here if you wanted to do something different.
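
As a rough illustration of the FUN argument, here is a sketch for the "any rotten" and "all rotten" cases from the question. It assumes the data is built with data.frame() so that Rotten is numeric; df2 and the new column names are hypothetical. Note that FUN has to be passed by name, because ave treats unnamed arguments as grouping variables.

```r
# Rebuild the example data with a numeric Rotten column
Fruit  <- c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3))
Rotten <- c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
df2 <- data.frame(Fruit, Rotten)

# Default FUN is mean: share of rotten fruit per group
df2$PercRotten <- ave(df2$Rotten, df2$Fruit)

# ave returns a vector of the same type as its input (numeric here),
# so the 0/1 results are converted back to logical
df2$AnyRotten <- as.logical(
  ave(df2$Rotten, df2$Fruit, FUN = function(v) any(v == 1))
)
df2$AllRotten <- as.logical(
  ave(df2$Rotten, df2$Fruit, FUN = function(v) all(v == 1))
)
```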

Sonja answered 18/3, 2013 at 4:44 Comment(0)

As ave is already out, let me add one solution using my base R function of choice: aggregate.

You can get the desired data simply with:

aggregate(as.numeric(as.character(Rotten)) ~ Fruit, df, mean)

However, you still need to merge the result back into the original data afterwards (or do both in one step):

merge(df, aggregate(as.numeric(as.character(Rotten)) ~ Fruit, df, mean))
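
If Rotten is already numeric, the conversion can be dropped and the aggregated column given a readable name before merging. This is a sketch under that assumption; df2, agg, merged and PercRotten are names chosen only for this example.

```r
# Example data with a numeric Rotten column
df2 <- data.frame(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

# One row per fruit with the group mean
agg <- aggregate(Rotten ~ Fruit, data = df2, FUN = mean)
names(agg)[2] <- "PercRotten"   # rename so it does not clash with Rotten

# Join the group-level value back onto every original row
merged <- merge(df2, agg, by = "Fruit")
```

Note that merge sorts the result by the key column by default, so the row order can differ from the original data frame.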

Umbrella answered 18/3, 2013 at 13:24 Comment(0)

Using dplyr -

For dplyr version 1.1.0 and above -

library(dplyr)

df %>% mutate(Result = sum(Rotten)/n(), .by = Fruit)

#    Fruit Rotten Desired_Outcome_PercRotten Result
#1   Apple      1                       0.50   0.50
#2   Apple      1                       0.50   0.50
#3   Apple      0                       0.50   0.50
#4   Apple      0                       0.50   0.50
#5    Pear      1                       0.75   0.75
#6    Pear      1                       0.75   0.75
#7    Pear      1                       0.75   0.75
#8    Pear      0                       0.75   0.75
#9  Cherry      0                       0.00   0.00
#10 Cherry      0                       0.00   0.00
#11 Cherry      0                       0.00   0.00

The .by argument groups the data only for that one operation, which avoids separate group_by and ungroup calls.

For dplyr version < 1.1.0

df %>%
  group_by(Fruit) %>%
  mutate(Result = sum(Rotten)/n()) %>%
  ungroup()
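
The same pattern covers the other group characteristics from the question. The sketch below uses group_by/ungroup so it should run on older and newer dplyr versions alike; df2, res and the new column names are hypothetical and used only for illustration.

```r
library(dplyr)

# Example data with a numeric Rotten column
df2 <- data.frame(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

res <- df2 %>%
  group_by(Fruit) %>%
  mutate(
    PercRotten = mean(Rotten),       # share of rotten fruit in the group
    AnyRotten  = any(Rotten == 1),   # TRUE if at least one is rotten
    AllRotten  = all(Rotten == 1)    # TRUE only if every one is rotten
  ) %>%
  ungroup()
```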

data

df <-  data.frame(Fruit,Rotten,Desired_Outcome_PercRotten)

Divergency answered 30/11, 2022 at 7:8 Comment(0)
