average between duplicated rows in R
Asked Answered
K

5

12

I have a data frame df with rows that are duplicates for the names column but not for the values column:

name    value   etc1    etc2
A       9       1       X
A       10      1       X
A       11      1       X
B       2       1       Y
C       40      1       Y
C       50      1       Y

I need to aggregate the duplicate names into one row, while calculating the mean over the values column. The expected output is as follows:

name    value   etc1    etc2
A       10      1       X
B       2       1       Y
C       45      1       Y

I have tried to use df[duplicated(df$name),] but of course this does not give me the mean over the duplicates. I would like to use aggregate(), but the problem is that the FUN part of this function will apply to all the other columns as well, and among other problems, it will not be able to compute char content. Since all the other columns have the same content over the "duplicates", I need them to be aggregated as is just like the name column. Any hints...?

Kaiserslautern answered 29/6, 2013 at 18:51 Comment(3)
Are your other columns etcX also guaranteed to be the same for rows with the same name?Raspy
@HongOoi Yes I believe so, I filtered them in the previous step to get similar contents.Kaiserslautern
@agstudy Sorry, the expected output was already included in my post, but I forgot to say so in my sentence. This has been corrected by user Metrics.Kaiserslautern
I
15

Here a data.table solution. The solution is general in the sense it will work even for a data.frame with 60 columns. Since I group the data by all variables different of value( See how I create keys below)

library(data.table)
dat <- read.table(text='name    value   etc1    etc2
A       9       1       X
A       10      1       X
A       11      1       X
B       2       1       Y
C       40      1       Y
C       50      1       Y',header=TRUE)
keys <- colnames(dat)[!grepl('value',colnames(dat))]
X <- as.data.table(dat)
X[,list(mm= mean(value)),keys]
  name etc1 etc2 mm
1:    A    1    X 10
2:    B    1    Y  2
3:    C    1    Y 45

EDIT extend to more than one value variable

In case you have more than one numeric variables on which you want to compute the mean , For example, if your data look like this

  name value etc1 etc2     value1
1    A     9    1    X  2.1763485
2    A    10    1    X -0.7954326
3    A    11    1    X -0.5839844
4    B     2    1    Y -0.5188709
5    C    40    1    Y -0.8300233
6    C    50    1    Y -0.7787496

The above solution can be extended like this :

X[,lapply(.SD,mean),keys]
   name etc1 etc2 value     value1
1:    A    1    X    10  0.2656438
2:    B    1    Y     2 -0.5188709
3:    C    1    Y    45 -0.8043865

This will compute the mean for all variables that don't exist in keys list.

Intermediary answered 29/6, 2013 at 20:11 Comment(0)
I
12

You can use aggregate() function like below:

aggregate(df$value,by=list(name=df$name,etc1=df$etc1,etc2=df$etc2),data=df,FUN=mean)
Independency answered 17/2, 2015 at 14:8 Comment(1)
You don't need to use df$ inside aggregate because it has a data argument where you specify the data source (df in this case). There is also a formula interface for aggregate that may be easier to read.Janettajanette
D
2

The code (written by Metrics) is almost working except in one place (.name). I slightly modified it:

sample<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L, 
    50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name", 
    "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, 
    -6L))

sample.m <- ddply(sample, 'name', summarize, value =mean(value), etc1=head(etc1,1), etc2=head(etc2,1))

sample.m
      name value etc1 etc2
    1    A    10    1    X
    2    B     2    1    Y
    3    C    45    1    Y
Detoxify answered 29/6, 2013 at 19:34 Comment(2)
Many thanks! Would you happen to know if there is a shortcut I can use so that I don't have to enter the names of all the other columns? There are actually many more than in the example I gave.Kaiserslautern
@Intermediary I didn't know this either. Thanks for pointing this out.Kaiserslautern
D
1

Assuming your dataframe is df.

install.packages("plyr")
library(plyr)



df<- structure(list(name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", 
    "B", "C"), class = "factor"), value = c(9L, 10L, 11L, 2L, 40L, 
    50L), etc1 = c(1L, 1L, 1L, 1L, 1L, 1L), etc2 = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")), .Names = c("name", 
    "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, 
    -6L))

df.m<-ddply(df,.(name),summarize, value=mean(value),etc1=head(etc1,1),etc2=head(etc2,1))

df.m
 name value etc1 etc2
1    A      10    1    X
2    B       2    1    Y
3    C      45    1    Y
Dart answered 29/6, 2013 at 18:58 Comment(1)
Thanks a lot! There is one caveat though. In the real data set I have about 60 other columns that do not change over the duplicates and should be left intact, do I need to enter them one by one as arguments of ddply() or is there a shortcut technique?Kaiserslautern
C
0

This simple one worked for me:

avg_data <- aggregate( . ~ name, df, mean)

Using the "aggregate" function: apply the formula method ( x ~ y ) for all variables (.) based on the naming variable ("name"), within the data.frame "df", to perform the "mean" function.

Copt answered 10/1, 2023 at 19:33 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.