R Dplyr mutate, calculating standard deviation for each row

I am trying to calculate the mean and standard deviation from certain columns in a data frame, and return those values to new columns in the data frame. I can get this to work for mean:

library(dplyr)
mtcars = mutate(mtcars, mean=(hp+drat+wt)/3)

However, when I try to do the same for standard deviation, I run into a problem, because I cannot easily hardcode the equation as I did for the mean. So I tried to use a function instead:

mtcars = mutate(mtcars, mean=(hp+drat+wt)/3, stdev = sd(hp,drat,wt))

This results in the error "Error in sd(hp, drat, wt) : unused argument (wt)". How can I correct my syntax? Thank you.

Brutus answered 11/4, 2015 at 18:38 Comment(4)
To calculate the mean you wrote out the formula yourself, but to calculate the SD you used the built-in sd function in some strange way. Doesn't that look inconsistent to you? - Conflux
Yes, that is why I stated "when I try to do the same for standard deviation, I have an issue, because I cannot hardcode the equation like I did for mean very easily. So, I try to use a function." I am not sure why you think I used the sd function in some strange way, even though I am sure that is true. The sd function seems to take a numeric vector, for instance sd(c(3,5,6)). Even though I am sure it is obvious to you, why is what I am doing not correct? Thanks. - Brutus
Perhaps what @DavidArenburg is suggesting is that your call to sd is incorrect, which it is, in a commonly mistaken way. For instance, try sd(1,2,3), then read ?sd and see (1) that it describes the first argument as "x: a numeric vector", and (2) that it specifically does not include "..." (an ellipsis, which would allow an arbitrary number of arguments, as you are providing). - Densmore
@Brutus Using + to get the mean may not work as expected if there are NAs. Both mean and rowMeans have an option for removing NAs, i.e. na.rm=TRUE. - Langue
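The two points raised in these comments can be checked in base R. The sketch below is only illustrative: the small matrix `vals` and the variable names are made up for the example, with the first row copied from the first row of mtcars. It shows that sd() accepts a single numeric vector (its second formal argument is na.rm, so a third positional argument is reported as unused), and that a sum-based mean propagates NA while rowMeans(..., na.rm = TRUE) does not.

```r
# sd() has the signature sd(x, na.rm = FALSE): one numeric vector, no "...".
# In sd(1, 2, 3) the 2 is matched to na.rm and the 3 has nowhere to go.
err <- tryCatch(sd(1, 2, 3), error = function(e) conditionMessage(e))

# Wrapping the values in c() passes them as the single vector sd() expects.
first_row_sd <- sd(c(110, 3.90, 2.620))   # hp, drat, wt of the first mtcars row

# Hypothetical data with one missing hp value, to compare the two ways of averaging.
vals <- cbind(hp = c(110, NA), drat = c(3.90, 3.90), wt = c(2.620, 2.875))
sum_based  <- (vals[, "hp"] + vals[, "drat"] + vals[, "wt"]) / 3  # NA wherever any input is NA
mean_based <- rowMeans(vals, na.rm = TRUE)                        # averages the non-missing values

print(err)
print(first_row_sd)
print(sum_based)
print(mean_based)
```

Note that with na.rm = TRUE the second row is averaged over two values rather than three, which may or may not be what is wanted.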

You could try

library(dplyr)
library(matrixStats)  # provides rowSds()
nm1 <- c('hp', 'drat', 'wt')  # columns to summarise per row
res1 <- mtcars %>%
  mutate(Mean = rowMeans(.[nm1]), stdev = rowSds(as.matrix(.[nm1])))

head(res1,3)
#   mpg cyl disp  hp drat    wt  qsec vs am gear carb     Mean    stdev
#1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 38.84000 61.62969
#2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 38.92500 61.55489
#3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 33.05667 51.91809

Or using do() with rowwise():

res2 <- mtcars %>%
  rowwise() %>%
  # build a one-row data frame per input row, appending the two summaries
  do(data.frame(., Mean = mean(unlist(.[nm1])),
                stdev = sd(unlist(.[nm1]))))

head(res2,3)
#   mpg cyl disp  hp drat    wt  qsec vs am gear carb     Mean    stdev
#1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 38.84000 61.62969
#2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 38.92500 61.55489
#3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 33.05667 51.91809
Langue answered 11/4, 2015 at 18:56 Comment(10)
@akrun Thanks, but when I run your first code, I get the error "Error in .[nm1] : object of type 'closure' is not subsettable". - Brutus
@Brutus I am not sure what the problem is. Are you using a recent version of dplyr? I used dplyr_0.4.1.9000. - Langue
Thanks @akrun. I just ran install.packages("dplyr"), and sessionInfo() then showed version dplyr_0.4.1. I reran the code and got the same error! - Brutus
@Brutus Can you try mtcars %>% mutate(.. as in the update? - Langue
You're selecting the columns, so you should change as.matrix(.[nm1]) to as.matrix(.[ ,nm1]). - Rancor
@EhsanM.Kermani We selected the columns from a data.frame, for which .[nm1] returns the columns by default, and only then converted the result to a matrix. If it were already a matrix, then .[, nm1] would be the right way. So in this case either one works. If in doubt, please check the result of both; they should be the same. - Langue
I get a bunch of warnings using the rowwise() function, but if I use group_by(row_number()) (or some other explicit row ID) those warnings go away. - Flocculant
@BrianD It is the deprecation warning "`do()` is deprecated as of dplyr 1.0.0." This is an old post. The package gets updated with new functions, and old functions are deprecated. - Langue
Ah, I was using dplyr 0.8.3 and R 3.5.3. - Flocculant
That is a bit old. - Langue
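Since the comments above note that do() is deprecated as of dplyr 1.0.0, a row-wise version without do() may be useful. This is a minimal sketch, assuming dplyr 1.0.0 or later (where c_across() is available); the name `res3` is chosen here for illustration and is not part of the original answer.

```r
library(dplyr)

nm1 <- c("hp", "drat", "wt")  # columns to summarise per row

res3 <- mtcars %>%
  rowwise() %>%                                   # treat every row as its own group
  mutate(Mean  = mean(c_across(all_of(nm1))),     # c_across() collects the row's values into one vector
         stdev = sd(c_across(all_of(nm1)))) %>%
  ungroup()                                       # drop the row-wise grouping again

head(res3, 3)
```

For large data frames the matrix-based rowMeans()/rowSds() approach is likely to be faster, since rowwise() evaluates the expressions once per row.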

You can also write your own vectorised RowSD function as in

RowSD <- function(x) {
  sqrt(rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1))
}

and then

mtcars %>% 
  mutate(mean = (hp + drat + wt)/3, stdev = RowSD(cbind(hp, drat, wt)))
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb      mean     stdev
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  38.84000  61.62969
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  38.92500  61.55489
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1  33.05667  51.91809
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1  38.76500  61.69136
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2  60.53000  99.13403
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1  37.07333  58.82726
## ...
Conflux answered 11/4, 2015 at 18:59 Comment(1)
Worked really well. - Syllabic
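One way to gain confidence in a hand-written row-wise SD like the one in this answer is to compare it against sd() applied to each row. The sketch below repeats the function (using ncol(x) in place of dim(x)[2], which is equivalent for a matrix) so that it runs on its own; the names `m`, `by_formula` and `by_apply` are illustrative. It relies on the fact that subtracting a vector of length nrow(x) from a matrix recycles down the columns, so each row has its own mean removed.

```r
# Sample standard deviation of every row, computed with vectorised matrix operations.
RowSD <- function(x) {
  sqrt(rowSums((x - rowMeans(x))^2) / (ncol(x) - 1))
}

m <- as.matrix(mtcars[, c("hp", "drat", "wt")])

by_formula <- RowSD(m)          # vectorised formula
by_apply   <- apply(m, 1, sd)   # reference: sd() called once per row

print(all.equal(by_formula, by_apply))
```

Like sd() with its default na.rm = FALSE, this version returns NA for any row that contains a missing value.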

Not much change needed, just add rowwise() (thanks @akrun for the comment) and wrap your column names in c(...) (to fix the error):

library(dplyr)
mtcars %>%
    rowwise() %>%
    mutate(mean=(hp+drat+wt)/3, stdev = sd(c(hp,drat,wt)))
## Source: local data frame [32 x 13]
## Groups: <by row>
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb     mean     stdev
## 1  21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4 38.84000  61.62969
## 2  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4 38.92500  61.55489
## 3  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 33.05667  51.91809
## 4  21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1 38.76500  61.69136
## 5  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 60.53000  99.13403
## 6  18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1 37.07333  58.82726
## 7  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 83.92667 139.49371
## 8  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 22.96000  33.81056
## 9  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 34.02333  52.80875
## 10 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 43.45333  68.88985
## ..  ... ...   ... ...  ...   ...   ... .. ..  ...  ...      ...       ...
Densmore answered 12/4, 2015 at 1:44 Comment(1)
Hi, using the same command gives me an identical value for sd in every row; the mean works fine. See the output below. - Penicillium
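The output mentioned in this comment is not shown, so the cause can only be guessed. One plausible explanation (an assumption, not something confirmed in the thread) is that the rowwise() step was dropped or not in effect: sd(c(hp, drat, wt)) then pools all three columns into one long vector and returns a single number that is recycled to every row, while the arithmetic mean is still computed per row. A minimal sketch of the difference, with illustrative names:

```r
library(dplyr)

# Without rowwise(): hp, drat and wt are whole columns, so sd() sees one
# vector of 3 * 32 values and returns a single number for every row.
no_rowwise <- mtcars %>%
  mutate(mean = (hp + drat + wt) / 3, stdev = sd(c(hp, drat, wt)))

# With rowwise(): hp, drat and wt are single values inside each row.
with_rowwise <- mtcars %>%
  rowwise() %>%
  mutate(mean = (hp + drat + wt) / 3, stdev = sd(c(hp, drat, wt))) %>%
  ungroup()

n_distinct(no_rowwise$stdev)     # one value repeated in all rows
n_distinct(with_rowwise$stdev)   # varies by row
```

If rowwise() is present and the values are still identical, checking which package's mutate() is being called (for example with dplyr::mutate) might be a reasonable next step.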
