Is there such "colsd" in R?
Asked Answered
F

8

29

I am using in my code colSums but I also need the standard deviation beside the sum. I searched in the internet and found this page which contain only:

colSums
colMeans

http://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html

I tried this:

colSd

but I got this error:

Error: could not find function "colSd"

How I can do the same thing but for standard deviation:

colSd

Here is the code:

results <- colSums(x,na.rm=TRUE)#### here I want colsd
Foret answered 9/7, 2013 at 13:42 Comment(4)
No, but if you have a data.frame, try sapply(x, sd) or more general, apply(x, 2, sd).Gratification
stat.ethz.ch/pipermail/r-help/2002-March/019606.htmlCranial
I also like the numcolwise function from the plyr package for this type of thing.Fite
Yes, it'd be nice to have such functions. colMeans and colSums are much faster than apply(X, 2, ...) counterparts...Valval
A
27

I want to provide a fourth (very similar to @Thomas) approach and some benchmarking:

library("microbenchmark")
library("matrixStats")

colSdApply <- function(x, ...)apply(X=x, MARGIN=2, FUN=sd, ...)
colSdMatrixStats <- colSds

colSdColMeans <- function(x, na.rm=TRUE) {
  if (na.rm) {
    n <- colSums(!is.na(x)) # thanks @flodel
  } else {
    n <- nrow(x)
  }
  colVar <- colMeans(x*x, na.rm=na.rm) - (colMeans(x, na.rm=na.rm))^2
  return(sqrt(colVar * n/(n-1)))
}

colSdThomas <- function(x)sqrt(rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1)))

m <- matrix(runif(1e7), nrow=1e3)

microbenchmark(colSdApply(m), colSdMatrixStats(m), colSdColMeans(m), colSdThomas(m))

# Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#        colSdApply(m) 435.7346 448.8673 456.6176 476.8373 512.9783   100
#  colSdMatrixStats(m) 344.6416 357.5439 383.8736 389.0258 465.5715   100
#     colSdColMeans(m) 124.2028 128.9016 132.9446 137.6254 172.6407   100
#       colSdThomas(m) 231.5567 240.3824 245.4072 274.6611 307.3806   100


all.equal(colSdApply(m), colSdMatrixStats(m))
# [1] TRUE
all.equal(colSdApply(m), colSdColMeans(m))
# [1] TRUE
all.equal(colSdApply(m), colSdThomas(m))
# [1] TRUE
Aracelyaraceous answered 9/7, 2013 at 15:0 Comment(5)
@sacvf: I missed the ... in colMeans (see my edit). Now colSdColMeans(x, na.rm=TRUE) should work.Aracelyaraceous
I think to deal with NA you will have to use something like n <- colSums(!is.na(x)).Amyloid
+1 for the benchmarking of different methods. This was useful and new for me.Shanklin
@Amyloid thanks, you are right, I forgot that! (I update my code and the benchmark results.)Aracelyaraceous
In the meantime the benchmark seems to be outdated and matrixStats::colSds 4-10x faster than the rest.Mores
B
7

colSds and rowSds are two of many similar functions in the matrixStats package

Broadleaf answered 9/7, 2013 at 13:57 Comment(2)
Those functions in that package aren't actually doing anything fancy; they are just as slow as apply(x, 2, sd).Sihonn
Not anymore. They are all written in C/C++ these daysNadabas
S
7

This is the quickest and shortest way to calculate the standard deviation of the columns:

sqrt(diag(cov(data_matrix)))

Since the diagonal of a co-variance matrix consists of the variances of each variable, we do the following:

  • Calculate the co-variance matrix using cov
  • Extract the diagonal of the matrix using diag
  • Take the square root of the diagonal values using sqrt in order to get the standard deviation

I hope that helps :)

Sterol answered 23/1, 2018 at 5:38 Comment(0)
S
5

Use the following:

colSd <- function (x, na.rm=FALSE) apply(X=x, MARGIN=2, FUN=sd, na.rm=na.rm)
Shanklin answered 9/7, 2013 at 13:44 Comment(5)
I would do function(x, ...) apply(X = x, ...)).Gratification
I debated that myself, but then sd uses only one optional, so I thought of including that directly.Shanklin
@sacvf, it doesn't matter how many comments you pour in with "my results are NA, any idea why?", we can't help unless we see the data you're having a problem with. You should direct some of your energy into making your question reproducible.Valval
I also tried the solution given by @Henry but I got the smae:Just NAForet
@savcf: I had disappeared for some time. You can't seriously ask something like 'what you meant if check na.rm'. You should really produce a reproducible example to get a correct solution.Shanklin
R
5

I believe I have found a more elegant solution in diag(sqrt(var(data)))

This worked for me to get the standard deviation of each of my columns. However, it does compute a bunch of extra unnecessary covariances (and their square roots) along the way, so it isn't necessarily the most efficient approach. But if your data is small, it works excellently.

EDIT: I just realized that sqrt(diag(var(data))) is probably a bit more efficient, since it drops the unnecessary covariance terms earlier.

Rizal answered 13/4, 2014 at 23:34 Comment(0)
Q
4

I don't know if these are particularly fast, but why not just use the formulae for SD:

x <- data.frame(y = rnorm(1000,0,1), z = rnorm(1000,2,3))

# If you have a population:
colsdpop <- function(x,...)
     sqrt(rowMeans((t(x)-colMeans(x,...))^2,...))
colsdpop(x)
sd(x$y); sd(x$z) # won't match `sd`

# If you have a sample:
colsdsamp <- function(x)
    sqrt( (rowMeans((t(x)-colMeans(x))^2)*((dim(x)[1])/(dim(x)[1]-1))) )
colsdsamp(x)
sd(x$y); sd(x$z) # will match `sd`

Note: the sample solution won't handle NAs well. One could incorporate something like apply(x,2,function(z) sum(!is.na(z))) into the right-most part of the formula to get an appropriate denominator, but it would get really murky quite quickly.

Queri answered 9/7, 2013 at 14:43 Comment(0)
W
3

I usually do column sd's with apply:

x <- data.frame(y = rnorm(20,0,1), z = rnorm(20,2,3))

> apply(x, 2, sd)
        y         z 
0.8022729 3.4700314 

Verify:

> sd(x$y)
[1] 0.8022729

> sd(x$z)
[1] 3.470031

You can also do it with dplyr easily:

library(dplyr)
library(magrittr) # for pipes

> x %>% summarize_all(.,sd)
          y        z
1 0.8022729 3.470031
Wolfish answered 10/10, 2019 at 3:59 Comment(0)
S
1

You can just use apply function

all.sd <- apply(data, 2,sd)

Stlouis answered 3/11, 2019 at 15:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.