Summary statistics using apply family for different factor levels

I am trying to find the summary statistics for different factor levels.

data.frame(apply(final_data[final_data$Company == "BPO", 66:84], 2, summary))

The Company column has several other values, and I could repeat the statement for each one. I know this can be automated with the apply family or similar functions (ddply, tapply, sapply), but I am not getting it right.

Kissee answered 19/12, 2013 at 18:29 Comment(0)

You could split on company and then use your function:

spl = split(final_data, final_data$Company)
list.of.summaries = lapply(spl, function(x) data.frame(apply(x[,66:84], 2, summary)))
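
To see what this returns, here is a minimal sketch on made-up data. The data frame `toy`, its column names, and the company labels are hypothetical stand-ins for `final_data`, and columns 2:3 play the role of 66:84.

```r
# Hypothetical stand-in for final_data: one factor column, two numeric columns
set.seed(42)
toy <- data.frame(
  Company = factor(rep(c("BPO", "IT", "Retail"), each = 4)),
  score1  = rnorm(12),
  score2  = runif(12)
)

# Split the rows by factor level; the result is a named list of data frames
spl <- split(toy, toy$Company)

# Summarise the numeric columns within each piece
list.of.summaries <- lapply(spl, function(x) data.frame(apply(x[, 2:3], 2, summary)))

# Each list element is named after a factor level
names(list.of.summaries)
list.of.summaries[["BPO"]]
```

Each element should be a small data frame with one row per statistic (Min., 1st Qu., and so on) and one column per variable, so a single company's summary can be pulled out by name.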

Doggoned answered 19/12, 2013 at 18:35 Comment(2)

Thanks. I got correlation to work with the following: by(final_data[,c(66:85)],Company,function(x) cor(x)) – Kissee 19/12, 2013 at 19:1

Sure, or: list.of.cor = lapply(spl, function(x) cor(x[,66:84])) – Doggoned 19/12, 2013 at 19:3

You may want to think about using the by or tapply functions. This will allow you to skip the explicit call to split. Here's an example, since you haven't provided data.

# some example data
set.seed(1)
df <- data.frame(x = as.factor(rep(1:5, each=10)), y1=rnorm(50), y2=rnorm(50))

# with `tapply`
a <- do.call(rbind, sapply(df[,2:3], function(i) tapply(i, df$x, summary)))
# with `by`
a <- do.call(rbind, sapply(df[,2:3], function(i) by(i, df$x, summary)))

Here's the output:

> a
         Min.  1st Qu.    Median    Mean 3rd Qu.   Max.
 [1,] -0.8356 -0.54620  0.256600  0.1322  0.5537 1.5950
 [2,] -2.2150 -0.03775  0.491900  0.2488  0.9132 1.5120
 [3,] -1.9890 -0.39760  0.009218 -0.1337  0.5694 0.9190
 [4,] -1.3770 -0.32140 -0.056560  0.1207  0.6693 1.3590
 [5,] -0.7075 -0.23120  0.126100  0.1341  0.6619 0.8811
 [6,] -1.1290 -0.55080  0.103000  0.1435  0.5268 1.9800
 [7,] -1.8050 -0.02243  0.171000  0.4512  1.2720 2.4020
 [8,] -1.2540 -0.67980 -0.221100 -0.2477  0.2372 0.6107
 [9,] -1.5240 -0.26190  0.300000  0.1274  0.5380 1.1780
[10,] -1.2770 -0.56560  0.042540  0.1123  1.0450 1.5870

You might also want to combine this with the variable and level names to know what's going on:

b <- expand.grid(level=levels(df$x),var=names(df[,2:3]))
cbind(b,a)

Here's the output of that:

> cbind(b,a)
   level var    Min.  1st Qu.    Median    Mean 3rd Qu.   Max.
1      1  y1 -0.8356 -0.54620  0.256600  0.1322  0.5537 1.5950
2      2  y1 -2.2150 -0.03775  0.491900  0.2488  0.9132 1.5120
3      3  y1 -1.9890 -0.39760  0.009218 -0.1337  0.5694 0.9190
4      4  y1 -1.3770 -0.32140 -0.056560  0.1207  0.6693 1.3590
5      5  y1 -0.7075 -0.23120  0.126100  0.1341  0.6619 0.8811
6      1  y2 -1.1290 -0.55080  0.103000  0.1435  0.5268 1.9800
7      2  y2 -1.8050 -0.02243  0.171000  0.4512  1.2720 2.4020
8      3  y2 -1.2540 -0.67980 -0.221100 -0.2477  0.2372 0.6107
9      4  y2 -1.5240 -0.26190  0.300000  0.1274  0.5380 1.1780
10     5  y2 -1.2770 -0.56560  0.042540  0.1123  1.0450 1.5870
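
As a possible alternative (a sketch, not part of the approach above), base R's `aggregate` can keep the level labels attached to the summaries in a single call. The example `df` is rebuilt here so the block runs on its own; the names `agg` and `flat` are just illustrative.

```r
# Same example data as above
set.seed(1)
df <- data.frame(x = as.factor(rep(1:5, each = 10)), y1 = rnorm(50), y2 = rnorm(50))

# One row per level of x; y1 and y2 each become a matrix column
# holding the six summary statistics
agg <- aggregate(cbind(y1, y2) ~ x, data = df, FUN = summary)

# Flatten the matrix columns into ordinary columns,
# each prefixed with the variable name
flat <- do.call(data.frame, agg)
flat
```

This gives one row per level rather than one row per level and variable, so it may suit cases where the result is going to be merged back onto other per-level data.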

Crude answered 22/12, 2013 at 15:27 Comment(0)
