Summary statistics using apply family for different factor levels

I am trying to find the summary statistics for different factor levels.

data.frame(apply(final_data[Company=="BPO",c(66:84)],2,summary))  

Company takes several different values, and I could repeat this statement for each one. I know it can be automated using the apply family (tapply, sapply) or ddply, but I am not getting it right.

Kissee answered 19/12, 2013 at 18:29 Comment(0)

You could split on company and then use your function:

spl = split(final_data, final_data$Company)
list.of.summaries = lapply(spl, function(x) data.frame(apply(x[,66:84], 2, summary)))
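
As a self-contained illustration of the same split-then-lapply pattern, here is a minimal sketch on made-up data. The data frame `toy` and its columns `score1` and `score2` are hypothetical stand-ins for `final_data` and its columns 66:84.

```r
# Hypothetical stand-in for final_data: one grouping column, two numeric columns
set.seed(42)
toy <- data.frame(
  Company = rep(c("BPO", "IT", "Retail"), each = 8),
  score1  = rnorm(24),
  score2  = runif(24)
)

# One data frame per company
spl <- split(toy, toy$Company)

# Summarise the numeric columns within each piece
# (columns 2:3 here play the role of 66:84 in the question)
list.of.summaries <- lapply(spl, function(x) data.frame(apply(x[, 2:3], 2, summary)))

names(list.of.summaries)     # one element per company
list.of.summaries[["BPO"]]   # rows are Min., 1st Qu., Median, Mean, 3rd Qu., Max.
```

Each element of the list is a small data frame with one column per variable, so a single company can be pulled out by name.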
Doggoned answered 19/12, 2013 at 18:35 Comment(2)
Thanks. I got correlation to work with the following: by(final_data[,c(66:85)],Company,function(x) cor(x)) - Kissee
Sure, or: list.of.cor = lapply(spl, function(x) cor(x[,66:84])) - Doggoned
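
The two per-group correlation approaches from these comments can be compared on made-up data. This is only a sketch: the data frame `toy` and its columns `a` and `b` are hypothetical.

```r
# Hypothetical data: two companies, two numeric columns
set.seed(7)
toy <- data.frame(
  Company = rep(c("BPO", "IT"), each = 10),
  a = rnorm(20),
  b = rnorm(20)
)

# Approach 1: by() applies cor to the rows belonging to each company
cor.by <- by(toy[, c("a", "b")], toy$Company, cor)

# Approach 2: split the numeric columns by company, then lapply cor
cor.split <- lapply(split(toy[, c("a", "b")], toy$Company), cor)

# Both give one 2 x 2 correlation matrix per company
all.equal(cor.by[["BPO"]], cor.split[["BPO"]])
```

The `by` version prints with a header for each group, while the `lapply` version returns a plain named list that is easier to process further.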

You may want to think about using the by or tapply functions. This will allow you to skip the explicit call to split. Here's an example, since you haven't provided data.

# some example data
set.seed(1)
df <- data.frame(x = as.factor(rep(1:5, each=10)), y1=rnorm(50), y2=rnorm(50))

# with `tapply`
a <- do.call(rbind, sapply(df[,2:3], function(i) tapply(i, df$x, summary)))
# with `by`
a <- do.call(rbind, sapply(df[,2:3], function(i) by(i, df$x, summary)))

Here's the output:

> a
         Min.  1st Qu.    Median    Mean 3rd Qu.   Max.
 [1,] -0.8356 -0.54620  0.256600  0.1322  0.5537 1.5950
 [2,] -2.2150 -0.03775  0.491900  0.2488  0.9132 1.5120
 [3,] -1.9890 -0.39760  0.009218 -0.1337  0.5694 0.9190
 [4,] -1.3770 -0.32140 -0.056560  0.1207  0.6693 1.3590
 [5,] -0.7075 -0.23120  0.126100  0.1341  0.6619 0.8811
 [6,] -1.1290 -0.55080  0.103000  0.1435  0.5268 1.9800
 [7,] -1.8050 -0.02243  0.171000  0.4512  1.2720 2.4020
 [8,] -1.2540 -0.67980 -0.221100 -0.2477  0.2372 0.6107
 [9,] -1.5240 -0.26190  0.300000  0.1274  0.5380 1.1780
[10,] -1.2770 -0.56560  0.042540  0.1123  1.0450 1.5870

You might also want to combine this with the variable and level names to know what's going on:

b <- expand.grid(level=levels(df$x),var=names(df[,2:3]))
cbind(b,a)

Here's the output of that:

> cbind(b,a)
   level var    Min.  1st Qu.    Median    Mean 3rd Qu.   Max.
1      1  y1 -0.8356 -0.54620  0.256600  0.1322  0.5537 1.5950
2      2  y1 -2.2150 -0.03775  0.491900  0.2488  0.9132 1.5120
3      3  y1 -1.9890 -0.39760  0.009218 -0.1337  0.5694 0.9190
4      4  y1 -1.3770 -0.32140 -0.056560  0.1207  0.6693 1.3590
5      5  y1 -0.7075 -0.23120  0.126100  0.1341  0.6619 0.8811
6      1  y2 -1.1290 -0.55080  0.103000  0.1435  0.5268 1.9800
7      2  y2 -1.8050 -0.02243  0.171000  0.4512  1.2720 2.4020
8      3  y2 -1.2540 -0.67980 -0.221100 -0.2477  0.2372 0.6107
9      4  y2 -1.5240 -0.26190  0.300000  0.1274  0.5380 1.1780
10     5  y2 -1.2770 -0.56560  0.042540  0.1123  1.0450 1.5870
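
The `expand.grid` labels line up only because the rows of `a` happen to run through all levels of the first variable, then all levels of the second. As a possible alternative, the sketch below attaches the labels while each block is built, so they cannot drift from the row order. The names `pieces` and `labelled` are illustrative and not part of the answer above.

```r
# Same example data as above
set.seed(1)
df <- data.frame(x = as.factor(rep(1:5, each = 10)), y1 = rnorm(50), y2 = rnorm(50))

# For each variable, stack the per-level summaries and label them on the spot
pieces <- lapply(names(df)[2:3], function(v) {
  # 5 x 6 matrix; row names are the factor levels
  s <- do.call(rbind, tapply(df[[v]], df$x, summary))
  data.frame(level = rownames(s), var = v, s, check.names = FALSE)
})

# Combine the per-variable blocks and drop the duplicated row names
labelled <- do.call(rbind, pieces)
rownames(labelled) <- NULL
labelled
```

Setting `check.names = FALSE` keeps column names such as `1st Qu.` unchanged instead of converting them to syntactic names.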
Crude answered 22/12, 2013 at 15:27 Comment(0)
