data.table: Using with=False and transforming function/summary function?
Asked Answered
C

2

7

I want to summarise several variables in data.table, output in wide format, output possibly as a list per variable. Since several other approaches did not work, I tried to do an outer lapply, giving the names of the variables as character vectors. I wanted to pass these in, using with=FALSE.

carsx=as.data.table(cars)
lapply( list(speed="speed",dist= "dist"), #error object 'ansvals' not found
    function(x)  carsx[,list(mean(x), min(x), max(x) ), with=FALSE ] ) 

Since this does not work, I tried the more simple approach without lapply.

carsx[,list(mean("speed"), min("speed"), max("speed") ), with=FALSE ] #error object 'ansvals' not found

This does not work either. Is there any way to do something like this? Is this behaviour of 'with' wanted? (I am aware that ?data.table mentions with only to select columns, but in my case it would be useful to be able to transform them as well)

When with=FALSE, j is a vector of names or positions to select, similar to a data.frame. with=FALSE is often useful in data.table to select columns dynamically.

EDIT My aim is to get a summary per group in wide format, for different variables. I tried to extend the following, which works only for one variable, for a list of variables.

carsx[,list(mean(speed), min(speed), max(speed) ) ,by=(dist>50)

Lamentably SO doesnt let me post my other question. There I described that I want an output similiar to:

lapply( list(speed="speed",dist= "dist"),
        function(x) do.call("as.data.frame", aggregate(cars[,x], list(class=cars$dist>50), FUN=summary) ) )

Expected Output would be something like:

$speed 
         V1       V2 V3
1: FALSE 12.96970  4 20
2:  TRUE 20.11765 14 25

$dist
         V1       V2 V3
1: FALSE 12.96970  4 20
2:  TRUE 20.11765 14 25
Confounded answered 10/11, 2014 at 12:50 Comment(2)
I want to get summary statistics in a wide format, possibly in a list per (grouped) variable. For that I tried the outer lapply, and therefore Svens' answer is also not the right one for me...Confounded
I group by (dist>50)Confounded
C
2

Building on Svens answer a combination of .SDcols, rbindlist, and outer and inner lapply did the trick. The inner lapply is necessary to access .SD.

lapply( list(speed="speed",dist= "dist"),
    function(x)  carsx[ , rbindlist(lapply(.SD, function(x) list(mean=mean(x), min=min(x), max=max(x)) )), 
                       .SDcols = x,by= (dist>50)] ) 

Result:

$speed
    dist     mean min max
1: FALSE 12.96970   4  20
2:  TRUE 20.11765  14  25

$dist
    dist     mean min max
1: FALSE 27.84848   2  50
2:  TRUE 72.35294  52 120
Confounded answered 11/11, 2014 at 8:25 Comment(0)
S
6

You can specify the columns with the .SDcols parameter:

carsx[ , lapply(.SD, function(x) c(mean(x), min(x), max(x))), 
      .SDcols = c("speed", "dist")]
#    speed   dist
# 1:  15.4  42.98
# 2:   4.0   2.00
# 3:  25.0 120.00

carsx[ , lapply(.SD, function(x) c(mean(x), min(x), max(x))), 
      .SDcols = "speed"]
#    speed
# 1:  15.4
# 2:   4.0
# 3:  25.0
Stannic answered 10/11, 2014 at 13:6 Comment(0)
C
2

Building on Svens answer a combination of .SDcols, rbindlist, and outer and inner lapply did the trick. The inner lapply is necessary to access .SD.

lapply( list(speed="speed",dist= "dist"),
    function(x)  carsx[ , rbindlist(lapply(.SD, function(x) list(mean=mean(x), min=min(x), max=max(x)) )), 
                       .SDcols = x,by= (dist>50)] ) 

Result:

$speed
    dist     mean min max
1: FALSE 12.96970   4  20
2:  TRUE 20.11765  14  25

$dist
    dist     mean min max
1: FALSE 27.84848   2  50
2:  TRUE 72.35294  52 120
Confounded answered 11/11, 2014 at 8:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.