How about :
> DT[, mean(b), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
cut V1
1: NA -0.31359818
2: (-1.05,-0.614] -0.14103182
3: (-0.614,-0.375] -0.33474492
4: (-0.375,-0.0767] 0.20827735
5: (-0.0767,0.114] 0.14890251
6: (0.114,0.377] 0.16685304
7: (0.377,0.581] 0.07086979
8: (0.581,0.771] 0.17950572
9: (0.771,1.18] -0.04951607
To have a look at that NA (and to check the results anyway), I next did :
> DT[, list(mean(b),.N,list(a)), keyby=cut(a,quantile(a,probs=seq(.1,.9,.1)))]
cut V1 N V3
1: NA -0.31359818 20 1.59528080213779,1.51178116845085,-2.2146998871775,-1.98935169586337,-1.47075238389927,1.35867955152904,
2: (-1.05,-0.614] -0.14103182 10 -0.626453810742332,-0.835628612410047,-0.820468384118015,-0.621240580541804,-0.68875569454952,-0.70749515696212,
3: (-0.614,-0.375] -0.33474492 10 -0.47815005510862,-0.41499456329968,-0.394289953710349,-0.612026393250771,-0.443291873218433,-0.589520946188072,
4: (-0.375,-0.0767] 0.20827735 10 -0.305388387156356,-0.155795506705329,-0.102787727342996,-0.164523596253587,-0.253361680136508,-0.112346212150228,
5: (-0.0767,0.114] 0.14890251 10 -0.0449336090152309,-0.0161902630989461,0.0745649833651906,-0.0561287395290008,-0.0538050405829051,-0.0593133967111857,
6: (0.114,0.377] 0.16685304 10 0.183643324222082,0.329507771815361,0.36458196213683,0.341119691424425,0.188792299514343,0.153253338211898,
7: (0.377,0.581] 0.07086979 10 0.487429052428485,0.575781351653492,0.389843236411431,0.417941560199702,0.387671611559369,0.556663198673657,
8: (0.581,0.771] 0.17950572 10 0.738324705129217,0.593901321217509,0.61982574789471,0.763175748457544,0.696963375404737,0.768532924515416,
9: (0.771,1.18] -0.04951607 10 1.12493091814311,0.943836210685299,0.821221195098089,0.918977371608218,0.782136300731067,1.10002537198388,
Aside: I've returned a list
column (each cell is itself a vector) there to have a quick look at the values going into the bins, just to check. data.table
displays commas when printing (and shows just the first 6 items per cell), but each cell of V3
there is actually a numeric vector.
So the values outside the first and last break
are being coded together as NA. It's not obvious to me how to tell cut
not to do that. So I just added -Inf and +Inf :
> DT[,list(mean(b),.N),keyby=cut(a,c(-Inf,quantile(a,probs=seq(.1,.9,.1)),+Inf))]
cut V1 N
1: (-Inf,-1.05] -0.16938368 10
2: (-1.05,-0.614] -0.14103182 10
3: (-0.614,-0.375] -0.33474492 10
4: (-0.375,-0.0767] 0.20827735 10
5: (-0.0767,0.114] 0.14890251 10
6: (0.114,0.377] 0.16685304 10
7: (0.377,0.581] 0.07086979 10
8: (0.581,0.771] 0.17950572 10
9: (0.771,1.18] -0.04951607 10
10: (1.18, Inf] -0.45781268 10
That's better. Or alternatively :
> DT[, list(mean(b),.N), keyby=cut(a,quantile(a,probs=seq(0,1,.1)),include=TRUE)]
cut V1 N
1: [-2.21,-1.05] -0.16938368 10
2: (-1.05,-0.614] -0.14103182 10
3: (-0.614,-0.375] -0.33474492 10
4: (-0.375,-0.0767] 0.20827735 10
5: (-0.0767,0.114] 0.14890251 10
6: (0.114,0.377] 0.16685304 10
7: (0.377,0.581] 0.07086979 10
8: (0.581,0.771] 0.17950572 10
9: (0.771,1.18] -0.04951607 10
10: (1.18,2.4] -0.45781268 10
That way you see what the min and max is, rather than it displaying -Inf and +Inf. Notice you need to pass include=TRUE
to cut
otherwise 11 bins will be returned with only 1 in the first.