I have a data frame with 10 columns that records the actions of "users". One of the columns (column 10) contains a non-unique ID identifying the user. The data frame has about 750,000 rows. I am trying to split it into individual data frames (i.e. get a list or vector of data frames) by the column containing the user identifier, to isolate the actions of a single actor.
ID | Data1 | Data2 | ... | UserID
1 | aaa | bbb | ... | u_001
2 | aab | bb2 | ... | u_001
3 | aac | bb3 | ... | u_001
4 | aad | bb4 | ... | u_002
resulting in
list(
ID | Data1 | Data2 | ... | UserID
1 | aaa | bbb | ... | u_001
2 | aab | bb2 | ... | u_001
3 | aac | bb3 | ... | u_001
,
4 | aad | bb4 | ... | u_002
...)
The following works very well for me on a small sample (1000 rows):
paths = by(smallsampleMat, smallsampleMat[,"userID"], function(x) x)
and then accessing the element I want by paths[1] for instance.
When I apply this to the original large data frame, or even to a matrix representation, it chokes my machine (4 GB RAM, Mac OS X 10.6, R 2.15) and never completes (I know that a newer R version exists, but I believe this is not the main problem).
It seems that split performs better and does complete after a long time, but with my limited R knowledge I do not know how to piece the resulting list of vectors back together into a vector of matrices.
path = split(smallsampleMat, smallsampleMat[,10])
I have also considered using big.matrix etc., but without much success in speeding up the process.
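To illustrate the "list of vectors" issue with split mentioned above, here is a minimal sketch on toy data. It assumes the vectors appear because split is being called on a matrix: split has no matrix method, so the dimensions are dropped, whereas on a data frame it returns one data frame per group. The names toy_df, toy_mat, by_user, by_user_vec and by_user_mat are hypothetical and only mirror the example table.

```r
# Hypothetical toy data mirroring the example table above
toy_df <- data.frame(
  ID     = 1:4,
  Data1  = c("aaa", "aab", "aac", "aad"),
  Data2  = c("bbb", "bb2", "bb3", "bb4"),
  UserID = c("u_001", "u_001", "u_001", "u_002"),
  stringsAsFactors = FALSE
)

# On a data frame, split() returns a named list with one data frame per user
by_user <- split(toy_df, toy_df$UserID)

# On a matrix, split() treats the input as a plain vector,
# so each list element is a vector without dimensions
toy_mat <- as.matrix(toy_df)
by_user_vec <- split(toy_mat, toy_mat[, "UserID"])

# Splitting the row indices instead keeps every piece a matrix
by_user_mat <- lapply(
  split(seq_len(nrow(toy_mat)), toy_mat[, "UserID"]),
  function(rows) toy_mat[rows, , drop = FALSE]
)
```

With this layout, a single user's rows can be pulled out by name, e.g. by_user[["u_001"]], or by position with by_user[[1]]. Whether this is fast enough on 750,000 rows is not something the sketch shows.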
dlply(df, .(userid)) and found that it is bad compared to split even without involving the run time of require(plyr), thank you and OP! – Bryonbryony