R: selecting subset without copying

Is there a way to select a subset from objects (data frames, matrices, vectors) without making a copy of selected data?

I work with quite large data sets but never change them. For convenience, however, I often select subsets of the data to operate on. Making a copy of a large subset each time is very memory-inefficient, but both normal indexing and subset() (and thus the xapply() family of functions) create copies of the selected data. So I'm looking for functions or data structures that can overcome this issue.

Some possible approaches that may fit my needs and hopefully are implemented in some R packages:

  • copy-on-write mechanism, i.e. data structures that are copied only when you add or rewrite existing elements;
  • immutable data structures, which only require recreating the indexing information for the data structure, not its content (like taking a substring by creating only a small object that holds a length and a pointer into the same char array);
  • xapply() analogues that do not create subsets.
Osmo answered 5/3, 2012 at 19:55 Comment(6)
I think you should look at the data.table package (someone will presumably show up here shortly to give you more details ...) - Fidelia
The database interfaces are clearly what you should be investigating. Pretty much everything in R is pass-by-promise, which becomes effectively pass-by-value the moment anything needs to be done to the subset. - Tran
@BenBolker: thanks, data.table seems to be a nice package, but unfortunately it doesn't fit my needs in most cases. In particular, data.table has a different indexing model, which makes it much harder (and slower) to perform selections like data[1:50, 1:10] (i.e. selection by both row and column) and many linear algebra operations. I was thinking of using matrices instead of data frames to save both space and time, but matrices have their limitations too, so I'm looking for alternative options as well. - Osmo
@DWin: regarding "pass-by-value": do you mean that R uses lazy evaluation? That doesn't match what I see: DF <- data[1:10000, ] takes about 30 seconds, which is much longer than is needed to create a promise object. It would also mean that data structures have to be permanent so as not to break language semantics, but they are not. Can you explain, please? I am definitely missing something. (Let me know if this is worth posting as a separate question.) - Osmo
Not sure what you mean by permanent. It's perfectly possible to create an object inside a function's environment and have it automatically become garbage-collectible on exit. My understanding is that R does use lazy evaluation. You can use the delayedAssign and force functions if you want some control over this process. Most of us do not think much about it (until it bites us during function evaluations). - Tran
@DWin: permanent data structures are those that can't be modified in place. If you want to update one, you have to create a modified copy (which normally shares most elements with the original data structure). Thanks for the lazy evaluation pointers. I come from functional programming, and things like promises and force make programming in R even more of a pleasure! - Osmo
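
For reference, the lazy evaluation discussed in the comments can be seen in a minimal base R sketch. The names big, evaluated and f are made up for illustration. Note that an ordinary assignment such as DF <- data[1:10000, ] evaluates its right-hand side immediately, so laziness of this kind does not by itself avoid the copy.

```r
# A promise created with delayedAssign() is evaluated on first use only.
evaluated <- FALSE
delayedAssign("big", { evaluated <- TRUE; 1:10 })

before <- evaluated   # FALSE: nothing has been computed yet
n <- length(big)      # first use of big forces the promise
after <- evaluated    # TRUE: the expression has now been run

# Function arguments are promises too: an unused argument is never evaluated.
f <- function(x, use_it) if (use_it) x else "x never evaluated"
f(stop("boom"), use_it = FALSE)   # returns normally, stop() is never called
```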

Try the ref package, specifically its refdata class.

What you might be missing about data.table is that when grouping (the by= parameter), the subsets of data are not copied, so that's fast. [Well, technically they are, but into a shared area of memory that is reused for each group, and they are copied using memcpy, which is much faster than R's for loops in C.]
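
As a small illustration of grouping with by=, here is a sketch that assumes the data.table package is installed. The table DT and its columns g and v are invented for the example. The point is that per-group work is written once in j, rather than by splitting the data into separate subset objects first.

```r
library(data.table)

# Toy table: a grouping column g and a value column v.
DT <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

# One pass over the groups; j is evaluated once per group of g.
res <- DT[, .(total = sum(v), n = .N), by = g]
res
```

Compare this with split(DF, DF$g) on a data.frame, which materializes every group as its own object before any work is done.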

:= in data.table is one way to modify a data.table in place. data.table departs from the usual R programming style in that it is not copy-on-write. The user has to call copy() explicitly to copy a (potentially very large) table, even within a function.
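
A minimal sketch of that behaviour, assuming data.table is installed (the table and column names are made up for illustration):

```r
library(data.table)

DT <- data.table(id = 1:5, x = c(10, 20, 30, 40, 50))

# Plain assignment does not copy: DT2 is another name for the same table.
DT2 <- DT
DT2[, y := x * 2]        # adds column y by reference
"y" %in% names(DT)       # TRUE: DT sees the new column as well

# copy() creates an independent table, so := on it leaves DT alone.
DT3 <- copy(DT)
DT3[, z := 0]
"z" %in% names(DT)       # FALSE
```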

You're right that there isn't a mechanism like refdata built into data.table. I see what you mean and it would be a nice feature. refdata should work on a data.table, though, and you might be fine with data.frame (but be sure to monitor copies with tracemem(DF)).
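
Monitoring copies with tracemem() might look like the sketch below. It uses base R only; DF and DF2 are illustrative names, and tracemem() is only available when R was built with memory profiling, hence the capabilities() check.

```r
DF <- data.frame(a = 1:5, b = letters[1:5])

traced <- capabilities("profmem")   # TRUE if this R build supports tracemem()
if (traced) tracemem(DF)            # start reporting duplications of DF

DF2 <- DF            # no report expected: both names share one object
DF2$a[1] <- 100L     # modifying DF2 should trigger a tracemem report

if (traced) untracemem(DF)

DF$a[1]              # the original is unchanged
```

Each tracemem line printed during a run marks a point where R duplicated the traced object, which is a quick way to find out which operations on a large data.frame are the expensive ones.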

You could also try idata.frame (an immutable data.frame) from the plyr package.

Theomorphic answered 6/3, 2012 at 13:5 Comment(0)
