Do table slices take up memory in R?

I know this doesn't actually answer the main thrust of the question (@hadley has done that and deserves credit), but there are other options to those you suggest. Here you could use pmin() and pmax() as another solution, and using with() or within() we can do it without explicit subsetting to create a dd.

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> dat <- within(dat, mindepth <- pmin(depth1, depth2))
R> dat <- within(dat, maxdepth <- pmax(depth1, depth2))
R> 
R> dat
       depth1    depth2   mindepth  maxdepth
1  0.26550866 0.2059746 0.20597457 0.2655087
2  0.37212390 0.1765568 0.17655675 0.3721239
3  0.57285336 0.6870228 0.57285336 0.6870228
4  0.90820779 0.3841037 0.38410372 0.9082078
5  0.20168193 0.7698414 0.20168193 0.7698414
6  0.89838968 0.4976992 0.49769924 0.8983897
7  0.94467527 0.7176185 0.71761851 0.9446753
8  0.66079779 0.9919061 0.66079779 0.9919061
9  0.62911404 0.3800352 0.38003518 0.6291140
10 0.06178627 0.7774452 0.06178627 0.7774452

We can look at how much copying goes on with tracemem() but only if your R was compiled with the following configure option activated --enable-memory-profiling.

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x2641cd8>"
R> dat <- within(dat, mindepth <- pmin(depth1, depth2))
tracemem[0x2641cd8 -> 0x2641a00]: within.data.frame within 
tracemem[0x2641a00 -> 0x2641878]: [<-.data.frame [<- within.data.frame within 
R> tracemem(dat)
[1] "<0x2657bc8>"
R> dat <- within(dat, maxdepth <- pmax(depth1, depth2))
tracemem[0x2657bc8 -> 0x2c765d8]: within.data.frame within 
tracemem[0x2c765d8 -> 0x2c764b8]: [<-.data.frame [<- within.data.frame within

So we see that R copied dat twice during each within() call. Compare that with your two suggestions:

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x2e1ddd0>"
R> dd <- dat[,c("depth1","depth2")]
R> tracemem(dd)
[1] "<0x2df01a0>"
R> dat$mindepth = apply(dd,1,min)
tracemem[0x2df01a0 -> 0x2cf97d8]: as.matrix.data.frame as.matrix apply 
tracemem[0x2e1ddd0 -> 0x2cc0ab0]: 
tracemem[0x2cc0ab0 -> 0x2cc0b20]: $<-.data.frame $<- 
tracemem[0x2cc0b20 -> 0x2cc0bc8]: $<-.data.frame $<- 
R> tracemem(dat)
[1] "<0x26b93c8>"
R> dat$maxdepth = apply(dd,1,max)
tracemem[0x2df01a0 -> 0x2cc0e30]: as.matrix.data.frame as.matrix apply 
tracemem[0x26b93c8 -> 0x26742c8]: 
tracemem[0x26742c8 -> 0x2674358]: $<-.data.frame $<- 
tracemem[0x2674358 -> 0x2674478]: $<-.data.frame $<-

Here, dd is copied once in each call to apply because apply() converts dd to a matrix before proceeding. The final three lines in the each block of tracemem output indicates three copies of dat are being made to insert the new column.

What about your second option?

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x268bc88>"
R> dat$mindepth <- apply(dat[,c("depth1","depth2")],1,min)
tracemem[0x268bc88 -> 0x26376b0]: 
tracemem[0x26376b0 -> 0x2637720]: $<-.data.frame $<- 
tracemem[0x2637720 -> 0x2637790]: $<-.data.frame $<- 
R> tracemem(dat)
[1] "<0x2466d40>"
R> dat$maxdepth <- apply(dat[,c("depth1","depth2")],1,max)
tracemem[0x2466d40 -> 0x22ae0d8]: 
tracemem[0x22ae0d8 -> 0x22ae1f8]: $<-.data.frame $<- 
tracemem[0x22ae1f8 -> 0x22ae318]: $<-.data.frame $<-

Here this version avoids the copy involved in setting up dd, but in all other respects is similar to your previous suggestion.

Can we do any better? Yes, and one simple way is to use the within() option I started with but execute both statements to create new mindepth and maxdepth variables in the one call to within():

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x21c4158>"
R> dat <- within(dat, { mindepth <- pmin(depth1, depth2)
+                      maxdepth <- pmax(depth1, depth2) })
tracemem[0x21c4158 -> 0x21c44a0]: within.data.frame within 
tracemem[0x21c44a0 -> 0x21c4628]: [<-.data.frame [<- within.data.frame within

In this version we only invoke two copies of dat compared to the 4 copies of the original within() version.

What about if we coerce dat to a matrix and then do the insertions?

R> set.seed(1)
R> dat <- data.frame(depth1 = runif(10), depth2 = runif(10))
R> tracemem(dat)
[1] "<0x1f29c70>"
R> mat <- as.matrix.data.frame(dat)
tracemem[0x1f29c70 -> 0x1f09768]: as.matrix.data.frame 
R> tracemem(mat)
[1] "<0x245ff30>"
R> mat <- cbind(mat, pmin(mat[,1], mat[,2]), pmax(mat[,1], mat[,2]))
R>

That is an improvement as we only incur the cost of the single copy of dat when coercing to a matrix. I cheated a bit by calling the as.matrix.data.frame() method directly. If we'd just used as.matrix() we'd have incurred another copy of mat.

This highlights one of the reasons why matrices are so much faster to use than data frames.

Recommended topics

Hot tags