Saving a data frame as a binary file

I would like to save a whole bunch of relatively large data frames while minimizing the space that the files take up. When opening the files, I need to be able to control what names they are given in the workspace.

Basically, I'm looking for the semantics of dput and dget, but with binary files.

Example:

n<-10000

for(i in 1:100){
    dat<-data.frame(a=rep(c("Item 1","Item 2"),n/2),b=rnorm(n),
        c=rnorm(n),d=rnorm(n),e=rnorm(n))
    dput(dat,paste("data",i,sep=""))
}


##much later


##extract 3 random data sets and bind them
for(i in 1:10){
    nums<-sample(1:100,3)
    comb<-rbind(dget(paste("data",nums[1],sep="")),
            dget(paste("data",nums[2],sep="")),
            dget(paste("data",nums[3],sep="")))
    ##do stuff here
}
Iberia answered 28/10, 2009 at 4:59 Comment(0)

Your best bet is to use rda files. You can use the save() and load() commands to write and read:

set.seed(101)
a = data.frame(x1=runif(10), x2=runif(10), x3=runif(10))

save(a, file="test.rda")
load("test.rda")

Edit: For completeness, just to cover what Harlan's suggestion might look like (i.e. wrapping the load command to return the data frame):

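loadx <- function(name, file) {
  env <- new.env()
  load(file, envir = env)   # load into a private environment, not the workspace
  get(name, envir = env)    # return the object stored under `name`
}

# the object name is passed as a string; assign the result to any name you like
b <- loadx("a", "test.rda")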

Alternatively, have a look at the hdf5, RNetCDF and ncdf packages. I've experimented with the hdf5 package in the past; this uses the NCSA HDF5 library. It's very simple:

hdf5save(fileout, ...)
hdf5load(file, load = TRUE, verbosity = 0, tidy = FALSE)

A last option is to use binary file connections, but that won't work well in your case because readBin and writeBin only support vectors.

Here's a trivial example. First, open the connection for writing with "w", appending "b" for binary mode, and write some data:

zz <- file("testbin", "wb")
writeBin(1:10, zz)
close(zz)

Then open the connection for reading with "r", again appending "b", and read values back:

zz <- file("testbin", "rb")
readBin(zz, integer(), 4)   # reads the first 4 of the 10 integers written above
close(zz)
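
To illustrate why this is awkward for data frames, here is a minimal sketch that round-trips only the numeric columns of a small data frame, one vector at a time. The data frame, the file name `testbin_df`, and the use of `tempdir()` are assumptions for illustration; note that the reader has to know the column names, types, and row count in advance.

```r
# Sketch: round-trip the numeric columns of a data frame through a binary connection.
df <- data.frame(b = c(1.5, 2.5, 3.5), c = c(10, 20, 30))
path <- file.path(tempdir(), "testbin_df")  # hypothetical file name

con <- file(path, "wb")
for (col in names(df)) {
  writeBin(df[[col]], con)  # each column is written as a plain double vector
}
close(con)

# Reading back requires knowing the column names, types, and row count up front.
con <- file(path, "rb")
cols <- lapply(names(df), function(col) readBin(con, double(), n = nrow(df)))
close(con)

df2 <- as.data.frame(setNames(cols, names(df)))
```

Factor or character columns would need separate handling, which is why save() or saveRDS() is usually the simpler choice here.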
Workmanlike answered 28/10, 2009 at 11:46 Comment(5)
Nice answer, Shane. I'd like to use save(), but I don't like that I can't control the name of the data on loading. - Iberia
You could wrap the load() function in a new function that knows the name of the data in the file and renames it for a return value. The load function will insert the variables into the environment/namespace of the function. - Searby
You can do what Harlan suggested, or you can just save one data frame per file and give both the file and the data frame the same name. Then you will have the same behavior as what you described above with dput and dget, right? - Workmanlike
You have basically reinvented readRDS. - Rammer
You can pass compress="bzip2" or compress="xz" to save() to use a more efficient compression algorithm; the default is gzip. The new command would be save(a, file="test.rda", compress="xz"). - Credence
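
As a rough sketch of the compression suggestion, the following saves the same data frame once per compression type and compares the resulting file sizes. The file names, the use of `tempdir()`, and the data size are assumptions for illustration; which type produces the smallest file depends on the data.

```r
# Sketch: save the same data frame with each compression type and compare sizes.
set.seed(101)
a <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000))

types <- c("gzip", "bzip2", "xz")
files <- file.path(tempdir(), paste0("test_", types, ".rda"))  # hypothetical names

for (j in seq_along(types)) {
  save(a, file = files[j], compress = types[j])
}

# File sizes in bytes, labelled by compression type.
sizes <- setNames(file.size(files), types)
print(sizes)
```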

Have a look at saveRDS and readRDS. They serialize a single R object to a file, and readRDS returns that object, so you choose its name when you assign the result.

x = data.frame(x1=runif(10), x2=runif(10), x3=runif(10))

saveRDS(x, file="myDataFile.rds")
x <- readRDS(file="myDataFile.rds")
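
Applied to the loop in the question, this might look like the following sketch. The file names (`data1.rds` and so on), the temporary directory, and the reduced sizes are assumptions for illustration; `compress = "xz"` is optional and trades write speed for smaller files.

```r
# Sketch: saveRDS/readRDS as binary stand-ins for dput/dget.
n <- 1000
out_dir <- tempdir()  # assumption: any writable directory works

for (i in 1:10) {
  dat <- data.frame(a = rep(c("Item 1", "Item 2"), n / 2),
                    b = rnorm(n), c = rnorm(n))
  saveRDS(dat, file.path(out_dir, paste0("data", i, ".rds")), compress = "xz")
}

# Much later: read three random data sets and bind them under a name of your choosing.
nums <- sample(1:10, 3)
comb <- do.call(rbind, lapply(nums, function(k) {
  readRDS(file.path(out_dir, paste0("data", k, ".rds")))
}))
```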
Zoography answered 28/10, 2009 at 23:54 Comment(4)
Out of curiosity: why would someone use these over save/load? Is there some particular benefit? - Workmanlike
As of R 2.13 they are no longer internal. You use them when you want to save a single object, not multiple objects as with save(). - Rammer
I get: Error: could not find function "readRDS", same for saveRDS. What library needs to be loaded? - Keith
mohawkjohn - they are part of base R, no need to load anything in order to use them. - Roswald
