What is the best format in which to save data frames to disc in R for storage?

1

9

What is the best format to persist simple data frames to disc in R for storage while limiting semantic loss?

I ask because I'm archiving a data set. In an ideal world, my data format would have the following characteristics:

  1. Stability - the storage format will be compatible with future versions of R
  2. Semantic compatibility - the storage format will understand the semantics of R's primitive data types. For example, it will be able to store ordered factors with labels in a sensible manner.
  3. Open standard - ideally, the format will be an open standard so other statistics packages (now or in the future) will be able to understand it

My first thought was to use CSV, which is very stable but lacks the semantic richness required. On the other hand, R's built-in RData format completely captures R's semantics, but seems likely to change between releases (correct me if I'm wrong).
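
To illustrate the kind of semantic loss meant here, a minimal sketch (the data frame `df` and its columns are hypothetical, made up only for this example): an ordered factor written to CSV comes back without its ordering, and with its levels re-sorted alphabetically.

```r
# Hypothetical example data: one ordered factor and one numeric column
df <- data.frame(
  size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE),
  weight = c(1.5, 9.0, 4.2)
)

# Round-trip through CSV using a temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_csv <- read.csv(csv_path, stringsAsFactors = TRUE)

is.ordered(df$size)      # TRUE: the original column is an ordered factor
is.ordered(df_csv$size)  # FALSE: the ordering is not stored in the CSV
levels(df_csv$size)      # levels are rebuilt from the values, not from the original order
```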

Is there another format that finds a balance between these three imperatives?

Provence answered 9/3, 2013 at 6:43 Comment(3)
Will your data be opened or manipulated by a program other than R? Also, ?save mentions that any recent version of R can read compressed save files, so I doubt that the .RData format will change between releases. - Uro
Perhaps use XML or JSON. - Lindahl
I think YAML is a good alternative; see the package yaml. It can handle R's basic data types (e.g. named lists, vectors, ...) and is human-readable (more so than XML, in my opinion). - Brosine
4

Dump it to a text file with dput. That way you get all the structure of R's objects, and it's in a text-based form that, should R stop existing, can be parsed fairly easily.

It probably doesn't pass (3), your 'open standard' test.
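
A minimal sketch of the dput approach, assuming a small made-up data frame (`df` and its columns are hypothetical). dput writes the object as R source text, and dget reads it back:

```r
# Hypothetical example data: one ordered factor and one numeric column
df <- data.frame(
  size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE),
  weight = c(1.5, 9.0, 4.2)
)

# Write the object as plain R source text to a temporary file
txt_path <- tempfile(fileext = ".R")
dput(df, file = txt_path)

# The file is human-readable text describing the whole structure
cat(readLines(txt_path), sep = "\n")

# Read it back and check that the factor semantics survived
df_back <- dget(txt_path)
is.ordered(df_back$size)
levels(df_back$size)
```

Note that with its default settings dput writes numbers to a limited number of significant digits, so it is worth checking the round trip on your own data before relying on it for archiving.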

R is pretty good for backward compatibility with its .RData format, so even if the files written by the latest R aren't the same as older ones, the latest R will still read old files. However, if R should stop existing, reverse-engineering of the binary format is orders of magnitude harder than grokking the output from dput.
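
For comparison, a minimal sketch of the binary route with save and load, again with a hypothetical data frame `df`. Loading into a separate environment is just one way to avoid overwriting objects in the workspace:

```r
# Hypothetical example data: one ordered factor and one numeric column
df <- data.frame(
  size = factor(c("small", "large", "medium"),
                levels = c("small", "medium", "large"),
                ordered = TRUE),
  weight = c(1.5, 9.0, 4.2)
)

# Save to a temporary .RData file
rdata_path <- tempfile(fileext = ".RData")
save(df, file = rdata_path)

# Load into a fresh environment so nothing in the workspace is overwritten
restored <- new.env()
load(rdata_path, envir = restored)

# The restored object keeps the ordered factor and its levels
identical(df, restored$df)
is.ordered(restored$df$size)
```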

Graphitize answered 9/3, 2013 at 12:18 Comment(1)
Note that the R documentation for dput specifically says that it's not a good format for saving data between sessions: "[dput] is not a good way to transfer objects between R sessions. dump is better, but the function save is designed to be used for transporting R data, and will work with R objects that dput does not handle correctly as well as being much faster." rdocumentation.org/packages/base/versions/3.5.0/topics/dput - Universalist
