sas7bdat worked fine for all but one of the files I was looking at (specifically, this one). When I reported the error to the sas7bdat developer, Matthew Shotwell, he pointed me in the direction of Hadley's haven package in R, which has its own read_sas function.
This function is superior for two reasons:
1) It didn't have any trouble reading the above-linked file
2) It is much (I'm talking much) faster than read.sas7bdat
Here's a quick benchmark (on this file, which is smaller than the others) for evidence:
library(microbenchmark)
library(sas7bdat)
library(haven)
microbenchmark(times = 10L,
               read.sas7bdat("psu97ai.sas7bdat"),
               read_sas("psu97ai.sas7bdat"))
Unit: milliseconds
                              expr        min         lq       mean     median         uq        max neval cld
 read.sas7bdat("psu97ai.sas7bdat") 66696.2955 67587.7061 71939.7025 68331.9600 77225.1979 82836.8152    10   b
      read_sas("psu97ai.sas7bdat")   397.9955   402.2627   410.4015   408.5038   418.1059   425.2762    10  a
That's right: haven::read_sas takes (on average) about 99.4% less time than sas7bdat::read.sas7bdat, which makes it roughly 175 times faster.
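As a quick sanity check on that figure, the relative saving can be recomputed from the mean column of the benchmark output above. This is just arithmetic on the two reported means; the variable names are my own.

```r
# Mean timings (milliseconds) copied from the benchmark output above
mean_sas7bdat <- 71939.7025
mean_haven <- 410.4015

# Share of the time saved by read_sas, in percent
pct_saved <- 100 * (1 - mean_haven / mean_sas7bdat)
# How many times faster read_sas is on average
speedup <- mean_sas7bdat / mean_haven

round(pct_saved, 1)
# [1] 99.4
round(speedup)
# [1] 175
```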
minor update
I previously wasn't able to figure out whether the two methods produce the same data (i.e., whether both read the file with equal fidelity), but I have finally checked:
library(data.table)
# Convert both to data.tables
sas7bdat <- setDT(read.sas7bdat("psu97ai.sas7bdat"))
haven <- setDT(read_sas("psu97ai.sas7bdat"))
# read.sas7bdat reads strings as factors,
# and as of now has no stringsAsFactors argument
# with which to prevent this
idj_factor <- names(which(sapply(sas7bdat, is.factor)))
# Convert all factor columns to character
sas7bdat[ , (idj_factor) := lapply(.SD, as.character), .SDcols = idj_factor]
# Check equality of the tables
all.equal(sas7bdat, haven, check.attributes = FALSE)
# [1] TRUE
However, note that read.sas7bdat has kept a massive list of attributes for the file, presumably a holdover from SAS:
str(sas7bdat)
# ...
# - attr(*, "column.info")=List of 70
# ..$ :List of 12
# .. ..$ name : chr "NCESSCH"
# .. ..$ offset: int 200
# .. ..$ length: int 12
# .. ..$ type : chr "character"
# .. ..$ format: chr "$"
# .. ..$ fhdr : int 0
# .. ..$ foff : int 76
# .. ..$ flen : int 1
# .. ..$ label : chr "UNIQUE SCHOOL ID (NCES ASSIGNED)"
# .. ..$ lhdr : int 0
# .. ..$ loff : int 44
# .. ..$ llen : int 32
# ...
So, if by any chance you need these attributes (I know some people are particularly keen on the labels, for instance), perhaps read.sas7bdat is the option for you after all.
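If the labels are what you're after, they can be pulled out of the column.info attribute shown above. Here's a minimal sketch. To keep it runnable without the SAS file, it uses a small mock object that mimics the structure in the str output; the ENROLL column, the data values, and the helper name get_labels are all made up for illustration and are not part of sas7bdat.

```r
# Mock of what read.sas7bdat returns: a data.frame carrying a
# "column.info" attribute (one list per column). The ENROLL column
# and all data values are invented for illustration.
mock <- data.frame(NCESSCH = c("010000200277", "010000200279"),
                   ENROLL = c(120, 340),
                   stringsAsFactors = FALSE)
attr(mock, "column.info") <- list(
  list(name = "NCESSCH", type = "character",
       label = "UNIQUE SCHOOL ID (NCES ASSIGNED)"),
  list(name = "ENROLL", type = "numeric")  # no label on this one
)

# Hypothetical helper: named character vector of labels,
# with NA where a column has no label
get_labels <- function(x) {
  info <- attr(x, "column.info")
  labels <- vapply(info, function(ci) {
    if (is.null(ci[["label"]])) NA_character_ else ci[["label"]]
  }, character(1))
  names(labels) <- vapply(info, function(ci) ci[["name"]], character(1))
  labels
}

get_labels(mock)
```

On a real read.sas7bdat result, the same call would be get_labels(sas7bdat), assuming every entry of column.info has a name field as in the output above.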
[...] sas7bdat to apply formats, I just tried haven again and it gives me an error. If I must, I use a wrapper for Hmisc::sas.get to read a directory of SAS data sets and return a list of data frames which, although it requires a working SAS, has always worked for me. – Ikeda
[...] haven :) In such cases, though Hmisc requires a working SAS, knowing the alternatives is helpful. – Rabi
haven reads the files just fine. I need the formats also because the mountains of data I get from SAS are largely unformatted. When haven doesn't throw me a vague error, it doesn't really apply the formats; it only keeps them as attributes, requiring a little more user legwork: not much, not difficult, but room for errors. Hmisc::sas.get (and the wrapper fn I use) do all this in SAS (optionally) and return the formatted data frame. – Ikeda