Reading in only part of a Stata .DTA file in R

Asked 11/4, 2011 at 12:50 Answered 12/4, 2011 at 13:37

I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't seem to locate it in the help files, by searching SO, or by Googling.

I'm working with some datasets that are several GB right now. They fit in memory on one of the cluster nodes I have access to, but they take quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations, so that I have a dataset on which to test code. I can of course just read the whole file in and subset, but I was wondering if there's a way to tell read.dta() to read in only the first N rows? This would of course be much faster.

I could also use a proper format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and would have to recreate quite a few GB of data from someone else's code that's feeding into this project). So a direct solution for .dta files is preferred.

Kings answered 11/4, 2011 at 12:50 Comment(1)

It might be worthwhile pointing your stata-using-colleague in the direction of the outsheet function for exporting to CSV. A little late for this project perhaps, but it might make it easier next time you work together. ats.ucla.edu/stat/stata/faq/outsheet.htm – Hysteric 11/4, 2011 at 17:10

Stata's binary files are written row-by-row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this only helps if you do not need the value labels: they are written at the end of the file, so getting them would require reading the entire file, which wouldn't save any time.
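
To illustrate the row-by-row idea only, here is a minimal base-R sketch on a made-up binary layout (a row count, then fixed-width rows, then a label string at the end). The layout, the file, and the function name read_first_rows are all assumptions for illustration; this is not the real .dta format and not the code in stataread.c.

```r
# Toy layout (NOT the real .dta format):
#   [n_rows: int32][row 1][row 2]...[label string at the end]
# Each row is an int32 id followed by a float64 value.
path <- tempfile(fileext = ".bin")
n_total <- 1000L

con <- file(path, "wb")
writeBin(n_total, con, size = 4L, endian = "little")
for (i in seq_len(n_total)) {
  writeBin(i, con, size = 4L, endian = "little")      # id
  writeBin(i / 2, con, size = 8L, endian = "little")  # value
}
writeBin("1=low;2=high", con)  # stand-in for value labels stored last
close(con)

# Hypothetical helper: read the header, then stop after n rows.
# The label trailer is never reached, so no labels come back.
read_first_rows <- function(path, n) {
  con <- file(path, "rb")
  on.exit(close(con))
  n_rows <- readBin(con, "integer", n = 1L, size = 4L, endian = "little")
  n <- min(n, n_rows)
  ids <- integer(n)
  values <- numeric(n)
  for (i in seq_len(n)) {
    ids[i] <- readBin(con, "integer", n = 1L, size = 4L, endian = "little")
    values[i] <- readBin(con, "double", n = 1L, size = 8L, endian = "little")
  }
  data.frame(id = ids, value = values)
}

head_rows <- read_first_rows(path, 5L)
```

Because the reader stops after the requested rows, the cost depends on n rather than on the file size, but anything stored after the rows (the labels here) is simply not seen.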

Dropsy answered 11/4, 2011 at 13:45 Comment(0)

That's going to be a difficult one, as the do_readStata function under the hood is compiled code that can only take in the whole file. I believe that binary files are in general hard to read line by line, and .dta is a binary format. R's native binary format doesn't let you select a number of rows from the dataset while reading it in, either.

In my humble opinion, you would be better off creating a set of test files from within Stata (e.g. the Stata command sample 1000, count will give you a sample of 1000 observations from the loaded dataset) and working with them. And if you have no access to Stata, someone else in the project should be able to do that for you.

Swamp answered 11/4, 2011 at 13:12 Comment(2)

Bummer, but thanks. I imagine it's theoretically possible, though, because you can do it in Stata with something like use myfile.dta in 1/1000. I try to stick to R as much as possible, but I may just go Stata-ize the test sets. – Kings 11/4, 2011 at 17:57

@gsk3 : it is possible if you hack into the source of the foreign package, as Joshua explained, but you need to find a way to read the end of the file as well to get the labels. – Swamp 11/4, 2011 at 21:21

To follow up on Joris Meys: For this kind of thing, I use a "test" data set and the "real" data set, each in separate folders. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the right folder containing one or the other. I probably do it differently for every project, but something like this:

data creation .do file

blah blah blah
save data/myfile.dta, replace
keep if uniform()<.05               // save takes no if/using; or bsample, then save, for panel data
save test_data/myfile.dta, replace

analysis .do file

local test = "test_"   
// when you're ready to run the file with all the data, use the following 
// local test = ""

use `test'data/myfile.dta
blah blah blah 
outreg2 ... using `test'output/mytable.txt
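
The same switch can be sketched on the R side, for anyone reading the sampled file from R. This is only an illustration under assumptions: the function name project_paths is made up, and the folder and file names simply mirror the ones used in the .do file above.

```r
# Hypothetical helper: build input/output paths for either the sampled
# "test_" folders or the full-data folders (names mirror the .do file).
project_paths <- function(use_test = TRUE) {
  prefix <- if (use_test) "test_" else ""
  list(
    data   = file.path(paste0(prefix, "data"), "myfile.dta"),
    output = file.path(paste0(prefix, "output"), "mytable.txt")
  )
}

paths <- project_paths(use_test = TRUE)
# dat <- foreign::read.dta(paths$data)  # uncomment once the file exists
```

Flipping use_test to FALSE points the same script at the full dataset, so nothing else in the analysis code needs to change.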

Sleave answered 12/4, 2011 at 13:37 Comment(0)
