R: How to quickly read large .dta files without RAM limitations

I have a 10 GB .dta Stata file and I am trying to read it into 64-bit R 3.3.1. I am working on a virtual machine with about 130 GB of RAM (4 TB HD), and the .dta file has about 3 million rows and somewhere between 400 and 800 variables.

I know data.table's fread() is the fastest way to read in .txt and .csv files, but does anyone have a recommendation for reading largeish .dta files into R? Opening the file in Stata takes about 20-30 seconds, although I need to raise Stata's memory limit before opening the file (I set the max at 100 GB).

I have not tried exporting to .csv from Stata, as I hope to avoid touching the file with Stata at all. A solution is suggested in Using memisc to import stata .dta file into R, but it assumes RAM is scarce. In my case, I should have sufficient RAM to work with the file.

Freckle answered 8/8, 2016 at 2:40 Comment(4)
If you are comfortable with Python, you could convert your .dta file to a .csv file. The SO link Convert Stata .dta file to CSV without Stata software describes this in one of the answers (not the top answer).Therein
If you have enough RAM, foreign::read.dta() should work, but it doesn't support the latest Stata format.Iz
Perhaps I should have articulated this better: the goal is to use R and do it QUICKLY. read.dta is incredibly slow, and I'm hoping to avoid converting the file to .csv.Freckle
It's still conceivable that dta -> csv -> data.table would be your fastest option (although I hope not). If I were you I'd look through the results of library(sos); findFn("stata dta") and benchmark on a reasonable (1GB?) size subset.Gwendagwendolen

The fastest way to load a large Stata dataset into R is with the readstata13 package. I compared the performance of the foreign, readstata13, and haven packages on a large dataset in this post, and the results repeatedly showed that readstata13 is the fastest available package for reading Stata datasets into R.
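
A minimal sketch of basic usage (the file name is a placeholder, and convert.factors = FALSE is my own suggestion for skipping factor conversion on a wide file, not something benchmarked in the post above):

library(readstata13)

# read.dta13() handles newer .dta formats; skipping factor conversion
# can save time on files with many labelled variables
dat <- read.dta13("my_large_file.dta", convert.factors = FALSE)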

Fuchs answered 12/8, 2016 at 18:25 Comment(0)

Since this post is at the top of the search results, I re-ran the benchmark on the current versions of haven and readstata13. At this point the two packages are comparable, with haven slightly faster. In terms of time complexity, both scale approximately linearly with the number of rows read.

[Plot showing run times of both packages, along with a best-fit line]

Here is the code to run the benchmark:

library(haven)
library(readstata13)
library(dplyr)

# Row counts to benchmark: 10^2, 10^2.5, ..., 10^7
sizes <- 10^(seq(2, 7, .5))

benchmark_read <- function(n_rows) {
  # Time haven::read_dta(), reading only the first n_rows rows
  start_t_haven <- Sys.time()
  maisanta_dataset <- read_dta("my_large_file.dta", n_max = n_rows)
  end_t_haven <- Sys.time()

  # Time readstata13::read.dta13() on the same number of rows
  start_t_readstata13 <- Sys.time()
  maisanta_dataset <- read.dta13("my_large_file.dta", select.rows = n_rows)
  end_t_readstata13 <- Sys.time()

  tibble(size = n_rows,
         haven_time = end_t_haven - start_t_haven,
         readstata13_time = end_t_readstata13 - start_t_readstata13)
}

benchmark_results <-
  lapply(sizes, benchmark_read) %>%
  bind_rows()
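
The plot can then be reproduced along these lines (a sketch, not the exact code behind the figure above; ggplot2 and tidyr are assumed, and the log-log scales and per-package fit lines are my own choices):

library(ggplot2)
library(tidyr)
library(dplyr)

benchmark_results %>%
  # reshape to long format so both packages appear in one plot
  pivot_longer(ends_with("_time"), names_to = "package", values_to = "seconds") %>%
  mutate(seconds = as.numeric(seconds)) %>%
  ggplot(aes(x = size, y = seconds, colour = package)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # best-fit line per package
  scale_x_log10() +
  scale_y_log10()
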
Avid answered 25/9, 2021 at 4:47 Comment(0)

I recommend the haven R package. Unlike foreign, it can read the latest Stata formats:

library(haven)
data <- read_dta('myfile.dta')

Not sure how fast it is compared to other options, but your choices for reading Stata files in R are rather limited. My understanding is that haven wraps a C library (ReadStat), so it's probably your fastest option.
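
If you only need some of the 400-800 variables, haven can also read a subset of columns or rows, which should help with a file this wide (a sketch; the column names here are placeholders, not from your data):

library(haven)

# Read only the columns you need; col_select accepts tidyselect syntax
subset_data <- read_dta("myfile.dta",
                        col_select = c(id, income, starts_with("wave_")))

# Or read just the first rows to inspect the structure before a full load
preview <- read_dta("myfile.dta", n_max = 1000)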

Glazier answered 8/8, 2016 at 14:42 Comment(0)
