R: How to quickly read large .dta files without RAM limitations

I have a 10 GB .dta Stata file and I am trying to read it into 64-bit R 3.3.1. I am working on a virtual machine with about 130 GB of RAM (4 TB HD), and the .dta file has about 3 million rows and somewhere between 400 and 800 variables.

I know data.table's fread() is the fastest way to read in .txt and .csv files, but does anyone have a recommendation for reading largeish .dta files into R? Opening the file in Stata takes about 20-30 seconds, although I need to raise Stata's memory limit before opening the file (I set the max at 100 GB).

I have not tried exporting to .csv from Stata, as I hope to avoid touching the file with Stata at all. A solution is suggested in Using memisc to import stata .dta file into R, but it assumes RAM is scarce. In my case, I should have sufficient RAM to work with the file.

Freckle answered 8/8, 2016 at 2:40 Comment(4)
If you are comfortable with Python, you could convert your .dta file to a .csv file. The SO link Convert Stata .dta file to CSV without Stata software describes this in one of the answers (not the top answer).Therein
If you have enough RAM, foreign::read.dta() should work, but it doesn't support the latest Stata format.Iz
Perhaps I should have articulated this better: the goal is to use R and do it QUICKLY. read.dta is incredibly slow, and I'm hoping to avoid converting the file to .csv.Freckle
It's still conceivable that dta -> csv -> data.table would be your fastest option (although I hope not). If I were you I'd look through the results of library(sos); findFn("stata dta") and benchmark on a reasonable (1GB?) size subset.Gwendagwendolen

The fastest way to load a large Stata dataset into R is with the readstata13 package. I compared the performance of the foreign, readstata13, and haven packages on a large dataset in this post, and the results repeatedly showed that readstata13 is the fastest available package for reading Stata datasets into R.
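
A minimal sketch of basic usage (the file name is a placeholder, and convert.factors = FALSE is my own suggestion for skipping factor conversion on a wide file, not something benchmarked in the post above):

library(readstata13)

# read.dta13() handles newer .dta formats; skipping factor conversion
# can save time on files with many labelled variables
dat <- read.dta13("my_large_file.dta", convert.factors = FALSE)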

Fuchs answered 12/8, 2016 at 18:25 Comment(0)

Since this post is at the top of the search results, I re-ran the benchmark on the current versions of haven and readstata13. At this point the two packages are comparable, with haven slightly faster. In terms of time complexity, both scale approximately linearly with the number of rows read.

[Plot showing run times of both packages, along with a best-fit line]

Here is the code to run the benchmark:

library(haven)
library(readstata13)
library(dplyr)

# Row counts to benchmark: 10^2, 10^2.5, ..., 10^7
sizes <- 10^(seq(2, 7, .5))

benchmark_read <- function(n_rows) {
  # Time haven::read_dta(), reading only the first n_rows rows
  start_t_haven <- Sys.time()
  maisanta_dataset <- read_dta("my_large_file.dta", n_max = n_rows)
  end_t_haven <- Sys.time()

  # Time readstata13::read.dta13() on the same number of rows
  start_t_readstata13 <- Sys.time()
  maisanta_dataset <- read.dta13("my_large_file.dta", select.rows = n_rows)
  end_t_readstata13 <- Sys.time()

  tibble(size = n_rows,
         haven_time = end_t_haven - start_t_haven,
         readstata13_time = end_t_readstata13 - start_t_readstata13)
}

benchmark_results <-
  lapply(sizes, benchmark_read) %>%
  bind_rows()
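
The plot can then be reproduced along these lines (a sketch, not the exact code behind the figure above; ggplot2 and tidyr are assumed, and the log-log scales and per-package fit lines are my own choices):

library(ggplot2)
library(tidyr)
library(dplyr)

benchmark_results %>%
  # reshape to long format so both packages appear in one plot
  pivot_longer(ends_with("_time"), names_to = "package", values_to = "seconds") %>%
  mutate(seconds = as.numeric(seconds)) %>%
  ggplot(aes(x = size, y = seconds, colour = package)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # best-fit line per package
  scale_x_log10() +
  scale_y_log10()
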
Avid answered 25/9, 2021 at 4:47 Comment(0)

I recommend the haven R package. Unlike foreign, it can read the latest Stata formats:

library(haven)
data <- read_dta('myfile.dta')

Not sure how fast it is compared to other options, but your choices for reading Stata files in R are rather limited. My understanding is that haven wraps a C library (ReadStat), so it's probably your fastest option.
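
If you only need some of the 400-800 variables, haven can also read a subset of columns or rows, which should help with a file this wide (a sketch; the column names here are placeholders, not from your data):

library(haven)

# Read only the columns you need; col_select accepts tidyselect syntax
subset_data <- read_dta("myfile.dta",
                        col_select = c(id, income, starts_with("wave_")))

# Or read just the first rows to inspect the structure before a full load
preview <- read_dta("myfile.dta", n_max = 1000)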

Glazier answered 8/8, 2016 at 14:42 Comment(0)
