Convert .CSV files to .DTA files in Python

Asked 10/10, 2013 at 12:32 Answered 15/4, 2014 at 9:0

I'm looking to automate the conversion of many .CSV files into .DTA files with Python. .DTA is the file format used by Stata, the statistics package.

I have not been able to find a way to go about doing this, however.

R has write.dta, which writes an R data frame to a .dta file, and R can be called from Python via RPy, but I can't figure out how to use RPy to access R's write.dta function.

Any ideas?

Rheumy answered 10/10, 2013 at 12:32 Comment(4)

Get a specification of the DTA file format and parse the CSV accordingly? – Monoclinous 10/10, 2013 at 12:50

I don't see why it matters here that it is a binary file; Python can work with binary data just fine. – Monoclinous 10/10, 2013 at 13:13

@Parseltongue: have you thoroughly read the RPy docs? P.S. basically, does the question boil down to "How to write DTA files in R?"? – Pasha 10/10, 2013 at 13:19

https://mcmap.net/q/376443/-save-dta-files-in-python might be useful - have you tried? – Geddes 10/10, 2013 at 13:26

You need rpy2 for Python and also the foreign package installed in R. You do that by starting R and typing install.packages("foreign"). You can then quit R and go back to Python.

Then this:

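import rpy2.robjects as robjects

# Load R's foreign package, which provides write.dta
robjects.r("library(foreign)")
# Read the CSV into an R data frame named x, then write it out as .dta
robjects.r('x <- read.csv("test.csv")')
robjects.r('write.dta(x, "test.dta")')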

You can construct the string passed to robjects.r from Python variables if you want, something like:

robjects.r('x=read.csv("%s")' % fileName)
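Since the question is about many files, the same idea can be wrapped in a loop. This is only a sketch: r_convert_command and convert_folder are names made up for illustration, and using json.dumps to quote the paths is an assumption that should hold for ordinary file paths, not a general R quoting routine. rpy2 is imported inside the function because it needs a working R installation with the foreign package.

```python
import json
from pathlib import Path


def r_convert_command(csv_path, dta_path):
    """Build the R code that reads one CSV and writes it as a .dta file."""
    # as_posix() gives forward slashes, which R accepts on every platform.
    csv_literal = json.dumps(Path(csv_path).as_posix())
    dta_literal = json.dumps(Path(dta_path).as_posix())
    return "foreign::write.dta(read.csv(%s), %s)" % (csv_literal, dta_literal)


def convert_folder(folder):
    """Convert every *.csv in `folder` to a .dta file next to it."""
    import rpy2.robjects as robjects  # lazy import: requires R and rpy2

    for csv_path in sorted(Path(folder).glob("*.csv")):
        robjects.r(r_convert_command(csv_path, csv_path.with_suffix(".dta")))
```

Calling convert_folder("my_csv_dir") would then write one .dta per CSV into the same directory.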

Geddes answered 10/10, 2013 at 13:25 Comment(0)

(copypasting from my answer to a previous question)

pandas DataFrame objects now have a "to_stata" method. So you can do for instance

import pandas as pd
df = pd.read_stata('my_data_in.dta')
df.to_stata('my_data_out.dta')

DISCLAIMER: the first step is quite slow (in my test, around 1 minute for reading a 51 MB dta - also see this question), and the second produces a file which can be way larger than the original one (in my test, the size goes from 51 MB to 111 MB). Spacedman's answer (the rpy2 approach) may look less elegant, but it is probably more efficient.
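For the CSV case asked about here, the same method can be combined with pd.read_csv. A minimal sketch, assuming the CSV column names are acceptable as Stata variable names; csv_to_dta and convert_folder are hypothetical helper names, not part of pandas:

```python
from pathlib import Path

import pandas as pd


def csv_to_dta(csv_path, dta_path=None):
    """Read one CSV with pandas and write it out as a Stata .dta file."""
    csv_path = Path(csv_path)
    dta_path = Path(dta_path) if dta_path else csv_path.with_suffix(".dta")
    df = pd.read_csv(csv_path)
    # write_index=False keeps the pandas row index out of the Stata file.
    df.to_stata(dta_path, write_index=False)
    return dta_path


def convert_folder(folder):
    """Convert every *.csv in `folder`; returns the list of files written."""
    return [csv_to_dta(p) for p in sorted(Path(folder).glob("*.csv"))]
```

convert_folder("my_csv_dir") writes each .dta next to its source CSV; the speed and file-size caveats above still apply.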

Pornography answered 15/4, 2014 at 9:0 Comment(4)

Warning to those unfamiliar with Stata: Be aware that the .dta format is not a constant, but dependent on version of Stata. Stata X can read .dta files for version X or lower, but it cannot necessarily read .dta files for higher versions. The format has changed about every 2 versions on average, so about once per 4 years. There is documentation. It's my impression that R is responsive to these changes, so going through R would usually be a good solution. I can't comment on Pandas. – Baryram 15/4, 2014 at 9:31

@NickCox true. I can only say that pandas was able to open a version later than X (don't know which one, but my STATA X was not able to open it), and then the exported dta could be opened with STATA X. – Pornography 15/4, 2014 at 15:20

Sounds good for you, except if the conversion process is downgrading the data and creating inconsistencies between you and other people using the "same" data. Unlikely, but watch out. As in my comment, correct program name is Stata. – Baryram 15/4, 2014 at 15:31

Yep, Stata, sorry. In my case, I verified all my results were reproducible as with the original. That said, the source code does warn for a couple of "NOT IMPLEMENTED" (minor, as far as I can judge) features: github.com/pydata/pandas/blob/master/pandas/io/stata.py – Pornography 16/4, 2014 at 21:40
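On the format-version point: in recent pandas releases, to_stata takes a version argument that selects the .dta format to write. As far as I know, 114 (the default) is readable by Stata 10 and later, 117 by Stata 13 and later, and 118 by Stata 14 and later. A small sketch, where write_dta is just an illustrative wrapper and the file names are placeholders:

```python
import pandas as pd


def write_dta(df, path, version=114):
    """Write `df` as .dta in the requested format version (114, 117, 118, ...)."""
    # An older format version keeps the file readable by older Stata releases.
    df.to_stata(path, write_index=False, version=version)


if __name__ == "__main__":
    df = pd.DataFrame({"id": [1, 2], "score": [2.5, 3.5]})
    write_dta(df, "for_old_stata.dta", version=114)
    write_dta(df, "for_stata14.dta", version=118)
```

If colleagues run different Stata releases, writing the oldest format everyone can open is probably the safer choice.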
