Pandas read_stata() with large .dta files

I am working with a Stata .dta file that is around 3.3 gigabytes, so it is large but not excessively so. I am interested in using IPython and tried to import the .dta file using Pandas, but something wonky is going on. My box has 32 gigabytes of RAM, and attempting to load the .dta file results in all the RAM being used (after ~30 minutes) and my computer stalling out. This doesn't 'feel' right, in that I am able to open the file in R using read.dta() from the foreign package with no problem, and working with the file in Stata is fine. The code I am using is:

%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')

and I am using IPython in Enthought's Canopy program. The reason for the %time is that I am interested in benchmarking this against R's read.dta().

My questions are:

  1. Is there something I am doing wrong that is resulting in Pandas having issues?
  2. Is there a workaround to get the data into a Pandas dataframe?
Ayeshaayin answered 2/11, 2013 at 17:9 Comment(1)
Are you using a 64-bit machine? Could it be that your R is the 64-bit version and your Python is 32-bit?Adaxial
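
For what it's worth, a quick way to check the Python side of that question (both calls are in the standard library):

import platform
import sys

print(platform.architecture()[0])  # e.g. '64bit' or '32bit' for the running interpreter
print(sys.maxsize > 2**32)         # True on a 64-bit Python build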

Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:

def load_large_dta(fname):
    import sys
    import pandas as pd

    # Read the file through the iterator interface instead of all at once.
    reader = pd.read_stata(fname, iterator=True)
    chunks = []

    try:
        chunk = reader.get_chunk(100 * 1000)
        while len(chunk) > 0:
            chunks.append(chunk)
            chunk = reader.get_chunk(100 * 1000)
            print('.', end='')
            sys.stdout.flush()
    except (StopIteration, KeyboardInterrupt):
        pass

    # Concatenate whatever was read (DataFrame.append was removed in pandas 2.0).
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()

    print('\nloaded {} rows'.format(len(df)))

    return df

I loaded an 11G Stata file in 100 minutes with this, and it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.

This notebook shows it in action.

Isogonic answered 7/8, 2015 at 4:33 Comment(1)
I tested this function and the read_stata chunksize method (as suggested by Jinhua Wang) against read_stata without chunksize, on a dataset with 1.8m rows. For me, the no-chunksize method took 5 minutes. Running the two optimisations twice each, the function method was faster both times (by 10 seconds the first time at ~3 minutes, by 60 seconds the second time at ~2 minutes)Hornback

There is a simpler way to do this using Pandas' built-in read_stata function.

Assume your large file is named large.dta.

import pandas as pd

reader = pd.read_stata("large.dta", chunksize=100000)

# Gather the chunks, then concatenate them once
# (DataFrame.append was removed in pandas 2.0).
chunks = []
for chunk in reader:
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)

df.to_csv("large.csv")
Iconoscope answered 27/10, 2018 at 15:45 Comment(0)

For everyone who ends up on this page, please upgrade Pandas to the latest version. I had this exact problem, with a stalled computer during the load (a 300 MB Stata file but only 8 GB of system RAM), and upgrading from v0.14 to v0.16.2 solved the issue in a snap.

Currently, the latest version is v0.16.2. There have been significant speed improvements, though I don't know the specifics. See: most efficient I/O setup between Stata and Python (Pandas)
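
To check which version you are running before (and after) upgrading, a minimal snippet; the shell command assumes pandas was installed with pip:

import pandas as pd

print(pd.__version__)   # the stall above was seen on v0.14 and gone on v0.16.2

# Then, from a shell:
#   pip install --upgrade pandas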

Rozina answered 5/7, 2015 at 20:53 Comment(0)

Question 1.

There's not much I can say about this.

Question 2.

Consider exporting your .dta file to .csv using the Stata command outsheet or export delimited, and then using read_csv() in pandas. In fact, you could take the newly created .csv file, use it as input for R, and compare it with pandas (if that's of interest). read_csv is likely to have had more testing than read_stata.

Run help outsheet in Stata for details of the export.
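
As a rough sketch of that round trip (the file names are placeholders, and the Stata side is shown only as comments):

import pandas as pd

# In Stata, export the dataset first, e.g. with either of:
#   . export delimited using "my_dta_file.csv", replace
#   . outsheet using "my_dta_file.csv", comma replace

# Then read the CSV in pandas; reading in chunks keeps peak memory bounded.
chunks = pd.read_csv("my_dta_file.csv", chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)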

Enumeration answered 3/11, 2013 at 6:28 Comment(0)

You should not be reading a 3 GB+ file into an in-memory data object; that's a recipe for disaster (and has nothing to do with pandas). The right way to do this is to memory-map the file and access the data as needed.

You should consider converting your file to a more appropriate format (CSV or HDF), and then you can use Dask's wrapper around the pandas DataFrame to chunk-load the data as needed:

from dask import dataframe as dd

# If you don't want to use all the columns, make a selection
# (the keyword is usecols, as in pandas.read_csv)
columns = ['column1', 'column2']
data = dd.read_csv('your_file.csv', usecols=columns)

This will transparently take care of chunk-loading, multicore data handling and all that stuff.
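
The conversion step itself isn't shown above; here is a minimal sketch of one way to do it with a chunked read_stata and an HDF5 store (the file names and chunk size are placeholders, and string-heavy Stata columns may need extra HDFStore options such as min_itemsize):

import pandas as pd
from dask import dataframe as dd

# One-time conversion: stream the .dta file into an HDF5 table chunk by chunk,
# so the full dataset never has to sit in memory at once.
with pd.HDFStore("my_dta_file.h5", mode="w") as store:
    for chunk in pd.read_stata("my_dta_file.dta", chunksize=100_000):
        store.append("data", chunk, index=False)

# Dask can then open the store lazily and load partitions on demand.
data = dd.read_hdf("my_dta_file.h5", key="data")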

Lunatic answered 24/8, 2015 at 13:7 Comment(1)
In your particular case, hdf would be an even better format than csvLunatic
