Large, persistent DataFrame in pandas

6

100

I am exploring switching to python and pandas as a long-time SAS user.

However, when running some tests today, I was surprised that Python ran out of memory when trying to pandas.read_csv() a 128 MB CSV file. It had about 200,000 rows and 200 columns of mostly numeric data.

With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.

Is there something analogous in pandas?

I regularly work with large files and do not have access to a distributed computing network.

Galumph answered 24/7, 2012 at 0:50 Comment(1)
I'm not familiar with pandas, but you might want to look at iterating through the file. pandas.pydata.org/pandas-docs/stable/…Outofdate
84

In principle it shouldn't run out of memory, but there are currently memory problems with read_csv on large files caused by some complex Python internal issues (this is vague but it's been known for a long time: http://github.com/pydata/pandas/issues/407).

At the moment there isn't a perfect solution (here's a tedious one: you could transcribe the file row by row into a pre-allocated NumPy array or memory-mapped file (np.memmap)), but it's one I'll be working on in the near future. Another solution is to read the file in smaller pieces (use iterator=True, chunksize=1000), then concatenate them with pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
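As a rough sketch of the pre-allocation idea (this is an illustration, not the eventual fix inside pandas itself; the file name, shape, and dtype are assumptions), you can parse the CSV row by row into an array that is allocated once up front:

import csv
import numpy as np

n_rows, n_cols = 200000, 200   # assumed to be known in advance
# np.memmap('cache.dat', dtype=np.float64, mode='w+', shape=(n_rows, n_cols))
# could be used instead to back the array with a file on disk.
data = np.empty((n_rows, n_cols), dtype=np.float64)

with open('large_dataset.csv', newline='') as f:   # hypothetical file name
    reader = csv.reader(f)
    next(reader)                                   # skip the header row
    for i, row in enumerate(reader):
        data[i] = np.asarray(row, dtype=np.float64)

# Only one row of Python strings is alive at a time, so peak memory stays
# close to the size of the final array.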

Policeman answered 24/7, 2012 at 1:10 Comment(3)
Say I can read the file and concat all of them together into one DataFrame. Does the DataFrame have to reside in memory? With SAS, I can work with datasets of any size as long as I have the hard-drive space. Is it the same with DataFrames? I get the impression they are constrained by RAM and not hard-drive space. Sorry for the noob question and thanks for your help. I'm enjoying your book.Galumph
Right, you're constrained by RAM. SAS indeed has much better support for "out-of-core" big data processing.Policeman
@WesMcKinney These workarounds shouldn't be needed any longer, because of the new csv loader you landed in 0.10, right?Saurian
89

Wes is of course right! I'm just chiming in to provide slightly more complete example code. I had the same issue with a 129 MB file, which was solved by:

import pandas as pd

# Returns a TextFileReader, which is iterable with chunks of 1000 rows.
csv_iterator = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)

# Iterate through the DataFrame chunks and print one row/record at a time
for chunk in csv_iterator:
    for index, row in chunk.iterrows():
        print(row)

# Iterating exhausts the reader, so create a fresh one before concatenating.
csv_iterator = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)

# df is a DataFrame. On older pandas versions that reject a TextFileReader,
# use pd.concat(list(csv_iterator), ignore_index=True) instead.
df = pd.concat(csv_iterator, ignore_index=True)
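If even the concatenated DataFrame is too large for RAM, the chunks can be reduced on the fly instead of being collected; a small hypothetical example (the column name is made up) that sums one column chunk by chunk:

import pandas as pd

total = 0.0
# Each chunk is processed and then released, so memory use stays at roughly one chunk.
for chunk in pd.read_csv('large_dataset.csv', chunksize=1000):
    total += chunk['some_numeric_column'].sum()   # hypothetical column name

print(total)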
Despicable answered 30/8, 2012 at 8:57 Comment(11)
I think you can just do df = pd.concat(csv_iterator, ignore_index=True)?Disyllable
@smci Tried this quickly with the same data repeated x4 (550 MB) or x8 (1.1 GB). Interestingly, with or without [x for x in csv_iterator], the x4 went through fine, and the x8 crashed with a MemoryError.Despicable
@Despicable Are you using ipython / is the dataframe persisting in memory rather than being garbage collected?Disyllable
@AndyHayden Python 2.7.3 (default, Aug 1 2012, 05:14:39) [GCC 4.6.3] on linux2. How could the data be garbage collected - I still have it stored in a dataframe?Despicable
@Despicable Sorry not sure I understand the issue then :)Disyllable
I get this error while using it: AssertionError: first argument must be a list-like of pandas objects, you passed an object of type "TextFileReader". Any idea what is happening here?Firsthand
This bug will be fixed in 0.14 (releasing soon), github.com/pydata/pandas/pull/6941; the workaround for < 0.14.0 is to do pd.concat(list(csv_iterator), ignore_index=True)O
What if the values are strings or categorical? I am getting the error: incompatible categories in categorical concatIdou
This was really really helpful. Thanks. Was stuck on it for quite some time.Orozco
why does ignore_index need to be true?Decane
Hi, may I also request a solution for a similar problem that I am facing? #46490974Billhead
45

This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the chunksize parameter (even with quite small values like 10000), but it didn't help much; I still ran into memory issues (my CSV was ~7.5 GB).

Right now, I just read the CSV file in chunks with a for loop and append them step by step to, e.g., an SQLite database:

import subprocess
import sqlite3

import pandas as pd

# Input and output file paths
in_csv = '../data/my_large.csv'
out_sqlite = '../data/my.sqlite'

table_name = 'my_table'  # name for the SQLite database table
chunksize = 100000       # number of lines to process at each iteration

# columns that should be read from the CSV file
columns = ['molecule_id', 'charge', 'db', 'drugsnow', 'hba', 'hbd', 'loc', 'nrb', 'smiles']

# Get the number of lines in the CSV file
nlines = subprocess.check_output(['wc', '-l', in_csv])
nlines = int(nlines.split()[0])

# Connect to the database
cnx = sqlite3.connect(out_sqlite)

# Iteratively read the CSV and dump the lines into the SQLite table.
# Note that skiprows=i means the already-processed lines are re-scanned on
# every pass; see the sketch below for a chunked-reader alternative.
for i in range(0, nlines, chunksize):

    df = pd.read_csv(in_csv,
                     header=None,      # no header in the file; column names are assigned below
                     nrows=chunksize,  # number of rows to read at each iteration
                     skiprows=i)       # skip the rows that were already read

    # assign the column names
    df.columns = columns

    df.to_sql(name=table_name,
              con=cnx,
              index=False,           # don't write the DataFrame index
              if_exists='append')

cnx.close()
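As a comment below also points out, the same load can be written with pandas' own chunked reader, which avoids re-scanning the skipped rows on every pass; a minimal sketch assuming the same (made-up) paths and column list as above:

import sqlite3
import pandas as pd

cnx = sqlite3.connect('../data/my.sqlite')

# read_csv with chunksize returns an iterator of DataFrames, so each chunk
# can be appended to the table and then discarded.
reader = pd.read_csv('../data/my_large.csv', header=None,
                     names=['molecule_id', 'charge', 'db', 'drugsnow', 'hba',
                            'hbd', 'loc', 'nrb', 'smiles'],
                     chunksize=100000)

for chunk in reader:
    chunk.to_sql(name='my_table', con=cnx, index=False, if_exists='append')

cnx.close()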
Protuberancy answered 6/2, 2015 at 17:46 Comment(3)
Super useful to see a realistic use-case for the chunked reading feature. Thanks.Putative
Just a small remark on this old topic: pandas.read_csv directly returns (at least on the version I'm currently using) an iterator if you simply provide iterator=True and chunksize=chunksize. Hence, you would just do a for loop over the pd.read_csv call instead of re-instantiating it every time. However, this only costs the call overhead, so there may be no significant impact.Ortolan
Hi, Joel. Thanks for the note! The iterator=True and chunksize parameters already existed back then if I remember correctly. Maybe there was a bug in an older version which caused the memory blow-up -- I will give it another try next time I read a large DataFrame in Pandas (I am mostly using Blaze now for such tasks)Protuberancy
6

Below is my workflow.

import sqlalchemy as sa
import pandas as pd
import psycopg2  # PostgreSQL driver used by the SQLAlchemy engine below

count = 0
con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')
# con = sa.create_engine('sqlite:///XXXXX.db')  # SQLite alternative
chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1",
                     sep=',', error_bad_lines=False, index_col=False, dtype='unicode')

Based on your file size, you may want to tune the chunksize.

for chunk in chunks:
    chunk.to_sql(name='Table', if_exists='append', con=con)
    count += 1
    print(count)

After all the data is in the database, you can query out just what you need.
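For that last step, something like the following pulls filtered rows back into pandas (the table and column names here are placeholders); pd.read_sql_query also accepts a chunksize argument if the result set is itself too large to hold at once:

import pandas as pd
import sqlalchemy as sa

con = sa.create_engine('postgresql://postgres:pwd@localhost:00001/r')  # same engine as above

# Only the rows matching the WHERE clause leave the database.
df = pd.read_sql_query('SELECT * FROM "Table" WHERE "some_column" = 123', con=con)
print(df.head())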

Biagi answered 19/5, 2017 at 21:29 Comment(0)
4

If you want to load huge CSV files, dask might be a good option. It mimics the pandas API, so it feels quite similar to pandas.

link to dask on github
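A minimal sketch of what that looks like (the file and column names are placeholders); dask reads the CSV lazily in partitions and only materializes a result when .compute() is called:

import dask.dataframe as dd

# Lazily partitions the CSV; nothing is loaded into memory yet.
df = dd.read_csv('my_large.csv')

# Operations build a task graph; compute() runs it partition by partition.
result = df.groupby('some_column')['some_value'].mean().compute()
print(result)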

Darcee answered 8/6, 2017 at 21:54 Comment(1)
Thanks, since I posted this I've been using dask and the parquet format.Galumph
1

You can use PyTables rather than a pandas DataFrame. It is designed for large data sets, and the file format is HDF5, so processing time is relatively fast.
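One way to combine the two (pandas' HDFStore is built on top of PyTables) is to append CSV chunks to an on-disk HDF5 table and query only what you need later; the file, key, and column names here are placeholders:

import pandas as pd

# Append CSV chunks to an on-disk HDF5 table (PyTables under the hood).
with pd.HDFStore('my_store.h5', mode='w') as store:
    for chunk in pd.read_csv('my_large.csv', chunksize=100000):
        store.append('data', chunk, data_columns=['molecule_id'])

# Later, pull back only the rows you need without loading the whole table.
subset = pd.read_hdf('my_store.h5', 'data', where='molecule_id == 12345')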

Nomi answered 10/2, 2017 at 17:38 Comment(0)
