To give as much context as possible: I'm trying to pull some data stored on a remote Postgres server (Heroku) into a pandas DataFrame, using psycopg2 to connect.
I'm interested in two specific tables, users and events. The connection itself works fine, because when pulling down the user data with
import pandas.io.sql as sql
# [...]
users = sql.read_sql("SELECT * FROM users", conn)
after waiting a few seconds, the DataFrame is returned as expected.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 67458 entries, 0 to 67457
Data columns (total 35 columns): [...]
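For completeness, conn is just a plain psycopg2 connection, opened roughly along these lines (all credentials below are placeholders):
import psycopg2
# placeholder credentials; in reality these come from the Heroku database URL
conn = psycopg2.connect(
    host="example.compute-1.amazonaws.com",
    dbname="mydb",
    user="myuser",
    password="mypassword",
    port=5432,
    sslmode="require",  # Heroku Postgres requires SSL
)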
Yet when trying to pull the bigger, heavier events data straight from IPython, after a long time it simply crashes:
In [11]: events = sql.read_sql("SELECT * FROM events", conn)
vagrant@data-science-toolbox:~$
and when trying from an IPython notebook I get the Dead kernel error:
The kernel has died, would you like to restart it? If you do not restart the kernel, you will be able to save the notebook, but running code will not work until the notebook is reopened.
Update #1:
To get a better idea of the size of the events table I'm trying to pull in, here are its record count and number of columns:
In [11]: sql.read_sql("SELECT count(*) FROM events", conn)
Out[11]:
count
0 2711453
In [12]: len(sql.read_sql("SELECT * FROM events LIMIT 1", conn).columns)
Out[12]: 18
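To put those numbers in perspective, one rough way to ballpark the in-memory footprint is to pull a small sample and extrapolate from its memory usage. A sketch, assuming a pandas version that has DataFrame.memory_usage(deep=True), with the row count hard-coded from above:
import pandas as pd
# pull a representative sample and measure its in-memory size
sample = pd.read_sql("SELECT * FROM events LIMIT 10000", conn)
bytes_per_row = sample.memory_usage(deep=True).sum() / float(len(sample))
# extrapolate to the full table (2,711,453 rows counted above)
print "Estimated in-memory size: %.2f GB" % (bytes_per_row * 2711453 / 1024.0 ** 3)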
Update #2:
Memory is definitely a bottleneck for the current implementation of read_sql: when pulling down the events table and then trying to run another instance of IPython, the result is
vagrant@data-science-toolbox:~$ sudo ipython
-bash: fork: Cannot allocate memory
Update #3:
I first tried a read_sql_chunked implementation that would simply return the list of partial DataFrames:
import pandas as pd

def read_sql_chunked(query, conn, nrows, chunksize=1000):
    start = 0
    dfs = []  # holds the partial DataFrames
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), conn)
        start += chunksize
        dfs.append(df)
        print "Events added: %s to %s of %s" % (start - chunksize, start, nrows)
    # print "concatenating dfs"
    return dfs

event_dfs = read_sql_chunked("SELECT * FROM events", conn, events_count, 100000)
and that works well, but when trying to concatenate the DataFrames, the kernel dies again.
And this is after giving the VM 2GB of RAM.
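For reference, newer pandas versions also accept a chunksize argument on read_sql_query, which yields partial DataFrames from a single query instead of separate LIMIT/OFFSET round trips (though, as far as I understand, the driver may still buffer the full result set client-side). A minimal sketch, with a placeholder SQLAlchemy connection string and handle_chunk standing in for whatever per-chunk processing is needed:
import pandas as pd
from sqlalchemy import create_engine
# placeholder connection string for the Heroku database
engine = create_engine("postgresql://user:password@host:5432/dbname")
# process (or collect) each partial frame as it arrives
for chunk in pd.read_sql_query("SELECT * FROM events", engine, chunksize=100000):
    handle_chunk(chunk)  # hypothetical per-chunk processing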
Based on Andy's explanation of the difference in implementation and performance between read_sql and read_csv, the next thing I tried was to append the records to a CSV file and then read them all into a DataFrame:
# write the first chunk with headers, then append the remaining chunks without them
event_dfs[0].to_csv(path + 'new_events.csv', encoding='utf-8')
for df in event_dfs[1:]:
    df.to_csv(path + 'new_events.csv', mode='a', header=False, encoding='utf-8')
Again, the writing to CSV completes successfully – a 657MB file – but reading from the CSV never completes.
How can one approximate how much RAM would be sufficient to read, say, a 657MB CSV file, since 2GB seems not to be enough?
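One thing that might narrow this down, noted here as a sketch with hypothetical column names: reading back only the columns that are actually needed, with explicit narrow dtypes, should shrink the in-memory footprint considerably:
import pandas as pd
# hypothetical column subset and dtypes; the real names come from the events schema
cols = ['user_id', 'event_type', 'created_at']
events = pd.read_csv(path + 'new_events.csv', usecols=cols,
                     dtype={'user_id': 'int32'},
                     parse_dates=['created_at'], encoding='utf-8')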
It feels like I'm missing some fundamental understanding of either DataFrames or psycopg2, but I'm stuck: I can't even pinpoint the bottleneck or where to optimize.
What's the proper strategy to pull larger amounts of data from a remote (postgres) server?
read_hdf/HDFStore would make it easy and fast to query subsets of the data on demand if it is too big to pull into memory all at once (much faster than SQL, and it is possible to query a subset, as opposed to CSV). – Maxim
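To make that suggestion concrete, my understanding is that the HDFStore route would look roughly like this (a sketch, assuming PyTables is installed and using user_id purely as an example query column):
import pandas as pd
store = pd.HDFStore(path + 'events.h5')
# append each chunk to an on-disk, queryable table;
# data_columns makes those columns usable in 'where' queries later
for df in event_dfs:
    store.append('events', df, data_columns=['user_id'])
store.close()
# later: pull only the subset that is needed, never the whole table
subset = pd.read_hdf(path + 'events.h5', 'events', where='user_id == 42')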