I am loading about 2 to 2.5 million records into a Postgres database every day.
I then read this data back with pd.read_sql to turn it into a DataFrame, do some column manipulation and some minor merging, and save the modified data as a separate table for other people to use.
When I write it back with df.to_sql, it takes forever. If I save a CSV file and use COPY FROM in Postgres, the whole thing only takes a few minutes, but the server is on a separate machine and it is a pain to transfer files there.
Using psycopg2, it looks like I can use copy_expert to get the benefit of bulk copying while staying in Python. If possible, I want to avoid writing an actual CSV file. Can I do this in memory with a pandas DataFrame?
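From the psycopg2 docs, copy_expert takes any file-like object, so I am picturing something roughly like this (just a sketch of my understanding; the connection string, table, and column names are placeholders, and df is the DataFrame I build below):

import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me host=dbserver")  # placeholder connection
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)  # write the frame to an in-memory buffer instead of a file on disk
buf.seek(0)  # COPY reads from the current position, so rewind to the start
cur = conn.cursor()
cur.copy_expert("COPY some_table (col_a, col_b) FROM STDIN WITH CSV", buf)
conn.commit()
cur.close()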
Here is an example of my pandas code. I would like to add copy_expert or something similar to make saving this data much faster, if possible.
for date in required_date_range:
    df = pd.read_sql(sql=query, con=pg_engine, params={'x': date})
    ...
    # do stuff to the columns
    ...
    df.to_sql('table_name', pg_engine, index=False, if_exists='append', dtype=final_table_dtypes)
Can someone help me with example code? I would prefer to keep using pandas, and it would be nice to do it in memory. If not, I will just write a temporary CSV file and do it that way.
Edit: here is my final code, which works. It only takes a couple of hundred seconds per date (millions of rows) instead of a couple of hours.
import io

to_sql = """COPY %s FROM STDIN WITH CSV HEADER"""

def process_file(conn, table_name, file_object):
    # conn is the SQLAlchemy engine; raw_connection() exposes the underlying psycopg2 connection
    fake_conn = conn.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.copy_expert(sql=to_sql % table_name, file=file_object)
    fake_conn.commit()
    fake_cur.close()

# after doing stuff to the dataframe
s_buf = io.StringIO()
df.to_csv(s_buf, index=False)  # index=False so the CSV columns match the table; drop it if the table stores the index
s_buf.seek(0)                  # rewind the buffer so COPY reads from the beginning
process_file(cms_dtypes.pg_engine, 'fact_cms_employee', s_buf)
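For context, dropped back into the loop from the top of the question, the write step ends up looking roughly like this (same names as above, with the column work elided):

for date in required_date_range:
    df = pd.read_sql(sql=query, con=pg_engine, params={'x': date})
    # ... do stuff to the columns ...
    s_buf = io.StringIO()
    df.to_csv(s_buf, index=False)
    s_buf.seek(0)
    process_file(cms_dtypes.pg_engine, 'fact_cms_employee', s_buf)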
s_buf = io.StringIO(), then df.to_csv(s_buf), which will store your df in a file-like buffer. Then maybe cur.copy_from(s_buf, ...) instead of copy_expert. – Interlineate
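For completeness, here is a sketch of the copy_from variant the comment suggests, using the same engine and table names as above. Note that copy_from has no HEADER option and defaults to a tab separator, so the buffer is written without a header and sep=',' is passed explicitly; it also assumes the data contains no embedded commas or quoting, which plain COPY (unlike CSV mode) does not handle.

s_buf = io.StringIO()
df.to_csv(s_buf, index=False, header=False)  # no header line: copy_from cannot skip one
s_buf.seek(0)

raw_conn = cms_dtypes.pg_engine.raw_connection()
cur = raw_conn.cursor()
cur.copy_from(s_buf, 'fact_cms_employee', sep=',', columns=list(df.columns))
raw_conn.commit()
cur.close()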