Turn pandas dataframe into a file-like object in memory?
I am loading about 2 - 2.5 million records into a Postgres database every day.

I then read this data with pd.read_sql to turn it into a dataframe and then I do some column manipulation and some minor merging. I am saving this modified data as a separate table for other people to use.

When I do pd.to_sql it takes forever. If I save a CSV file and use COPY FROM in Postgres, the whole thing only takes a few minutes, but the server is on a separate machine and it is a pain to transfer files there.

Using psycopg2, it looks like I can use copy_expert to benefit from the bulk copying while still using Python. I want to avoid writing an actual CSV file if possible. Can I do this in memory with a pandas dataframe?

Here is an example of my pandas code. I would like to add copy_expert or something similar to make saving this data much faster if possible.

    for date in required_date_range:
        df = pd.read_sql(sql=query, con=pg_engine, params={'x': date})
        # ... do stuff to the columns ...
        df.to_sql('table_name', pg_engine, index=False, if_exists='append', dtype=final_table_dtypes)

Can someone help me with example code? I would prefer to use pandas still and it would be nice to do it in memory. If not, I will just write a csv temporary file and do it that way.

Edit: here is my final code, which works. It only takes a couple of hundred seconds per date (millions of rows) instead of a couple of hours.

import io

to_sql = """COPY %s FROM STDIN WITH CSV HEADER"""

def process_file(conn, table_name, file_object):
    # grab a raw psycopg2 connection from the SQLAlchemy engine for COPY
    fake_conn = cms_dtypes.pg_engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.copy_expert(sql=to_sql % table_name, file=file_object)
    fake_conn.commit()
    fake_cur.close()


# after doing stuff to the dataframe
s_buf = io.StringIO()
df.to_csv(s_buf)
s_buf.seek(0)  # rewind the buffer so COPY reads from the start
process_file(cms_dtypes.pg_engine, 'fact_cms_employee', s_buf)
Whitewall answered 5/7, 2016 at 12:56 Comment(3)
I don't know psycopg2 but you could try something like: s_buf = io.StringIO(), df.to_csv(s_buf), which will store your df in a file-like buffer. Then maybe cur.copy_from(s_buf,...) instead of copy_expert.Interlineate
The StringIO worked! I still kept copy_expert though. It only took like 100 seconds versus 10000 seconds when I was using just plain pandas to_sql. Make a real answer so I can accept.Whitewall
Glad I could help.Interlineate

The Python module io (docs) has the necessary tools for file-like objects.

import io

# text buffer
s_buf = io.StringIO()

# saving a data frame to a buffer (same as with a regular file):
df.to_csv(s_buf)

Edit. (I forgot) In order to read from the buffer afterwards, its position should be set to the beginning:

s_buf.seek(0)

I'm not familiar with psycopg2 but according to docs both copy_expert and copy_from can be used, for example:

cur.copy_from(s_buf, table)

(For Python 2, see StringIO.)
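
Putting the pieces together, here is a minimal end-to-end sketch; the connection string and the table name my_table are made up for illustration, and the table is assumed to already exist with columns matching the CSV:

    import io

    import pandas as pd
    import psycopg2

    # hypothetical connection and data, purely for illustration
    conn = psycopg2.connect("dbname=mydb user=me")
    df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

    # write the frame into an in-memory text buffer
    s_buf = io.StringIO()
    df.to_csv(s_buf, index=False)
    s_buf.seek(0)  # rewind so COPY reads from the start

    with conn.cursor() as cur:
        cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", s_buf)
    conn.commit()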

Interlineate answered 5/7, 2016 at 22:0 Comment(3)
Thanks. Also, I am looping through each date to query for that date. Each time I am connecting to the database again. Is there a better way to not have to connect/reconnect each loop? Like I would connect once, and then I would just change the query during my loops?Whitewall
I'm not sure if I understand, but can't you just query for the whole thing (all dates in the range) at once? If this is too large, then maybe query in chunks. I guess it would add another column with dates to the data frame. Then you could either drop this column if you don't need it, or select and work with subframes, or groupby by dates and iterate over groups (see the sketch after these comments). Or, if you want to avoid pd.read_sql completely, maybe copy data with copy_expert/copy_to to a string buffer and load it into a data frame with pd.read_csv. This is just off the top of my head.Interlineate
AttributeError: '_io.StringIO' object has no attribute 'write_cells' Decadence
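
A rough sketch of the groupby idea from the comment above; the column name report_date and the start_date/end_date parameters are hypothetical, and pg_engine is the engine from the question:

    import pandas as pd

    # query the whole date range once instead of reconnecting per date
    query = "SELECT * FROM source_table WHERE report_date BETWEEN %(start)s AND %(end)s"
    df_all = pd.read_sql(sql=query, con=pg_engine,
                         params={'start': start_date, 'end': end_date})

    # work on one day's worth of rows at a time
    for date, df in df_all.groupby('report_date'):
        # ... do stuff to the columns, then COPY each day's frame as above ...
        pass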

I had problems implementing the solution from the answer above.

I think the issue stems from pandas setting the pos of the buffer to the end.

See as follows:

from StringIO import StringIO
df = pd.DataFrame({"name":['foo','bar'],"id":[1,2]})
s_buf = StringIO()
df.to_csv(s_buf)
s_buf.__dict__

# Output
# {'softspace': 0, 'buflist': ['foo,1\n', 'bar,2\n'], 'pos': 12, 'len': 12, 'closed': False, 'buf': ''}

Notice that pos is at 12. I had to set pos to 0 in order for the subsequent copy_from command to work:

s_buf.pos = 0
cur = conn.cursor()
cur.copy_from(s_buf, tablename, sep=',')
conn.commit()
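
For Python 3, the same fix looks like this with io.StringIO and the idiomatic seek(0). Note that copy_from has no HEADER option, so the header row and index are dropped when writing the CSV; this is only a sketch, assuming conn and tablename as above:

    import io

    import pandas as pd

    df = pd.DataFrame({"name": ['foo', 'bar'], "id": [1, 2]})

    s_buf = io.StringIO()
    # copy_from cannot skip a header line, so write plain rows only
    df.to_csv(s_buf, index=False, header=False)
    s_buf.seek(0)  # rewind instead of poking .pos directly

    cur = conn.cursor()
    cur.copy_from(s_buf, tablename, sep=',')
    conn.commit()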
Margiemargin answered 19/7, 2016 at 20:42 Comment(2)
I ran that code over the weekend and didn't get any errors. When I got to the office on Monday, though, my tables were empty, which sucked. I ended up writing temporary csv files, which does work completely. So if I do to_csv(s_buf) and then s_buf.pos=0, then it will work without writing a csv?Whitewall
@Margiemargin @Whitewall Good point! It's a standard behaviour of file/stream buffers (i.e. not the fault of pandas). I just forgot to include it, my apologies. An idiomatic way to set the position to the beginning is to run s_buf.seek(0).Interlineate

pandas.DataFrame.to_csv (since v1.0) will return a string if no file object is specified. For example:

df = pd.DataFrame([{'x': 1, 'y': 1}, {'x': 2, 'y': 4}, {'x': 3, 'y': 9}])

# outputs to a string
csv_as_string = df.to_csv(index=False)
print(repr(csv_as_string))  # prints 'x,y\r\n1,1\r\n2,4\r\n3,9\r\n' (on windows)

# outputs to a file
with open('example.csv', 'w', newline='') as f:
    df.to_csv(f, index=False)  # writes to file, returns None

From the current (v1.4.3) docs:

path_or_buf : str, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline='', disabling universal newlines. If a binary file object is passed, mode might need to contain a 'b'.
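
Tying this back to the question: the returned string can be wrapped in io.StringIO, whose read position starts at the beginning, and handed straight to copy_expert. A minimal sketch, with a made-up connection and table name (example_table):

    import io

    import pandas as pd
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection
    df = pd.DataFrame([{'x': 1, 'y': 1}, {'x': 2, 'y': 4}, {'x': 3, 'y': 9}])

    csv_as_string = df.to_csv(index=False)  # returns a str when no target is given
    s_buf = io.StringIO(csv_as_string)      # a fresh StringIO reads from position 0

    with conn.cursor() as cur:
        cur.copy_expert("COPY example_table FROM STDIN WITH CSV HEADER", s_buf)
    conn.commit()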

Fabio answered 28/6, 2022 at 3:41 Comment(0)
