I have the code below, which queries a database table of about 500k rows, and it throws a SIGKILL when it hits rows = cur.fetchall(). I've tried iterating through the cursor instead of loading everything into rows, but that still seems to cause OOM issues.
How can I pull all the data from a table and safely convert it into a Parquet file, regardless of the table's size?
import io

import pyarrow.parquet as pq
from psycopg2 import extras

def get_parquet_for_dataset_id(self, dataset, lob, max_dt):
    query = _table_query(lob, table_name, max_dt)
    conn = self.conns[lob]
    with conn:
        with conn.cursor(cursor_factory=extras.RealDictCursor) as cur:
            cur.execute(query)
            rows = cur.fetchall()        # loads every row into memory; this is where the SIGKILL happens
            table = rows_to_table(rows)
            pq_bytes = io.BytesIO()
            pq.write_table(table, pq_bytes)
            pq_bytes.seek(0)
            return pq_bytes
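One note on "iterating through the cursor": with psycopg2, a default (client-side) cursor transfers the entire result set into the client process as soon as execute() runs, so looping over it row by row does not bound memory. A named (server-side) cursor does. Below is a minimal sketch under that assumption; the cursor name "stream_rows", the batch size, and process() are placeholders, and conn/query are the objects from the snippet above.

from psycopg2 import extras

with conn:
    with conn.cursor(name="stream_rows", cursor_factory=extras.RealDictCursor) as cur:
        cur.itersize = 10_000          # rows pulled from the server per round trip
        cur.execute(query)
        for row in cur:                # no fetchall(); only itersize rows are buffered at a time
            process(row)               # placeholder for whatever happens per row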
Have you tried reading data from an iterator in chunks? – Psychophysics

The failure happens at cur.fetchall(). So I guess the question about the parquet file is irrelevant right now, but if there were a way to avoid writing all of these things into memory, that would be ideal. – Elonore

fetchall() won't work at all with data larger than memory (this should be obvious on its face; you can't fetch something you can't fit into process space). – Psychophysics

fetchmany() exists, so you can get a smaller collection of rows at a time, and just repeat that until you've processed all of them. Of course, if you're getting smaller collections at a time, then you need to be able to add them to your parquet file incrementally (so you can stop storing them in memory -- because if you just add each fetchmany()'s result to a list stored in RAM, you're back to your original problem), and needing code to do that is why the question isn't answerable without more context. – Psychophysics
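Putting those comments together, here is a minimal sketch of the chunked approach, assuming the same names as the question (_table_query, table_name, self.conns, rows coming back as dicts from RealDictCursor). The named server-side cursor, the batch_size parameter, and the pa.Table.from_pylist conversion are assumptions on my part, not the asker's code; pyarrow's ParquetWriter is what lets each fetchmany() batch be appended as its own row group instead of accumulating in a Python list.

import io

import pyarrow as pa
import pyarrow.parquet as pq
from psycopg2 import extras

def get_parquet_for_dataset_id(self, dataset, lob, max_dt, batch_size=10_000):
    query = _table_query(lob, table_name, max_dt)
    conn = self.conns[lob]
    pq_bytes = io.BytesIO()
    writer = None
    with conn:
        # a named cursor keeps the result set on the server; fetchmany()
        # then pulls batch_size rows per call instead of everything at once
        with conn.cursor(name="to_parquet", cursor_factory=extras.RealDictCursor) as cur:
            cur.execute(query)
            while True:
                rows = cur.fetchmany(batch_size)       # at most batch_size dicts in RAM
                if not rows:
                    break
                batch = pa.Table.from_pylist(rows)     # list of dicts -> Arrow table
                if writer is None:
                    # schema is inferred from the first batch; pass an explicit
                    # schema here if later batches could infer different types
                    writer = pq.ParquetWriter(pq_bytes, batch.schema)
                writer.write_table(batch)              # appended as one row group
    if writer is not None:
        writer.close()
    pq_bytes.seek(0)
    return pq_bytes

If even the compressed Parquet output is too large to keep in a BytesIO, the same loop works with a file path (or any other file-like object) as the ParquetWriter sink, so only the current batch ever lives in memory.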