Python Postgres: can I fetchall() 1 million rows?

I am using the psycopg2 module in Python to read from a Postgres database, and I need to do some operation on every row of a column in a table that has more than 1 million rows.

I would like to know: would cur.fetchall() fail or cause my server to go down? (My RAM might not be big enough to hold all that data.)

q = "SELECT names FROM myTable;"
cur.execute(q)
rows = cur.fetchall()
for row in rows:
    doSomething(row)

What is the smarter way to do this?

Stethoscope answered 29/7, 2013 at 20:14 Comment(2)
I'd say the smarter way to do this is to find a way to do whatever you're planning to do with all those rows on the database server side. If you find yourself fetching a million rows from the database, there's probably something wrong with your approach.Brittnee
@mustaccio, that's not a good point. You don't know what business logic and data set are behind the OP's question.Analemma

fetchall() fetches all remaining rows at once (the arraysize limit only applies to fetchmany()), so to prevent a massive hit on your client's memory you can either fetch rows in manageable batches with fetchmany(), or simply step through the cursor until it is exhausted:

row = cur.fetchone()
while row:
    # do something with row
    row = cur.fetchone()
Sousaphone answered 29/7, 2013 at 20:20 Comment(0)

The solution Burhan pointed out reduces the memory usage for large datasets by fetching only single rows:

row = cursor.fetchone()

However, I noticed a significant slowdown when fetching rows one by one. I access an external database over an internet connection, which might be a reason for it.

Having a server-side cursor and fetching rows in batches proved to be the most performant solution. You can change the SQL statements (as in alecxe's answer), but there is also a pure Python approach using the feature provided by psycopg2:

# Passing a name to cursor() makes psycopg2 create a server-side cursor
cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)

while True:
    # each fetchmany() transfers only the next batch of rows from the server
    rows = cursor.fetchmany(5000)
    if not rows:
        break

    for row in rows:
        # do something with row
        pass

You can find more about server-side cursors in the psycopg2 wiki.

Skindive answered 8/11, 2013 at 15:11 Comment(3)
Can you define "fetching"? Because the query is still completely executed, so the data is stored somewhere. On the database, it is still a single query, so I don't understand how fetchone or fetchmany saves memory.Mathison
Good question @StevenWade. Doing it this way you won't reduce the memory consumption of the DB instance, that's right. However, I understood this question as being about avoiding memory consumption in the client that connects to the database and fetches the data.Skindive
So how does it reduce the memory in the application? Does it break it up at the packet level? It must limit the amount of data that comes back somehow. It's not breaking up the query and all of the results couldn't come back all at once or it wouldn't be reducing the memory usage.Mathison

Consider using a server-side cursor:

When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process. If the query returned a huge amount of data, a proportionally large amount of memory will be allocated by the client.

If the dataset is too large to be practically handled on the client side, it is possible to create a server side cursor. Using this kind of cursor it is possible to transfer to the client only a controlled amount of data, so that a large dataset can be examined without keeping it entirely in memory.

Here's an example:

cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
    cursor.execute("FETCH 1000 FROM super_cursor")
    rows = cursor.fetchall()

    if not rows:
        break

    for row in rows:
        doSomething(row)
Bartley answered 29/7, 2013 at 20:17 Comment(5)
Thanks! What would be the advantage of this method over the one in the answer that Burhan Khalid gave? https://mcmap.net/q/417949/-python-postgres-can-i-fetchall-1-million-rowsStethoscope
Burhan's answer looks simpler and will work too. But making the server give you a controlled amount of data on each step of the iteration loop is the way to go.Bartley
Works for me without the BINARY keyword.Hyperbole
psycopg2 doesn't support BINARY cursors, as stated in the following link: postgresql.org/message-id/…. If somebody has any information on how to actually use a BINARY cursor (and not just a named cursor) through psycopg2, that would be greatly appreciated.Dynah
When I restart the code, the cursor starts from the last record that it has processed instead of going back to the beginning of the DB. How can I point the cursor to the beginning of the table?Sturges

Here is the approach to use for a simple server-side cursor combined with the speed of fetchmany-style batching.

The principle is to use a named cursor in psycopg2 and give it a good itersize, so that many rows are loaded at once as fetchmany would do, but with a single for rec in cursor loop that performs implicit fetchmany() calls behind the scenes.

With this approach I run queries over 150 million rows from a multi-billion-row table within 1 hour, using about 200 MB of RAM.
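
Since the code itself was not posted, here is a minimal sketch of the approach described; the connection parameters, cursor name, itersize value, table/column names, and doSomething() are all placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection parameters

# giving the cursor a name makes it a server-side (named) cursor
cursor = conn.cursor(name='my_named_cursor')
cursor.itersize = 20000  # rows transferred per network round trip (default is 2000)

cursor.execute("SELECT names FROM myTable")
for rec in cursor:  # implicit fetchmany() of itersize rows behind the scenes
    doSomething(rec)  # placeholder per-row processing

cursor.close()
conn.close()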

Retribution answered 17/4, 2015 at 19:28 Comment(0)

EDIT: Using fetchmany() (along with fetchone() and fetchall()), even with a row limit (arraysize), will still send the entire result set to the client, keeping it client-side (stored in the underlying C library, I think libpq) for any additional fetchmany() calls, etc. Without using a named cursor (which would require an open transaction), you have to resort to using LIMIT in the SQL with an ORDER BY, then analyzing the results and augmenting the next query with WHERE (ordered_val = %(last_seen_val)s AND primary_key > %(last_seen_pk)s) OR ordered_val > %(last_seen_val)s
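
For illustration, here is a hedged sketch of that LIMIT/keyset approach, assuming an already-open cursor cur and placeholder table/column names (my_table, ordered_val, pk):

CHUNK = 5000
last_val, last_pk = None, None
while True:
    if last_val is None:
        # first page
        cur.execute(
            "SELECT pk, ordered_val FROM my_table "
            "ORDER BY ordered_val, pk LIMIT %(n)s", {'n': CHUNK})
    else:
        # subsequent pages start strictly after the last seen (val, pk) pair
        cur.execute(
            "SELECT pk, ordered_val FROM my_table "
            "WHERE (ordered_val = %(v)s AND pk > %(k)s) OR ordered_val > %(v)s "
            "ORDER BY ordered_val, pk LIMIT %(n)s",
            {'v': last_val, 'k': last_pk, 'n': CHUNK})
    rows = cur.fetchall()
    if not rows:
        break
    for pk, val in rows:
        pass  # do something with the row
    last_pk, last_val = rows[-1]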

This is misleading behavior from the library, to say the least, and there should be a blurb in the documentation about it. I don't know why it's not there.


I'm not sure a named cursor is a good fit unless there's a need to scroll forward/backward interactively. I could be wrong here.

The fetchmany loop is tedious but I think it's the best solution here. To make life easier, you can use the following:

from functools import partial
from itertools import chain

# chain.from_iterable flattens an iterable of chunks into one row iterator
# (available since Python 2.6)
from_iterable = chain.from_iterable

# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
    if parms is None:
        curs.execute(sql)
    else:
        curs.execute(sql, parms)
    # iter(callable, sentinel): call curs.fetchmany(chunksize) repeatedly
    # until it returns an empty list
    chunks_until_empty = iter(partial(curs.fetchmany, chunksize), [])
    return from_iterable(chunks_until_empty)

# example scenario
for row in run_and_iterate(cur, 'SELECT * FROM waffles_table WHERE num_waffles > %s', (10,)):
    print('lots of waffles: %s' % (row,))
Vauban answered 13/7, 2019 at 0:33 Comment(0)

As I was reading the comments and answers, I thought I should clarify something about fetchone and server-side cursors for future readers.

With normal cursors (client-side), psycopg2 fetches all the records returned by the backend, transferring them to the client process. All the records are buffered in the client's memory; this happens when you execute a query like curs.execute('SELECT * FROM ...').

This question also confirms that.

All the fetch* methods are there for accessing this stored data.

Q: So how can fetchone help us memory-wise?
A: It fetches only one record from the stored data and creates a single Python object to hand to your Python code, while fetchall will fetch and create n Python objects from this data and hand them to you all in one chunk.

So If your table has 1,000,000 records, this is what's going on in memory:

curs.execute --> whole 1,000,000 result set  +  fetchone --> 1 Python object
curs.execute --> whole 1,000,000 result set  +  fetchall --> 1,000,000 Python objects

Of course fetchone helps, but we still have the whole record set in client memory. This is where server-side cursors come into play:

PostgreSQL also has its own concept of cursor (sometimes also called portal). When a database cursor is created, the query is not necessarily completely processed: the server might be able to produce results only as they are needed. Only the results requested are transmitted to the client: if the query result is very large but the client only needs the first few records it is possible to transmit only them. ... their interface is the same, but behind the scene they send commands to control the state of the cursor on the server (for instance when fetching new records or when moving using scroll()).

So you won't get the whole result set in one chunk.
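
For illustration, a minimal named-cursor sketch of the picture above (the cursor name and table are placeholders):

curs = conn.cursor(name='million_row_cursor')  # named => server-side cursor
curs.execute('SELECT * FROM myTable')  # nothing is buffered client-side yet
row = curs.fetchone()  # transfers roughly one row; the rest stays on the server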

The drawback:

The downside is that the server needs to keep track of the partially processed results, so it uses more memory and resources on the server.

Dock answered 6/2, 2022 at 16:25 Comment(0)
