Python is slow when iterating over a large list
I am currently selecting a large list of rows from a database using pyodbc. The result is then copied into a large list, and I am trying to iterate over that list. Before I abandon Python and try to recreate this in C#, I wanted to know whether there is something I am doing wrong.

clientItemsCursor.execute("Select ids from largetable where year =?", year)
allIDRows = clientItemsCursor.fetchall()  # takes maybe 8 seconds

count = 0
for clientItemRow in allIDRows:
    aID = str(clientItemRow[0])
    # Do something with aID -- removed because I was trying to determine what was slow
    count = count + 1

Some more information:

  • The for loop is currently running at about 5 iterations per second, which seems insanely slow to me.
  • The total number of rows selected is ~489,000.
  • The machine it's running on has plenty of RAM and CPU. It seems to use only one or two cores, and RAM usage is 1.72 GB of 4 GB.

Can anyone tell me what's wrong? Do Python scripts just run this slowly?

Thanks

Barahona asked 22/2, 2012 at 19:59 Comment(4)
Is clientItemRow[0] really big? 489,000 is a lot of rows, but 5 rows/s is laughably slow. Also, someone can correct me if I'm wrong, but I'm pretty sure your code, as written, will only run on one core - it should still be miles faster than 5 iterations per second, though. Also, you can use the built-in cProfile module to see where you're hitting a bottleneck (see the sketch after these comments).Klimt
As for the CPU usage - if you are doing fancy things with the results you retrieve, you can get more CPU cores into the mix by using Python's multiprocessing module - but let's get this iteration sorted out first.Beadruby
what does type(allIDRows) return?Perkins
@prelic: I suspect there is something terribly wrong with whatever object is returned by the ODBC driver's fetchall - the only way it could be that slow is if each step of the iteration is O(N) or worse.Beadruby
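
For reference, here is a minimal sketch of the cProfile suggestion above; it assumes allIDRows already holds the fetchall() result from the question:

import cProfile

def iterate_rows(rows):
    # Count the rows the same way the question does, so the profile
    # output shows where the time is actually going.
    count = 0
    for row in rows:
        aID = str(row[0])
        count = count + 1
    return count

# Assumes allIDRows was filled by clientItemsCursor.fetchall() as in the question.
cProfile.run("iterate_rows(allIDRows)")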
18

This should not be slow with native Python lists - but maybe the ODBC driver is returning a "lazy" object that tries to be smart but just ends up slow. Try just doing

allIDRows = list(clientItemsCursor.fetchall())

in your code and post further benchmarks.

(Python lists can get slow if you start inserting things in the middle, but just iterating over a large list should be fast.)
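
For example, a quick before/after benchmark could look like this (a sketch, assuming an existing pyodbc connection named conn and a year variable as in the question):

import time

cursor = conn.cursor()  # conn is assumed to be an existing pyodbc connection
cursor.execute("Select ids from largetable where year =?", year)

start = time.time()
rows = list(cursor.fetchall())  # force a plain Python list
print("materialize: %.2fs" % (time.time() - start))

start = time.time()
count = 0
for row in rows:
    aID = str(row[0])
    count += 1
print("iterate %d rows: %.2fs" % (count, time.time() - start))

If the list() call is what makes the loop fast, the driver's fetchall() result is the thing to report upstream.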

Beadruby answered 22/2, 2012 at 20:5 Comment(4)
WOW. I went ahead and made this change, and the rate went to over 1,000 rows per second. This is amazing. Will be accepting this as the answer.Barahona
It would be a nice thing to find the web page for the pyodbc project and file this as a bug. If fetchall is not returning a list, it should be something more efficient than a list, not a showstopper.Beadruby
Great answer; it got me thinking: is a list the fastest way of storing collected data like this? Would a set, tuple, dict (with dummy values), etc. be any faster than a list?Ferial
By converting the result to a real list and iterating over that, I went from 2,000 rows/s (and slowing over time) to 9,000 rows/s with no slowdown. THANK YOU!!Goldfarb
2

It's probably slow because you load all the results into memory first and then iterate over that list. Try iterating over the cursor instead.

And no, scripts shouldn't be that slow.

clientItemsCursor.execute("Select ids from largetable where year =?", year)
count = 0
for clientItemRow in clientItemsCursor:
    aID = str(clientItemRow[0])
    count = count + 1
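
If holding all ~489,000 rows in memory at once is a concern, fetchmany() is a middle ground between fetchall() and row-at-a-time iteration; a sketch, assuming the same cursor and year as in the question:

clientItemsCursor.execute("Select ids from largetable where year =?", year)
count = 0
while True:
    batch = clientItemsCursor.fetchmany(10000)  # fetch up to 10,000 rows per call
    if not batch:
        break
    for clientItemRow in batch:
        aID = str(clientItemRow[0])
        count += 1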
Thermodynamic answered 22/2, 2012 at 20:4 Comment(2)
That is in the opposite direction of what I suggested - only testing will tell, and I hope the O.P. keeps us updated. Iterating over a list should not be slow anyway, but possibly for clientItemRow in cursor.fetchall(): ... as you suggest is better.Beadruby
I have tried this method, and it was not faster at all. It does match the pyODBC documentation, but it is still very slow to iterate over the rows.Barahona
1

More investigation is needed here... consider the following script:

bigList = range(500000)
doSomething = ""
count = 0
arrayList = [[x] for x in bigList]  # takes a few seconds
for x in arrayList:
    doSomething += str(x[0])
    count += 1

This is pretty much the same as your script, minus the database stuff, and it only takes a few seconds to run on my not-terribly-fast machine.
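
To put a number on it, the same loop can be timed with the standard time module (a sketch):

import time

bigList = range(500000)
arrayList = [[x] for x in bigList]

start = time.time()
count = 0
doSomething = ""
for x in arrayList:
    doSomething += str(x[0])
    count += 1
print("%d iterations in %.2fs" % (count, time.time() - start))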

Calico answered 22/2, 2012 at 20:8 Comment(0)
0

When you connect to your database directly (I mean at an SQL prompt), how many seconds does this query take?

When the query ends, you get a message like this:

NNNNN rows in set (0.01 sec)

So, if that time is large and your query is slow even when run "natively", you may have to create an index on that table.
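
If the query itself turns out to be the slow part, the index can also be created from Python (a sketch; idx_largetable_year is a hypothetical name and the exact DDL depends on your database):

cursor = conn.cursor()  # conn is assumed to be an existing pyodbc connection
# idx_largetable_year is a made-up name; CREATE INDEX syntax varies by database.
cursor.execute("CREATE INDEX idx_largetable_year ON largetable (year)")
conn.commit()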

Goodyear answered 22/2, 2012 at 20:41 Comment(0)
0

This is slow because you are

  1. Getting all the results
  2. Allocating memory and assigning the values to that memory to create the list allIDRows
  3. Iterating over that list and counting.

If execute gives you back a cursor, then use the cursor to its advantage and start counting as you get rows back, saving time on the memory allocation.

clientItemsCursor.execute("Select ids from largetable where year =?", year)
count = 0
for clientItemRow in clientItemsCursor:
    count += 1

Other hints:

  • create an index on year
  • use select count(*) from largetable where year = ? to get the count for the year; this will probably be optimised by the database (see the sketch after this list).
  • Remove the aID line if it's not needed; it converts the first item of the row to a string even though the result is never used.
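
A sketch of the count-on-the-database-side hint, assuming the same cursor and year as in the question:

clientItemsCursor.execute("Select count(*) from largetable where year =?", year)
count = clientItemsCursor.fetchone()[0]  # the database does the counting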
Hyperkeratosis answered 22/2, 2012 at 21:27 Comment(0)
