Does pandas iterrows have performance issues?
S

9

151

I have noticed very poor performance when using iterrows from pandas.

Is it specific to iterrows and should this function be avoided for data of a certain size (I'm working with 2-3 million rows)?

This discussion on GitHub led me to believe it is caused when mixing dtypes in the dataframe, however the simple example below shows it is there even when using one dtype (float64). This takes 36 seconds on my machine:

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})

start = time.time()
i=0
for rowindex, row in dfa.iterrows():
    i+=1
end = time.time()
print(end - start)

Why are vectorized operations like apply so much quicker? I imagine there must be some row by row iteration going on there too.

I cannot figure out how to not use iterrows in my case (this I'll save for a future question). Therefore I would appreciate hearing if you have consistently been able to avoid this iteration. I'm making calculations based on data in separate dataframes.

A simplified version of what I want to run:

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b'],
      'number1':[50,-10]}

t2 = {'letter':['a','a','b','b'],
      'number2':[0.2,0.5,0.1,0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Define optimization
def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow = calculation.index(max(calculation))
    return t2info.loc[maxrow, ['letter', 'number2']]

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter', 'number2'], index=table1.index)

#%% Iterate through filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index()
    table3.loc[row_index] = optimize(t2info, row['number1'])
Silence answered 21/7, 2014 at 17:19 Comment(6)
apply is NOT vectorized. iterrows is even worse as it boxes everything (that's the perf diff with apply). You should only use iterrows in very, very few situations. IMHO never. Show what you are actually doing with iterrows.Pole
The issue you linked to instead has to do with the boxing of a DatetimeIndex into Timestamps (it was implemented in Python space), and this has been much improved in master.Pole
See this issue for a more full discussion: github.com/pydata/pandas/issues/7194.Pole
Link to the specific question (this one will stay general): stackoverflow.com/questions/24875096/…Silence
Please do not recommend the use of iterrows(). It is a blatant enabler of the worst anti-pattern in the history of pandas.Renatarenate
See also plots of .iterrows() performance in this answer here and my answer here.Psoas
P
266

Generally, iterrows should only be used in very, very specific cases. This is the general order of precedence for performance of various operations:

  1. vectorization
  2. using a custom Cython routine
  3. apply
    • reductions that can be performed in Cython
    • iteration in Python space
  4. itertuples
  5. iterrows
  6. updating an empty frame (e.g., using loc one-row-at-a-time)

Using a custom Cython routine is usually too complicated, so let's skip that for now.

  1. Vectorization is always, always the first and best choice. However, there is a small set of cases (usually involving a recurrence) which cannot be vectorized in obvious ways. Furthermore, on a smallish DataFrame, it may be faster to use other methods.

  3. apply usually can be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will be executed pretty swiftly, though of course, df.sum(1) is even better. However, something like df.apply(lambda x: x['b'] + 1) will be executed in Python space, and consequently is much slower.

  4. itertuples does not box the data into a Series. It just returns the data in the form of tuples.

  5. iterrows does box the data into a Series. Unless you really need this, use another method.

  6. Updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so this will always be very slow to update a row at a time. Much better to create new structures and concat (see the sketch below).
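
To make point 6 concrete, here is a minimal sketch (not from the original answer) contrasting row-at-a-time growth with building the data first and constructing the frame once; the column name and sizes are made up for illustration:

import numpy as np
import pandas as pd

values = np.random.randn(10_000)  # hypothetical data to accumulate

# Slow pattern: grow an (initially empty) DataFrame one row at a time.
slow = pd.DataFrame(columns=['x'])
for i, v in enumerate(values):
    slow.loc[i] = v  # index checks and reallocation on every assignment

# Faster pattern: build a plain Python/NumPy structure first, then
# construct the DataFrame (or pd.concat a list of frames) once.
fast = pd.DataFrame({'x': values})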

Pole answered 21/7, 2014 at 17:39 Comment(7)
Yes, I used number 6 (and 5). I've got some learning to do. It seems like the obvious choice to a relative beginner.Silence
In my experience, the difference between 3, 4, and 5 is limited and depends on the use case.Ahrendt
I've tried to check the runtimes in this notebook. Somehow itertuples is faster than apply :(Estes
pd.DataFrame.apply is often slower than itertuples. In addition, it's worth considering list comprehensions, map, the poorly named np.vectorize and numba (in no particular order) for non-vectorisable calculations, e.g. see this answer.Delimitate
@Jeff, out of curiosity, why have you not added list comprehensions here? While it is true that they do not handle index alignment or missing data (unless you use a function with a try-catch), they are good for a lot of use cases (string/regex stuff) where pandas methods do not have vectorized (in the truest sense of the word) implementations. Do you think it is worth mentioning LCs are a faster, lower overhead alternative to pandas apply and many pandas string functions?Renatarenate
Yeah, in my experience itertuples is faster than apply. I've sped up looping a dictionary lookup for all rows in a dataframe 1390X just by changing an apply to an itertuples(name=None). I was quite shocked by it frankly. Seems to be the same finding here: medium.com/swlh/… and here medium.com/swlh/…Cashandcarry
You may be interested in some plots I made in my answer here: How to iterate over rows in a DataFrame in Pandas. I go over 13 ways to iterate over Pandas DataFrames, many of which are done without iterating. Based on my results, I'd say, however, these are the best approaches, in this order of best first: 1. vectorization, 2. list comprehension, 3. .itertuples(), 4. .apply(), 5. raw for loop, 6. .iterrows(). I didn't test Cython.Psoas
K
23

Vector operations in Numpy and pandas are much faster than scalar operations in vanilla Python for several reasons:

  • Amortized type lookup: Python is a dynamically typed language, so there is runtime overhead for each element in an array. However, Numpy (and thus pandas) perform calculations in C (often via Cython). The type of the array is determined only at the start of the iteration; this savings alone is one of the biggest wins.

  • Better caching: Iterating over a C array is cache-friendly and thus very fast. A pandas DataFrame is a "column-oriented table", which means that each column is really just an array. So the native actions you can perform on a DataFrame (like summing all the elements in a column) are going to have few cache misses.

  • More opportunities for parallelism: A simple C array can be operated on via SIMD instructions. Some parts of Numpy enable SIMD, depending on your CPU and installation process. The benefits to parallelism won't be as dramatic as the static typing and better caching, but they're still a solid win.

Moral of the story: use the vector operations in Numpy and pandas. They are faster than scalar operations in Python for the simple reason that these operations are exactly what a C programmer would have written by hand anyway. (Except that the array notation is much easier to read than explicit loops with embedded SIMD instructions.)
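
As a rough illustration of the first two points (my own example, nothing pandas-specific), compare summing squares with a Python-level generator against a single NumPy call on the same array:

import timeit
import numpy as np

a = np.random.randn(1_000_000)

# Scalar path: every element is boxed into a Python float and re-type-checked.
def python_sum():
    return sum(x * x for x in a)

# Vector path: one typed C loop over a contiguous array.
def numpy_sum():
    return np.dot(a, a)

print(timeit.timeit(python_sum, number=3))
print(timeit.timeit(numpy_sum, number=3))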

Kironde answered 21/7, 2014 at 17:41 Comment(0)
P
15

Here's the way to solve your problem. This is all vectorized.

In [58]: df = table1.merge(table2,on='letter')

In [59]: df['calc'] = df['number1']*df['number2']

In [60]: df
Out[60]: 
  letter  number1  number2  calc
0      a       50      0.2    10
1      a       50      0.5    25
2      b      -10      0.1    -1
3      b      -10      0.4    -4

In [61]: df.groupby('letter')['calc'].max()
Out[61]: 
letter
a         25
b         -1
Name: calc, dtype: float64

In [62]: df.groupby('letter')['calc'].idxmax()
Out[62]: 
letter
a         1
b         2
Name: calc, dtype: int64

In [63]: df.loc[df.groupby('letter')['calc'].idxmax()]
Out[63]: 
  letter  number1  number2  calc
1      a       50      0.5    25
2      b      -10      0.1    -1
Pole answered 21/7, 2014 at 17:55 Comment(2)
Very clear answer, thanks. I will try merging, but I have doubts as I will then have 5 billion rows (2.5 million * 2000). In order to keep this Q general I've created a specific Q. I'd be happy to see an alternative to avoid this giant table, if you know of one, here: stackoverflow.com/questions/24875096/…Silence
This does not create the Cartesian product - it is a compressed space and is pretty memory efficient. What you are doing is a very standard problem. Give it a try. (Your linked question has a very similar solution.)Pole
R
13

DO NOT use iterrows!

...Or iteritems, or itertuples. Seriously, don't. Wherever possible, seek to vectorize your code. If you don't believe me, ask Jeff.

I will concede that there are legitimate use cases for iterating over a DataFrame, but there are far better alternatives for those cases than the iter* family of functions.

Far too often, beginners to pandas ask questions involving code that has something to do with iterrows. Since these new users are likely not familiar with the concept of vectorization, they envision the code that solves their problem as something involving loops or other iterative routines. Not knowing how to iterate properly either, they usually end up at this question and learn all the wrong things.


Supporting Arguments

The documentation page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

If that doesn't convince you, take a look at the performance comparison between vectorized and non-vectorized techniques for adding two columns "A + B", taken from my post here.

Benchmarking code, for your reference. iterrows is by far the worst of the lot, and it's also worth pointing out that the other iterative methods aren't much better either.
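
For reference, here is a minimal sketch of that kind of comparison (my own illustration, not the linked benchmark), assuming two float columns A and B:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 2), columns=['A', 'B'])

# Vectorized: a single call executed in compiled code.
vectorized = df['A'] + df['B']

# iterrows: every row is boxed into a Series before the addition happens.
summed = []
for _, row in df.iterrows():
    summed.append(row['A'] + row['B'])
looped = pd.Series(summed, index=df.index)

assert np.allclose(vectorized, looped)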

The line at the bottom measures a function written in numpandas, a style of Pandas that mixes heavily with NumPy to squeeze out maximum performance. Writing numpandas code should be avoided unless you know what you're doing. Stick to the API where you can (i.e., prefer vec over vec_numpy).


In Conclusion

Always seek to vectorize. Sometimes, based on the nature of your problem or data, this is not possible, so look for better iterative routines than iterrows. There is almost never a legitimate use case for iterrows beyond convenience when dealing with an extremely small number of rows; otherwise, be prepared for a lot of waiting while your code potentially runs for hours.

Check out the linked posts to determine the best method/vectorized routine for your problem.

Renatarenate answered 18/12, 2020 at 11:23 Comment(0)
W
6

Another option is to use to_records(), which is faster than both itertuples and iterrows.

But for your case, there is much room for other types of improvements.

Here's my final optimized version

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        # np.multiply is in general faster than "x * y"
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

Benchmark test:

-- iterrows() --
100 loops, best of 3: 12.7 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

-- itertuple() --
100 loops, best of 3: 12.3 ms per loop

-- to_records() --
100 loops, best of 3: 7.29 ms per loop

-- Use group by --
100 loops, best of 3: 4.07 ms per loop
  letter  number2
1      a      0.5
2      b      0.1
4      c      5.0
5      d      4.0

-- Avoid multiplication --
1000 loops, best of 3: 1.39 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

Full code:

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b','c','d'],
      'number1':[50,-10,.5,3]}

t2 = {'letter':['a','a','b','b','c','d','c'],
      'number2':[0.2,0.5,0.1,0.4,5,4,1]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=table1.index)


print('\n-- iterrows() --')

def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow_in_t2 = calculation.index(max(calculation))
    return t2info.loc[maxrow_in_t2]

#%% Iterate through filtering relevant data, optimizing, returning info
def iterthrough():
    for row_index, row in table1.iterrows():   
        t2info = table2[table2.letter == row['letter']].reset_index()
        table3.iloc[row_index,:] = optimize(t2info, row['number1'])

%timeit iterthrough()
print(table3)

print('\n-- itertuple() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.itertuples():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.itertuples():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()


print('\n-- to_records() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.to_records():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.to_records():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()

print('\n-- Use group by --')

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    for index, letter, n1 in table1.to_records():
        t2 = table2.iloc[grouped.groups[letter]]
        calculation = t2.number2 * n1
        maxrow = calculation.argsort().iloc[-1]
        ret.append(t2.iloc[maxrow])
    global table3
    table3 = pd.DataFrame(ret)

%timeit iterthrough()
print(table3)

print('\n-- Even Faster --')
def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

%timeit iterthrough()
print(table3)

The final version is almost 10x faster than the original code. The strategy is:

  1. Use groupby to avoid repeated comparison of values.
  2. Use to_records to access raw NumPy record arrays (numpy.recarray), as shown in the sketch below.
  3. Don't operate on the DataFrame until you have compiled all the data.
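
For anyone unfamiliar with to_records, here is a small illustration (mine, not part of the benchmark) of what it returns and why tuple-style unpacking works in the loops above:

import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b'], 'number1': [50, -10]})

rec = df.to_records()            # numpy.recarray; the index comes along as the first field
print(rec.dtype.names)           # ('index', 'letter', 'number1')
print(rec.letter, rec.number1)   # column access without constructing pandas Series objects

for index, letter, n1 in rec:    # plain tuple unpacking, as in the functions above
    print(index, letter, n1)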
Witter answered 15/8, 2017 at 14:42 Comment(0)
T
4

Details in this video

Benchmark plot (image not reproduced here).

Theodoratheodore answered 1/6, 2020 at 16:26 Comment(0)
S
2

If you really need to iterate it and access row fields by name, just save the column names to a list and convert the dataframe to a NumPy array:

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})
columns = list(dfa.columns)
dfa = dfa.values
start = time.time()
i=0
for row in dfa:
    blablabla = row[columns.index('s1')]
    i+=1
end = time.time()
print (end - start)

0.9485495090484619
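
If the repeated columns.index(...) lookup inside the loop bothers you, it can be hoisted out. A small variation on the same idea (the variable names here are mine):

s1_pos = columns.index('s1')  # resolve the column position once, outside the loop
total = 0.0
for row in dfa:
    total += row[s1_pos]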

Supersaturated answered 15/5, 2021 at 14:55 Comment(1)
This solution is great as it’s very easy to replace an existing .iterrows() loop with this and it’s many times faster without using up all memory & crashing. Is anyone aware of drawbacks or limitations to this method? So far all I’ve found is that dfa.values will automatically choose a dtype that’s compatible with all the columns’ dtypes and convert all the data to that single dtype, which AFAIK isn’t optimal for your DF but usually won’t break anything. NOTE: pandas.pydata.org/docs/reference/api/… recommends using .to_numpy() instead of .values.Misconstruction
L
1

Yes, Pandas itertuples() is faster than iterrows(). You can refer to the documentation: pandas.DataFrame.iterrows

To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.
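
A minimal illustration (my own, not from the documentation) of iterating with named fields:

import pandas as pd

df = pd.DataFrame({'s1': [1.0, 2.0], 's2': [10, 20]})

for row in df.itertuples(index=False):
    # each row is a namedtuple; fields are accessed by name and keep their
    # original dtypes (no upcasting across columns, unlike iterrows)
    print(row.s1, row.s2)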

Luting answered 13/4, 2019 at 19:40 Comment(0)
D
0

Despite the other answers suggesting not to use iterrows, here is another option: dump the DataFrame to a (temporary) CSV file and re-load it before calling iterrows. In my experiments this bumped the speed by over 100x for 6-7 million rows of data.

import glob
import pandas as pd
from tqdm import tqdm

# The three lines below are the reason why `iterrows` becomes slow...
all_df = ...
for path in glob.glob("..."):
    all_df = pd.concat([all_df, pd.read_csv(path)])

# This is my solution: just dump it to a file and read it back
all_df.to_csv("tmp.csv")
all_df = pd.read_csv("tmp.csv")

# because I like tqdm, I have to use iterrows..
for index, row in tqdm(all_df.iterrows(), total=all_df.shape[0]):
    # do something
    ...

    # set something
    all_df.at[index, ..] = ...

P.S.: I'm not sure if there is a better solution.

Debase answered 2/1 at 21:53 Comment(1)
I believe this is because resetting the index speeds up the iterrows(). When you save to file and reload, the reload resets the index for you.Schwab
