pandas df.loc[z,x]=y how to improve speed?
I have identified one pandas command

timeseries.loc[z, x] = y

to be responsible for most of the time spent in an iteration. Now I am looking for better approaches to accelerate it. The loop covers fewer than 50k elements (and the production goal is ~250k or more), but already needs a sad 20 seconds.

Here is my code (ignore the top half, it is just the timing helper)

import time

def populateTimeseriesTable(df, observable, timeseries):
    """
    Go through all rows of df and 
    put the observable into the timeseries 
    at correct row (symbol), column (tsMean).
    """

    print "len(df.index)=", len(df.index)  # show number of rows

    global bf, t
    bf = time.time()                       # set 'before' to now
    t = dict([(i,0) for i in range(5)])    # fill category timing with zeros

    def T(i):
        """
        timing helper: Add passed time to category 'i'. Then set 'before' to now.
        """
        global bf, t 
        t[i] = t[i] + (time.time()-bf)
        bf = time.time()        

    for i in df.index:             # this is the slow loop
        bf = time.time()

        sym = df["symbol"][i]
        T(0)

        tsMean = df["tsMean"][i]
        T(1)

        tsMean = tsFormatter(tsMean)
        T(2)

        o = df[observable][i]
        T(3)

        timeseries.loc[sym, tsMean] = o
        T(4)

    from pprint import pprint
    print "times needed (total = %.1f seconds) for each command:" % sum(t.values())
    pprint (t)

    return timeseries

using this helper (not important, not slow):

def tsFormatter(ts):
    "as human readable string, only up to whole seconds"
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))

The code to be optimized is in the for-loop.

(T and t are just a helper function and dict for the timing.)

I have timed every step. The vast majority of time:

len(df.index)= 47160
times needed (total = 20.2 seconds) for each command:
{0: 1.102,
 1: 0.741,
 2: 0.243,
 3: 0.792,
 4: 17.371}

is spent in the last step

timeseries.loc[sym, tsMean] = o

I have already downloaded and installed PyPy, but sadly that doesn't support pandas yet.

Any ideas how to speed up populating a 2D array?

Thanks!


Edit: Sorry, hadn't mentioned - 'timeseries' is a dataframe too:

timeseries = pd.DataFrame({"name": titles}, index=index)
Nonmaterial answered 10/6, 2016 at 22:14 Comment(3)
I don't know what kind of object timeseries is. But if it has a .loc method, it may have a .at method. If you are assigning at a specific location, .at should be quicker. – Malamut
Edit: Sorry, hadn't mentioned: timeseries is a dataframe: timeseries = pd.DataFrame({"name": titles}, index=index). I have added that to the OP now. – Nonmaterial
I will be looking into that .at function. Thanks a lot, @Malamut – Nonmaterial
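The .at suggestion from the comments can be sketched as a drop-in change to the question's loop. The small frames below are hypothetical stand-ins for the real df and timeseries described above (symbols, timestamps, and the "price" observable are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames.
df = pd.DataFrame({
    "symbol": ["AAA", "BBB", "AAA"],
    "tsMean": ["2016-06-10 22:00:00", "2016-06-10 22:00:00", "2016-06-10 23:00:00"],
    "price":  [1.0, 2.0, 3.0],
})
timeseries = pd.DataFrame(
    index=["AAA", "BBB"],
    columns=["2016-06-10 22:00:00", "2016-06-10 23:00:00"],
)

observable = "price"
for i in df.index:
    # .at does scalar label-based access only, skipping the general
    # indexing machinery that .loc has to run on every assignment.
    timeseries.at[df.at[i, "symbol"], df.at[i, "tsMean"]] = df.at[i, observable]
```

If every (symbol, tsMean) pair occurs at most once, a fully vectorized df.pivot(index="symbol", columns="tsMean", values=observable) would avoid the Python loop entirely.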

UPDATE: starting from pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.

=====================================================================

@jezrael has provided an interesting comparison, and I decided to repeat it using more indexing methods against a 10M-row DataFrame (actually, the size doesn't matter in this particular case):

setup:

In [15]: df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('abcde'))

In [16]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 5 columns):
a    float64
b    float64
c    float64
d    float64
e    float64
dtypes: float64(5)
memory usage: 381.5 MB

In [17]: df.shape
Out[17]: (10000000, 5)

Timing:

In [37]: %timeit df.loc[random.randint(0, 10**7 - 1), 'b']
1000 loops, best of 3: 502 µs per loop

In [38]: %timeit df.iloc[random.randint(0, 10**7 - 1), 1]
1000 loops, best of 3: 394 µs per loop

In [39]: %timeit df.at[random.randint(0, 10**7 - 1), 'b']
10000 loops, best of 3: 66.8 µs per loop

In [41]: %timeit df.iat[random.randint(0, 10**7 - 1), 1]
10000 loops, best of 3: 32.9 µs per loop

In [42]: %timeit df.ix[random.randint(0, 10**7 - 1), 'b']
10000 loops, best of 3: 64.8 µs per loop

In [43]: %timeit df.ix[random.randint(0, 10**7 - 1), 1]
1000 loops, best of 3: 503 µs per loop

Results as a bar plot:

[bar plot of the timings above: at, iat and ix_label are roughly an order of magnitude faster than loc, iloc and ix_integer]

Timing data as DF:

In [88]: r
Out[88]:
       method  timing
0         loc   502.0
1        iloc   394.0
2          at    66.8
3         iat    32.9
4    ix_label    64.8
5  ix_integer   503.0

In [89]: r.to_dict()
Out[89]:
{'method': {0: 'loc',
  1: 'iloc',
  2: 'at',
  3: 'iat',
  4: 'ix_label',
  5: 'ix_integer'},
 'timing': {0: 502.0,
  1: 394.0,
  2: 66.799999999999997,
  3: 32.899999999999999,
  4: 64.799999999999997,
  5: 503.0}}

Plotting

import seaborn as sns

ax = sns.barplot(data=r, x='method', y='timing')
ax.tick_params(labelsize=16)
[ax.annotate(str(round(p.get_height(),2)), (p.get_x() + 0.2, p.get_height() + 5)) for p in ax.patches]
ax.set_xlabel('indexing method', size=20)
ax.set_ylabel('timing (microseconds)', size=20)
Marxmarxian answered 11/6, 2016 at 7:58 Comment(3)
wow wow wow. Thank you so much. This is truly enlightening. So my scepticism about such slowness was justified indeed. I had chosen perhaps the most comfortable solution, using .loc, but I bought it with a 7.5- to 15.3-fold time penalty compared to .at and .iat. Thanks a lot, this is very valuable information! – Nonmaterial
Done. Thx! Much appreciated. – Nonmaterial
Unfortunately, I cannot accept BOTH answers. Hmmmm ... yours has this really easy-to-read histogram output. But @jezrael was first to answer, and his answer is very good too. What to do now? – Nonmaterial

I always thought at was the fastest, but it is not; ix is faster:

import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})

print (df)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

print (df.at[2, 'B'])
6
print (df.ix[2, 'B'])
6
print (df.loc[2, 'B'])
6

In [77]: %timeit df.at[2, 'B']
10000 loops, best of 3: 44.6 µs per loop

In [78]: %timeit df.ix[2, 'B']
10000 loops, best of 3: 40.7 µs per loop

In [79]: %timeit df.loc[2, 'B']
1000 loops, best of 3: 681 µs per loop

EDIT:

I tried MaxU's DataFrame, and the differences are caused by the random.randint function:

df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('ABCDE'))


In [4]: %timeit (df.ix[2, 'B'])
The slowest run took 25.80 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 20.7 µs per loop

In [5]: %timeit (df.ix[random.randint(0, 10**7 - 1), 'B'])
The slowest run took 9.42 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 28 µs per loop
Gourmandise answered 11/6, 2016 at 5:25 Comment(13)
Very interesting. Thank you very much! Now I am wondering how to explain the differences between your results, jezrael, and those which @MaxU has reported here: https://mcmap.net/q/548187/-pandas-df-loc-z-x-y-how-to-improve-speed – Nonmaterial
Is it perhaps that his df is 10M rows and yours is small? So perhaps .ix is the fastest for small dataframes, and .at / .iat the fastest for huge ones? – Nonmaterial
I think timings are best for comparing methods, and on each PC they are a little different. These timings do not depend on the length of the dataframe. iat is the fastest, but you need the position of the column; if you only know the column name and not its position, you cannot use iat, and then ix is the fastest approach. – Gourmandise
The best explanation is in the docs. – Gourmandise
And if you use anaconda, you can check the timings on your PC too. – Gourmandise
One idea: the difference between my results and MaxU's is because I use the position as an int (2) while MaxU generates a random int, so generating the number takes some time and the selection another; together the time is a bit higher than mine (I don't generate a random position). – Gourmandise
I edited the solution and added timings with and without generating the int. – Gourmandise
ahhh, yes. That explains it, indeed. Thank you very much, jezrael. I hope I can accept both of your solutions as answers to this question. – Nonmaterial
Oh no, I cannot. What a pity. – Nonmaterial
Yes, only one can be accepted. And it is up to you. Nice day. – Gourmandise
As both your answers are equally good ... I have thrown a thousand coins: sum([random.random() for _ in range(1000)])/1000 and that resulted in 0.4972785257694664 - so MaxU has won, by random choice :-) Nice day to you too! – Nonmaterial
Ok. Maybe next time I will be the winner. ;-) – Gourmandise
@AltSheets, I like your approach! :) – Marxmarxian

If you are adding rows inside a loop, consider these performance issues: for roughly the first 1000 to 2000 records, my_df.loc performs well, but it gradually becomes slower as the number of records in the loop increases.

If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of iloc and append: fill a temp dataframe with iloc until its size reaches around 1000, then append it to the original dataframe and empty the temp dataframe. This can boost your performance around 10 times.
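A minimal sketch of the buffering idea above, under two assumptions: DataFrame.append was removed in pandas 2.0, so pd.concat does the flushing here, and the chunks are collected and concatenated once at the end, which is cheaper still than repeatedly appending to the main frame. The chunk size of 1000 and the 5000-row workload are illustrative:

```python
import pandas as pd

CHUNK = 1000          # flush threshold suggested in the answer

chunks = []           # completed chunk frames
buffer = []           # cheap row buffer: a list of dicts stands in for the temp frame

for i in range(5000): # hypothetical workload
    buffer.append({"a": i, "b": 2 * i})
    if len(buffer) >= CHUNK:
        chunks.append(pd.DataFrame(buffer))  # one bulk frame build per 1000 rows
        buffer = []

if buffer:            # flush any remainder
    chunks.append(pd.DataFrame(buffer))

main = pd.concat(chunks, ignore_index=True)  # single concatenation at the end
```

The key point is the same as the answer's: per-row .loc assignment into a growing frame is O(n) per insert, while building rows in a plain Python buffer and materializing them in bulk amortizes that cost.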

Arnulfo answered 23/4, 2018 at 5:37 Comment(0)

Update for newer pandas versions

pandas 2.2.2 on python 3.12.1

In [1]: import random; import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame(np.random.rand(10**7, 5), columns=list("abcde"))

In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 5 columns):
 #   Column  Dtype  
---  ------  -----  
 0   a       float64
 1   b       float64
 2   c       float64
 3   d       float64
 4   e       float64
dtypes: float64(5)
memory usage: 381.5 MB

In [4]: %timeit df.loc[random.randint(0, 10**7 - 1), "b"]
8.17 µs ± 130 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [5]: %timeit df.iloc[random.randint(0, 10**7 - 1), 1]
18.7 µs ± 285 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [6]: %timeit df.at[random.randint(0, 10**7 - 1), "b"]
3.91 µs ± 251 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [7]: %timeit df.iat[random.randint(0, 10**7 - 1), 1]
14.9 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [8]: %timeit df.values[random.randint(0, 10**7 - 1), 1]
1.7 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

So, direct access to the underlying .values NumPy array is still more than twice as fast as anything else, and, surprisingly, when you don't use this direct access, accessing by labels is faster than by integer positions.
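One caveat to the .values timing above, as a sketch (not from the answer): for a loop, both the array and the column position should be hoisted out, so neither the .values call nor the label lookup is paid on every access. The tiny frame here is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical homogeneous float frame.
df = pd.DataFrame(np.arange(10.0).reshape(5, 2), columns=["a", "b"])

arr = df.values                # hoisted once; calling df.values per access adds overhead
col = df.columns.get_loc("b")  # map the column label to a position once

for row in range(len(df)):
    # plain NumPy scalar access, equivalent to df.at[row, "b"] for this frame
    assert arr[row, col] == df.at[row, "b"]
```

For a single-dtype frame like this, .values typically returns a view of the underlying block, so reads (and writes) go straight to the data; a mixed-dtype frame would instead get a fresh converted copy on every .values call.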

Bedelia answered 29/4, 2024 at 7:21 Comment(0)
