Pandas: Exponentially decaying sum with variable weights

Similar to this question, Exponential Decay on Python Pandas DataFrame, I would like to quickly compute exponentially decaying sums for some columns in a data frame. However, the rows in the data frame are not evenly spaced in time. So while exponential_sum[i] = column_to_sum[i] + np.exp(-const*(time[i]-time[i-1])) * exponential_sum[i-1] still holds, the weight np.exp(...) does not factor out, and it is not obvious to me how to adapt the solution from that question while still taking advantage of pandas/numpy vectorization. Is there a vectorized pandas solution to this problem?

To illustrate the desired calculation, here is a sample frame with the exponential moving sum of A stored in Sum using a decay constant of 1:

    time  A       Sum
0   1.00  1  1.000000
1   2.10  3  3.332871
2   2.13 -1  2.234370
3   3.70  7  7.464850
4  10.00  2  2.013708
5  10.20  1  2.648684
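
For reference, here is a rough sketch of the plain row-by-row loop that reproduces the Sum column above; this is exactly the kind of non-vectorized calculation I would like to avoid on large frames:

import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 2.1, 2.13, 3.7, 10, 10.2],
                   'A': [1, 3, -1, 7, 2, 1]})

const = 1
sums = [float(df.A.iloc[0])]
for i in range(1, len(df)):
    # Decay the running total by the elapsed time, then add the new value.
    weight = np.exp(-const * (df.time.iloc[i] - df.time.iloc[i - 1]))
    sums.append(df.A.iloc[i] + weight * sums[-1])
df['Sum'] = sums
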
Maulmain asked 23/10, 2015 at 3:38 Comment(4)
can you resample your dataframe so that it is evenly spaced?Procurer
@Alexander I am asking about sums, not averages, though maybe there is an obvious transformMaulmain
@Alexander I just read that question more carefully and I don't think it addresses my question, which is how to do the vectorized numpy/pandas calculation. I don't have any issue computing the exponential sums in a Python loop; I'm just doing it on sufficiently large frames that being able to vectorize the calculation matters.Maulmain
Can you please provide some sample data?Impassible

This question is more complicated than it first appeared. I ended up using numba's jit to compile a generator function to calculate the exponential sums. My end result calculates the exponential sum of 5 million rows in under a second on my computer, which hopefully is fast enough for your needs.

# Initial dataframe.
df = pd.DataFrame({'time': [1, 2.1, 2.13, 3.7, 10, 10.2], 
                   'A': [1, 3, -1, 7, 2, 1]})

# Initial decay parameter.
decay_constant = 1

We can define the decay weights as exp(-time_delta * decay_constant), and set the first weight equal to one:

df['weight'] = np.exp(-df.time.diff() * decay_constant)
df.weight.iat[0] = 1

>>> df
   A   time    weight
0  1   1.00  1.000000
1  3   2.10  0.332871
2 -1   2.13  0.970446
3  7   3.70  0.208045
4  2  10.00  0.001836
5  1  10.20  0.818731

Now we'll use jit from numba to optimize a generator function that calculates the exponential sums:

from numba import jit

@jit(nopython=True)
def exponential_sum(A, k):
    total = A[0]
    yield total
    for i in range(1, len(A)):  # (xrange in Python 2)
        total = total * k[i] + A[i]
        yield total

We'll use the generator to add the values to the dataframe:

df['expSum'] = list(exponential_sum(df.A.values, df.weight.values))

Which produces the desired output:

>>> df
   A   time    weight    expSum
0  1   1.00  1.000000  1.000000
1  3   2.10  0.332871  3.332871
2 -1   2.13  0.970446  2.234370
3  7   3.70  0.208045  7.464850
4  2  10.00  0.001836  2.013708
5  1  10.20  0.818731  2.648684

So let's scale to 5 million rows and check performance:

df = pd.DataFrame({'time': np.random.rand(int(5e6)).cumsum(),
                   'A': np.random.randint(1, 10, int(5e6))})
df['weight'] = np.exp(-df.time.diff() * decay_constant)
df.weight.iat[0] = 1

%%timeit -n 10 
df['expSum'] = list(exponential_sum(df.A.values, df.weight.values))
10 loops, best of 3: 726 ms per loop
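
As a variation (only a sketch, not part of the timing above), the same recurrence can be written to fill a preallocated NumPy array inside the jitted function, which avoids materializing an intermediate Python list:

import numpy as np
from numba import jit

@jit(nopython=True)
def exponential_sum_arr(A, k):
    # Same recurrence as the generator, but writing into a float array.
    out = np.empty(len(A))
    out[0] = A[0]
    for i in range(1, len(A)):
        out[i] = out[i - 1] * k[i] + A[i]
    return out

# df['expSum'] = exponential_sum_arr(df.A.values, df.weight.values)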
Impassible answered 24/10, 2015 at 3:49 Comment(1)
I was using Cython for a similar solution, but had been hoping there was a clever use of numpy/scipy that I was missing. It seems the consensus is no. A variation of this answer seems to be the best you can do.Maulmain

Expanding on the answer you linked to, I came up with the following method.

First, notice that:

exponential_sum[i] = column_to_sum[i] + 
    np.exp(-const*(time[i]-time[i-1])) * column_to_sum[i-1] + 
    np.exp(-const*(time[i]-time[i-2])) * column_to_sum[i-2] + ...

So the main change to make is in generating the weightspace to match the formula above. I proceeded like this:

time = pd.Series(np.random.rand(10)).cumsum()
weightspace = np.empty((10,10))
for i in range(len(time)):
    weightspace[i] = time - time[i]
weightspace = np.exp(weightspace)

Don't worry about the upper right triangle of the matrix; it won't be used. By the way, the weightspace can also be generated without a loop; see the sketch below.
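
For example, a loop-free sketch using NumPy broadcasting that builds the same weightspace:

t = time.values
# weightspace[i, j] = exp(time[j] - time[i]) for every pair (i, j).
weightspace = np.exp(t[np.newaxis, :] - t[:, np.newaxis])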

Then a slight change in how you pick the weights from the weightspace in the rolling function:

def rollingsum(array):
    # Weights for the expanding window ending at index len(array) - 1.
    weights = weightspace[len(array)-1][:len(array)]
    # Dot product of the values seen so far with their decay weights.
    return np.dot(array, weights)

Works as expected:

dataset = pd.DataFrame(np.random.rand(10,3), columns=["A", "B","C"])
a = pd.expanding_apply(dataset, rollingsum)
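
A side note for newer pandas versions (an assumption about the current API, not part of the original answer): where pd.expanding_apply is no longer available, the equivalent call would be the expanding-window method on the frame itself:

# raw=True passes each expanding window to rollingsum as a NumPy array,
# matching what pd.expanding_apply did.
a = dataset.expanding().apply(rollingsum, raw=True)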
Brumby answered 23/10, 2015 at 8:4 Comment(2)
One concern about this solution is that weightspace is now very big: in the solution for the evenly spaced case it was linear in the size of the data frame, and now it's quadratic. This makes it problematic for large frames, and large frames are why the vectorized solution is needed. Is that unavoidable?Maulmain
Short of an optimized for loop like @Impassible suggested, I'm afraid I don't see another way.Brumby
