How to aggregate timeseries in Python?
Asked Answered
C

1

9

I have two different timeseries with partially overlapping timestamps:

import scikits.timeseries as ts
from datetime import datetime 
a = ts.time_series([1,2,3], dates=[datetime(2010,10,20), datetime(2010,10,21), datetime(2010,10,23)], freq='D')
b = ts.time_series([4,5,6], dates=[datetime(2010,10,20), datetime(2010,10,22), datetime(2010,10,23)], freq='D')

which represents following data:

Day:   20. 21. 22. 23.
  a:    1   2   -   3
  b:    4   -   5   6

I would like to calculate a weighted average on every day with coefficients a(0.3) and b(0.7), while ignoring missing values:

Day 20.: (0.3 * 1 + 0.7 * 4) / (0.3 + 0.7) = 3.1 / 1.  = 3.1
Day 21.: (0.3 * 2          ) / (0.3      ) = 0.6 / 0.3 = 2
Day 22.: (          0.7 * 5) / (      0.7) = 3.5 / 0.7 = 5
Day 23.: (0.3 * 3 + 0.7 * 6) / (0.3 + 0.7) = 3.1 / 1.  = 5.1

when I first try to align these timeseries:

a1, b1 = ts.aligned(a, b)

I get correctly masked timeseries:

timeseries([1 2 -- 3],
  dates = [20-Oct-2010 ... 23-Oct-2010],
  freq  = D)

timeseries([4 -- 5 6],
  dates = [20-Oct-2010 ... 23-Oct-2010],
  freq  = D)

but when I do a1 * 0.3 + b1 * 0.7, it ignores values, that are present in one timeseries only:

timeseries([3.1 -- -- 5.1],
   dates = [20-Oct-2010 ... 23-Oct-2010],
   freq  = D)

What should I do to receive the awaited?

timeseries([3.1 2. 5. 5.1],
   dates = [20-Oct-2010 ... 23-Oct-2010],
   freq  = D)

EDIT: The answer should be applicable also to more than two initial timeseries with different weights and differently missing values.

So if we have four timeseries with weights T1(0.1), T2(0.2), T3(0.3) and T4(0.4), their weights at a given timestamp will be:

            |  T1 |  T2 |  T3 |  T4 |
weight      | 0.1 | 0.2 | 0.3 | 0.4 |
-------------------------------------
all present | 10% | 20% | 30% | 40% |
T1 missing  |     | 22% | 33% | 45% |
T1,T2 miss. |     |     | 43% | 57% |
T4 missing  | 17% | 33% | 50% |     |
etc.
Cheng answered 20/10, 2010 at 12:2 Comment(6)
"more than two initial timeseries"? You mean T1, T2, T3? Isn't that just ((T1 agg T2) agg T3)? In which case, any number of time series can be aggregated by simply applying the solution as a reduction. If not, why not?Bine
@Bine - not really. How would you handle weights with T1(0.2), T2(0.2) and T3(0.6)? If at a given timestamp T1 is missing, then the 0.6 of T3 represents actually 75% (T2 has then 25%) and not 60% of the whole group. In your ((T1 agg T2) agg T3) logic this would not work.Cheng
@eumiro: Please update your question with this requirement.Bine
@Bine - I hope it was clear.Cheng
@eumiro: Why is your comment very detailed and the question not very detailed? Please update your question to be complete.Bine
@S.Lott, you're right, I have completed it a little bit.Cheng
C
3

I have tried and found this:

aWgt = 0.3
bWgt = 0.7

print (np.where(a1.mask, 0., a1.data * aWgt) +
       np.where(b1.mask, 0., b1.data * bWgt)) / (np.where(a1.mask, 0., aWgt) +
                                                 np.where(b1.mask, 0., bWgt))

# array([ 3.1,  2. ,  5. ,  5.1])

This is applicable to the edited question with more than one initial timeseries. But hopefully someone will find better.

EDIT: And this is my function:

def weightedAvg(weightedTimeseries):
    sumA = np.sum((np.where(ts.mask, 0., ts.data * weight) for ts, weight in weightedTimeseries), axis=0)
    sumB = np.sum((np.where(ts.mask, 0., weight) for ts, weight in weightedTimeseries), axis=0)
    return np.divide(sumA, sumB)

weightedAvg(((a1, 0.3), (bb, 0.7)))
# array([ 3.1,  2. ,  5. ,  5.1])

Works for any number of timeseries ;-)

Cheng answered 20/10, 2010 at 13:9 Comment(1)
I think your weightedAvg is better than the one I suggested because it is simpler, and takes less memory. Very nice!Aposiopesis

© 2022 - 2024 — McMap. All rights reserved.