Python PANDAS: Resampling Multivariate Time Series with a Groupby
I have data in the following general format that I would like to resample into 30-day time-series windows:

'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43

I would like the 30-day windows to start on 2004-01-01 and end on 2018-12-31. It is NOT guaranteed that every customer will have records in every window. If a customer has multiple transactions in a window, the result should be one record per customer per window: the units-weighted average of the price, the sum of the units, and the concatenation of the product names.

What I have so far is something like this:

import numpy as np
import pandas as pd

# units-weighted average of price within each group
wa = lambda x: np.average(x, weights=df.loc[x.index, 'units'])
# concatenate the product names within each group
con = lambda x: '/'.join(x)

# pass the functions themselves, not the strings 'con'/'wa';
# customer_id is a group key, so it needs no aggregator
agg_funcs = {'product': con,
             'price': wa,
             'transaction_dt': 'first',
             'units': 'sum'}

# pd.Grouper needs a datetime index; keep transaction_dt as a column too
df = df.set_index('transaction_dt', drop=False)
df_window = df.groupby(['customer_id', pd.Grouper(freq='30D')]).agg(agg_funcs)
df_window_final = df_window.unstack('customer_id', fill_value=0)

If anyone knows a better way to approach this problem (particularly an in-place and/or vectorized method), I would appreciate it; see the sketch after the sample output below. Ideally, I would also like to add the window start and end dates as columns to each row.

The final output would look like this ideally:

'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
1,2004-01-02,thing1/thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
2,2004-01-29,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2017-07-15,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
3,2016-05-12,thing3,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2012-02-23,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2009-10-10,thing1,(weighted average price),(total units),(window_start_dt),(window_end_dt)
4,2014-04-04,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
5,2008-07-09,thing2,(weighted average price),(total units),(window_start_dt),(window_end_dt)
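
For what it's worth, here is a minimal sketch of one fully vectorized way to derive the window boundaries (my own illustration, assuming transaction_dt has already been parsed as datetime64 and the windows are anchored at 2004-01-01):

import pandas as pd

anchor = pd.Timestamp('2004-01-01')
# integer window index per row: whole days since the anchor, floor-divided by 30
win = (df['transaction_dt'] - anchor).dt.days // 30
df['window_start_dt'] = anchor + pd.to_timedelta(win * 30, unit='D')
df['window_end_dt'] = df['window_start_dt'] + pd.Timedelta(days=29)

With those two columns in place, a plain groupby on ['customer_id', 'window_start_dt'] would avoid pd.Grouper entirely.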
Traditionalism answered 6/10, 2017 at 18:10

Comments (3):
A weighted average price requires a weight to average by. What is the weight? And, so that there is no ambiguity, what should the final result look like, so that those who decide to help have something to compare against before they submit an answer? — Impassible
Sorry if it was confusing; this is supposed to calculate the price weighted by the number of units in a window during the groupby: wa = lambda x: np.average(x, weights=df.loc[x.index, 'units']) — Traditionalism
The weight for price is the number of units. — Traditionalism
Edited with a new solution. I think you can convert each transaction_dt to a 30-day Period object and then do the grouping.

p = pd.period_range('2004-01-01', '2018-12-31', freq='30D')

def find_period(v):
    # the first period whose end time is past v is the one containing v
    p_idx = np.argmax(v < p.end_time)
    return p[p_idx]

df['period'] = df['transaction_dt'].apply(find_period)
df

   customer_id transaction_dt product  price  units     period
0            1     2004-01-02  thing1     25     47 2004-01-01
1            1     2004-01-17  thing2    150      8 2004-01-01
2            2     2004-01-29  thing2    150     25 2004-01-01
3            3     2017-07-15  thing3     55     17 2017-06-21
4            3     2016-05-12  thing3     55     47 2016-04-27
5            4     2012-02-23  thing2    150     22 2012-02-18
6            4     2009-10-10  thing1     25     12 2009-10-01
7            4     2014-04-04  thing2    150      2 2014-03-09
8            5     2008-07-09  thing2    150     43 2008-07-08
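
As a side note (my own variation, not from the original answer): because the periods are contiguous and sorted, the same lookup can be done without the row-wise apply by binary-searching the period end times:

# vectorized equivalent of find_period: first period whose end time exceeds each date
idx = p.end_time.searchsorted(df['transaction_dt'])
df['period'] = p[idx]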

We can now use this dataframe to get the concatenation of products, the weighted average of price, and the sum of units. We then use some of the Period functionality to get the window start and end times.

def my_funcs(sub_df):
    # sub_df is one (customer_id, period) group
    data = {}
    data['product'] = '/'.join(sub_df['product'].tolist())
    data['units'] = sub_df['units'].sum()
    data['price'] = np.average(sub_df['price'], weights=sub_df['units'])
    data['transaction_dt'] = sub_df['transaction_dt'].iloc[0]
    # every row in the group shares the same period, so take it from the first row
    data['window_start_time'] = sub_df['period'].iloc[0].start_time
    data['window_end_time'] = sub_df['period'].iloc[0].end_time
    return pd.Series(data, index=['transaction_dt', 'product', 'price', 'units',
                                  'window_start_time', 'window_end_time'])

df.groupby(['customer_id', 'period']).apply(my_funcs).reset_index('period', drop=True)

(Resulting frame: one row per customer per 30-day period, with transaction_dt, product, price, units, window_start_time, and window_end_time columns.)
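
A possible refinement (a sketch, assuming df['period'] holds the Period values built above): attach the window boundaries as ordinary columns before grouping, so my_funcs would not need the per-group .iloc[0] lookups:

# PeriodIndex exposes vectorized start_time/end_time accessors
pi = pd.PeriodIndex(df['period'])
df['window_start_time'] = pi.start_time
df['window_end_time'] = pi.end_time

These two columns could then be aggregated with 'first' alongside the others.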

Farman answered 6/10, 2017 at 20:11

Comments (2):
This is a very elegant solution! Thank you! For readers: I also came up with an approach using a series of groupby/transforms to create the new values, followed by a final groupby; it was faster but consumed more memory because it was not in-place. — Traditionalism
I have revisited this again recently. I'm not sure the 'window_start_time' and 'window_end_time' in my_funcs work properly, unfortunately; it seems to only return the dates of the last 30-day window for me. — Traditionalism
