Sliding window iterator using rolling in pandas
Asked Answered
M

3

9

If it's single row, I can get the iterator as following

import pandas as pd
import numpy as np

a = np.zeros((100,40))
X = pd.DataFrame(a)

for index, row in X.iterrows():
    print index
    print row

Now I want each iterator will return a subset X[0:9, :], X[5:14, :], X[10:19, :] etc. How do I achieve this with rolling (pandas.DataFrame.rolling)?

Mak answered 21/7, 2016 at 16:4 Comment(9)
what is X? what is rolling?Chamberlin
Please see updated question.Mak
OK. What does X[0:9] mean?Chamberlin
It's row 0 to 9 of X, that is X[0:9, 0:40]. I want to get iterator, instead of slicing as it is slow for big dataset.Mak
Why?? iterrows is slower than indexing?Mak
Yes, to my knowledge iterrows is always slower than using index slicing. How big is your dataset.Yanirayank
It's about 2000000x500 intsMak
You may need to prune the 500 columns. Work on a using only what you need vs everything.Yanirayank
I updated my solution to return the iterable such that you don't need to remember to stack.Berkow
B
7

I'll experiment with the following dataframe.

Setup

import pandas as pd
import numpy as np
from string import uppercase

def generic_portfolio_df(start, end, freq, num_port, num_sec, seed=314):
    np.random.seed(seed)
    portfolios = pd.Index(['Portfolio {}'.format(i) for i in uppercase[:num_port]],
                          name='Portfolio')
    securities = ['s{:02d}'.format(i) for i in range(num_sec)]
    dates = pd.date_range(start, end, freq=freq)
    return pd.DataFrame(np.random.rand(len(dates) * num_sec, num_port),
                        index=pd.MultiIndex.from_product([dates, securities],
                                                         names=['Date', 'Id']),
                        columns=portfolios
                       ).groupby(level=0).apply(lambda x: x / x.sum())    


df = generic_portfolio_df('2014-12-31', '2015-05-30', 'BM', 3, 5)

df.head(10)

enter image description here

I'll now introduce a function to roll a number of rows and concatenate into a single dataframe where I'll add a top level to the column index that indicates the location in the roll.

Solution Step-1

def rolled(df, n):
    k = range(df.columns.nlevels)
    _k = [i - len(k) for i in k]
    myroll = pd.concat([df.shift(i).stack(level=k) for i in range(n)],
                       axis=1, keys=range(n)).unstack(level=_k)
    return [(i, row.unstack(0)) for i, row in myroll.iterrows()]

Though its hidden in the function, myroll would look like this

enter image description here

Now we can use it just like an iterator.

Solution Step-2

for i, roll in rolled(df.head(5), 3):
    print roll
    print

                    0   1   2
Portfolio                    
Portfolio A  0.326164 NaN NaN
Portfolio B  0.201597 NaN NaN
Portfolio C  0.085340 NaN NaN

                    0         1   2
Portfolio                          
Portfolio A  0.278614  0.326164 NaN
Portfolio B  0.314448  0.201597 NaN
Portfolio C  0.266392  0.085340 NaN

                    0         1         2
Portfolio                                
Portfolio A  0.258958  0.278614  0.326164
Portfolio B  0.089224  0.314448  0.201597
Portfolio C  0.293570  0.266392  0.085340

                    0         1         2
Portfolio                                
Portfolio A  0.092760  0.258958  0.278614
Portfolio B  0.262511  0.089224  0.314448
Portfolio C  0.084208  0.293570  0.266392

                    0         1         2
Portfolio                                
Portfolio A  0.043503  0.092760  0.258958
Portfolio B  0.132221  0.262511  0.089224
Portfolio C  0.270490  0.084208  0.293570
Berkow answered 21/7, 2016 at 16:53 Comment(1)
Did you look at np.strides at all? Might use the above on something that Im working on.Yanirayank
P
1

That's not how rolling works. It "provides rolling transformations" (from the docs).

You can loop and use pandas indexing?

for i in range((X.shape[0] + 9) // 10):
    X_subset = X.iloc[i * 10: (i + 1) * 10]
Pittsburgh answered 21/7, 2016 at 16:21 Comment(1)
How can I get an iterator instead, similar to iterrows()?Mak
C
1

It can be done with rolling but is inefficient (due to iterating all windows) and requires Pandas 1.1:

for index, row in enumerate(list(X.rolling(10))[::5]):

Bit faster (iteration in C)

from itertools import islice
for index, row in enumerate(islice(X.rolling(10), None, None, 5)):
Cookhouse answered 11/2, 2022 at 13:12 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.