efficient way to find several rows above and below a subset of data

Asked 3/3, 2014 at 15:2 Answered 3/3, 2014 at 15:23

I'm wondering if there's an efficient way to get X number of rows below and above a subset of rows. I've created a basic implementation below, but I'm sure there's a better way. The subset that I care about is buyindex, which is the indices of rows that have the buy signal. I want to get several rows above and below the sellindex to verify that my algorithm is working correctly. How do I do it in an efficient way? My way seems roundabout.

buyindex = list(data2[data2['buy'] == True].index)

print buyindex [71, 102, 103, 179, 505, 506, 607]

buyindex1 = map(lambda x: x + 1, buyindex)
buyindex2 = map(lambda x: x - 1, buyindex)
buyindex3 = map(lambda x: x - 2, buyindex)
buyindex4 = map(lambda x: x + 2, buyindex)
buyindex.extend(buyindex1)
buyindex.extend(buyindex2)
buyindex.extend(buyindex3)
buyindex.extend(buyindex4)

buyindex.sort()
data2.iloc[buyindex]

UPDATE - this is the structure of the data. I have the indices of the "buys." but I basically want to get several indices above and below the buys.

VTI upper   lower   sell    buy AboveUpper  BelowLower  date    tokens_left
38   61.25   64.104107   61.341893   False   True    False   True   2007-02-28 00:00:00  5
39   61.08   64.218341   61.109659   False   True    False   True   2007-03-01 00:00:00  5
40   60.21   64.446719   60.640281   False   True    False   True   2007-03-02 00:00:00  5
41   59.51   64.717936   60.050064   False   True    False   True   2007-03-05 00:00:00  5
142  63.27   68.909776   64.310224   False   True    False   True   2007-07-27 00:00:00  5
217  62.98   68.858308   63.587692   False   True    False   True   2007-11-12 00:00:00  5
254  61.90   66.941126   61.944874   False   True    False   True   2008-01-07 00:00:00  5
255  60.79   67.049925   61.312075   False   True    False   True   2008-01-08 00:00:00  5
296  57.02   61.382677   57.371323   False   True    False   True   2008-03-07 00:00:00  5
297  56.15   61.709166   56.788834   False   True    False   True   2008-03-10 00:00:00  5

UPDATE: I created a general function based off the chosen answer. Let me know if you think this could be made even more efficient.

def get_test_index(df, column, numbers):  
    """
    builds an test index based on a range of numbers above and below the a specific index you want.
    df = dataframe to build off of 
    column = the column that is important to you. for instance, 'buy', or 'sell' 
    numbers = how many above and below you want of the important index 

    """

    idx_l = list(df[df[column] == True].index)
    for i in range(numbers)[1:]:
        idxpos = data2[column].shift(i).fillna(False)
        idxpos = list(df[idxpos].index)
        idx_l.extend(idxpos)

        idxneg = data2[column].shift(-i).fillna(False)
        idxneg = list(df[idxneg].index)
        idx_l.extend(idxneg)
    #print idx_l
    return sorted(idx_l)

Vada answered 3/3, 2014 at 15:2 Comment(2)

Could you show us some data and an example of what you want? So we can manipulate it. – Nomi 3/3, 2014 at 15:9

possible duplicate of index out of range error while working with lists in python – Diffuser 3/3, 2014 at 15:27

This will be a very efficient method

In [39]: df = DataFrame(np.random.randn(10,2))

In [41]: start=3

In [42]: stop=4

In [43]: df.iloc[(max(df.index.get_loc(start)-2,0)):min(df.index.get_loc(stop)+2,len(df))]
Out[43]: 
          0         1
1  0.348326  1.413770
2  1.898784  0.053780
3  0.825941 -1.986920
4  0.075956 -0.324657
5 -2.736800 -0.075813

[5 rows x 2 columns]

If you want essentially a function of arbitrary indexers, just create a list of the ones you want and pass to .iloc

In [18]: index_wanted = [71, 102, 103, 179, 505, 506, 607]

In [19]: from itertools import chain

In [20]: df = DataFrame(np.random.randn(1000,2))

You prob want unique ones

f = lambda i: [ i-2, i-1, i, i+1, i+2 ]

In [21]: indexers = Index(list(chain(*[ f(i) for i in [71, 102, 103, 179, 505, 506, 607] ]))).unique()

In [22]: df.iloc[indexers]
Out[22]: 
            0         1
69   0.792996  0.264597
70   1.084315 -0.620006
71  -0.030432  1.219576
72  -0.767855  0.765041
73  -0.637771 -0.103378
100 -1.087505  1.698133
101  1.007143  2.594046
102 -0.307440  0.308360
103  0.944429 -0.411742
104  1.332445 -0.149350
105  0.165213  1.125668
177  0.409580 -0.375709
178 -1.757021 -0.266762
179  0.736809 -1.286848
180  1.856241  0.176931
181 -0.492590  0.083519
503 -0.651788  0.717922
504 -1.612517 -1.729867
505 -1.786807 -0.066421
506  1.423571  0.768161
507  0.186871  1.162447
508  1.233441 -0.028261
605 -0.060117 -1.459827
606 -0.541765 -0.350981
607 -1.166172 -0.026404
608 -0.045338  1.641864
609 -0.337748  0.955940

[27 rows x 2 columns]

Roana answered 3/3, 2014 at 15:23 Comment(5)

is it possible to easily generalize this to an index that's 20 units long? I think this code is really clean. Yet if I have an index that has 20 different numbers out of a 1000 row long dataframe, it's not clear how you'd just get specific slices of the dataframe based on the sellindex ([71, 102, 103, 179, 505, 506, 607]) – Vada 3/3, 2014 at 21:24

updated...sure I misread your question, only thought you wanted one. It generalizes pretty easily. – Roana 3/3, 2014 at 21:55

Thanks Jeff. This looks really good. I'm reading about chain here: bioportal.weizmann.ac.il/course/python/PyMOTW/PyMOTW/docs/… Can you help me understand why chain is used here instead of zip? – Vada 3/3, 2014 at 22:4

sorry one more: if i wanted to create a function that specified the range I wanted, how could I adapt the line below so I wouldn't ahve to change it everytime? So for instance, if I wanted 3 or 4 above and below. indexers = Index(list(chain(*[ [i-2,i-1,i,i,i+1,i+2] for i in [71, 102, 103, 179, 505, 506, 607] ]))).unique() – Vada 3/3, 2014 at 22:8

I updated to show you how you can use a function. All chain does is put together lists (essentially flattening a list-of-lists to a single list) – Roana 3/3, 2014 at 22:56

you can use shift and | operator; for example for +/- 2 days you can do

idx = (data2['buy'] == True).fillna(False)
idx |= idx.shift(-1) | idx.shift(-2)   # one & two days after
idx |= idx.shift(1) | idx.shift(2)     # one & two days before
data2[ idx ] # this is what you need

Boyle answered 3/3, 2014 at 15:11 Comment(9)

Thanks Behzad. could you flesh this out a bit more? I keep getting this error: TypeError: unsupported operand type(s) for |: 'bool' and 'float'for ----> 3 idx |= data2.buy.shift(1) | data2.buy.shift(2) – Vada 3/3, 2014 at 21:6

can you help me understand what the | (is this called a pipe?) is doing here? I'm reading this link but it's not entirely clear to me: https://mcmap.net/q/75488/-pipe-character-in-python – Vada 3/3, 2014 at 21:12

**sorry for all the comments. I think I understand now. The Pipe acts like an "OR" and it allows you to assemble multiple indices that fit the condition. If you use "&" you'll get an empty because it will be difficult to find parts of the df that fit all the conditions? – Vada 3/3, 2014 at 21:18

However, your second line "idx |= data2.buy.shift(1) | data2.buy.shift(2)" still doesn't seem to work for me... – Vada 3/3, 2014 at 21:19

@Vada | here is basically pair-wise or; plz try my edited answer – Boyle 3/3, 2014 at 21:29

Hm, the |= is giving me problems still. However, I mae a general function of your answer (see my update above). Do you think this basically achieves the same thing? – Vada 3/3, 2014 at 21:48

I changed the answer to the chain function because it only requires one line. – Vada 3/3, 2014 at 22:14

Its very helpful! – Diarchy 5/6, 2018 at 14:51

what if I want to take all the days before? do we take idx.shift(len(df))? wouldnt it take just all the days and not all the days before? – Pettis 23/11, 2021 at 17:21

Recommended topics

Hot tags