efficient way to find several rows above and below a subset of data
Asked Answered
V

2

6

I'm wondering if there's an efficient way to get X number of rows below and above a subset of rows. I've created a basic implementation below, but I'm sure there's a better way. The subset that I care about is buyindex, which is the indices of rows that have the buy signal. I want to get several rows above and below the sellindex to verify that my algorithm is working correctly. How do I do it in an efficient way? My way seems roundabout.

buyindex = list(data2[data2['buy'] == True].index)

print buyindex [71, 102, 103, 179, 505, 506, 607]

buyindex1 = map(lambda x: x + 1, buyindex)
buyindex2 = map(lambda x: x - 1, buyindex)
buyindex3 = map(lambda x: x - 2, buyindex)
buyindex4 = map(lambda x: x + 2, buyindex)
buyindex.extend(buyindex1)
buyindex.extend(buyindex2)
buyindex.extend(buyindex3)
buyindex.extend(buyindex4)

buyindex.sort()
data2.iloc[buyindex]

UPDATE - this is the structure of the data. I have the indices of the "buys." but I basically want to get several indices above and below the buys.

VTI upper   lower   sell    buy AboveUpper  BelowLower  date    tokens_left
38   61.25   64.104107   61.341893   False   True    False   True   2007-02-28 00:00:00  5
39   61.08   64.218341   61.109659   False   True    False   True   2007-03-01 00:00:00  5
40   60.21   64.446719   60.640281   False   True    False   True   2007-03-02 00:00:00  5
41   59.51   64.717936   60.050064   False   True    False   True   2007-03-05 00:00:00  5
142  63.27   68.909776   64.310224   False   True    False   True   2007-07-27 00:00:00  5
217  62.98   68.858308   63.587692   False   True    False   True   2007-11-12 00:00:00  5
254  61.90   66.941126   61.944874   False   True    False   True   2008-01-07 00:00:00  5
255  60.79   67.049925   61.312075   False   True    False   True   2008-01-08 00:00:00  5
296  57.02   61.382677   57.371323   False   True    False   True   2008-03-07 00:00:00  5
297  56.15   61.709166   56.788834   False   True    False   True   2008-03-10 00:00:00  5

UPDATE: I created a general function based off the chosen answer. Let me know if you think this could be made even more efficient.

def get_test_index(df, column, numbers):  
    """
    builds an test index based on a range of numbers above and below the a specific index you want.
    df = dataframe to build off of 
    column = the column that is important to you. for instance, 'buy', or 'sell' 
    numbers = how many above and below you want of the important index 

    """

    idx_l = list(df[df[column] == True].index)
    for i in range(numbers)[1:]:
        idxpos = data2[column].shift(i).fillna(False)
        idxpos = list(df[idxpos].index)
        idx_l.extend(idxpos)

        idxneg = data2[column].shift(-i).fillna(False)
        idxneg = list(df[idxneg].index)
        idx_l.extend(idxneg)
    #print idx_l
    return sorted(idx_l)
Vada answered 3/3, 2014 at 15:2 Comment(2)
Could you show us some data and an example of what you want? So we can manipulate it.Nomi
possible duplicate of index out of range error while working with lists in pythonDiffuser
R
3

This will be a very efficient method

In [39]: df = DataFrame(np.random.randn(10,2))

In [41]: start=3

In [42]: stop=4

In [43]: df.iloc[(max(df.index.get_loc(start)-2,0)):min(df.index.get_loc(stop)+2,len(df))]
Out[43]: 
          0         1
1  0.348326  1.413770
2  1.898784  0.053780
3  0.825941 -1.986920
4  0.075956 -0.324657
5 -2.736800 -0.075813

[5 rows x 2 columns]

If you want essentially a function of arbitrary indexers, just create a list of the ones you want and pass to .iloc

In [18]: index_wanted = [71, 102, 103, 179, 505, 506, 607]

In [19]: from itertools import chain

In [20]: df = DataFrame(np.random.randn(1000,2))

You prob want unique ones

f = lambda i: [ i-2, i-1, i, i+1, i+2 ]

In [21]: indexers = Index(list(chain(*[ f(i) for i in [71, 102, 103, 179, 505, 506, 607] ]))).unique()

In [22]: df.iloc[indexers]
Out[22]: 
            0         1
69   0.792996  0.264597
70   1.084315 -0.620006
71  -0.030432  1.219576
72  -0.767855  0.765041
73  -0.637771 -0.103378
100 -1.087505  1.698133
101  1.007143  2.594046
102 -0.307440  0.308360
103  0.944429 -0.411742
104  1.332445 -0.149350
105  0.165213  1.125668
177  0.409580 -0.375709
178 -1.757021 -0.266762
179  0.736809 -1.286848
180  1.856241  0.176931
181 -0.492590  0.083519
503 -0.651788  0.717922
504 -1.612517 -1.729867
505 -1.786807 -0.066421
506  1.423571  0.768161
507  0.186871  1.162447
508  1.233441 -0.028261
605 -0.060117 -1.459827
606 -0.541765 -0.350981
607 -1.166172 -0.026404
608 -0.045338  1.641864
609 -0.337748  0.955940

[27 rows x 2 columns]
Roana answered 3/3, 2014 at 15:23 Comment(5)
is it possible to easily generalize this to an index that's 20 units long? I think this code is really clean. Yet if I have an index that has 20 different numbers out of a 1000 row long dataframe, it's not clear how you'd just get specific slices of the dataframe based on the sellindex ([71, 102, 103, 179, 505, 506, 607])Vada
updated...sure I misread your question, only thought you wanted one. It generalizes pretty easily.Roana
Thanks Jeff. This looks really good. I'm reading about chain here: bioportal.weizmann.ac.il/course/python/PyMOTW/PyMOTW/docs/… Can you help me understand why chain is used here instead of zip?Vada
sorry one more: if i wanted to create a function that specified the range I wanted, how could I adapt the line below so I wouldn't ahve to change it everytime? So for instance, if I wanted 3 or 4 above and below. indexers = Index(list(chain(*[ [i-2,i-1,i,i,i+1,i+2] for i in [71, 102, 103, 179, 505, 506, 607] ]))).unique()Vada
I updated to show you how you can use a function. All chain does is put together lists (essentially flattening a list-of-lists to a single list)Roana
B
2

you can use shift and | operator; for example for +/- 2 days you can do

idx = (data2['buy'] == True).fillna(False)
idx |= idx.shift(-1) | idx.shift(-2)   # one & two days after
idx |= idx.shift(1) | idx.shift(2)     # one & two days before
data2[ idx ] # this is what you need  
Boyle answered 3/3, 2014 at 15:11 Comment(9)
Thanks Behzad. could you flesh this out a bit more? I keep getting this error: TypeError: unsupported operand type(s) for |: 'bool' and 'float'for ----> 3 idx |= data2.buy.shift(1) | data2.buy.shift(2)Vada
can you help me understand what the | (is this called a pipe?) is doing here? I'm reading this link but it's not entirely clear to me: https://mcmap.net/q/75488/-pipe-character-in-pythonVada
**sorry for all the comments. I think I understand now. The Pipe acts like an "OR" and it allows you to assemble multiple indices that fit the condition. If you use "&" you'll get an empty because it will be difficult to find parts of the df that fit all the conditions?Vada
However, your second line "idx |= data2.buy.shift(1) | data2.buy.shift(2)" still doesn't seem to work for me...Vada
@Vada | here is basically pair-wise or; plz try my edited answerBoyle
Hm, the |= is giving me problems still. However, I mae a general function of your answer (see my update above). Do you think this basically achieves the same thing?Vada
I changed the answer to the chain function because it only requires one line.Vada
Its very helpful!Diarchy
what if I want to take all the days before? do we take idx.shift(len(df))? wouldnt it take just all the days and not all the days before?Pettis

© 2022 - 2024 — McMap. All rights reserved.