Select rows with interval

Asked 7/2, 2018 at 20:18 Answered 7/2, 2018 at 20:58

Solved python pandas dataframe intervals

I have a wide Pandas dataframe with TimeIndexed values and I wanted to select with an Interval object that I made:

inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')

I tried loc and iloc method but they don't accept Interval instance as argument.

I can test if a Timestamp is in that Interval that way:

pd.Timestamp('2017-12-05 22:00:00') in inter

But I'm not able to write a line to select the row of the dataframe.

Navada answered 7/2, 2018 at 20:18 Comment(1)

One approach df.loc[inter.left:inter.right], although this won't respect closed='left' or closed='right'. – Siu 7/2, 2018 at 20:25

Setup

s = pd.Series(
      pd.date_range('2017-12-05 16:00:00', '2017-12-05 23:00:00', freq='H')
)
s

0   2017-12-05 16:00:00
1   2017-12-05 17:00:00
2   2017-12-05 18:00:00
3   2017-12-05 19:00:00
4   2017-12-05 20:00:00
5   2017-12-05 21:00:00
6   2017-12-05 22:00:00
7   2017-12-05 23:00:00
dtype: datetime64[ns]

Here's how to solve this problem, for all 4 cases of interval inclusion.

closed='left'

(inter.left <= s) & (s < inter.right)

0    False
1     True
2     True
3     True
4     True
5     True
6    False
7    False
dtype: bool

closed='right'

(inter.left < s) & (s <= inter.right)

0    False
1     True
2     True
3     True
4     True
5     True
6     True
7    False
dtype: bool

closed='neither'

(inter.left < s) & (s < inter.right)

0    False
1     True
2     True
3     True
4     True
5     True
6    False
7    False
dtype: bool

closed='both' (use pd.Series.between, should be slightly more efficient).

s.between(inter.left, inter.right, inclusive=True) 

0    False
1     True
2     True
3     True
4     True
5     True
6     True
7    False
dtype: bool

After computing these quantities using any of these methods, selecting rows is as easy s[mask] where mask is the boolean mask we've just computed above.

Saltine answered 7/2, 2018 at 20:40 Comment(9)

I really like this answer. This is a great explanation. +2 if you finish it up with loc or mask – Offhand 7/2, 2018 at 20:45

this must be very fast! – Apocope 7/2, 2018 at 20:46

@MaxU I think it should be. between should be even better (I'm hoping... haven't tested, haha). – Saltine 7/2, 2018 at 20:46

@cᴏʟᴅsᴘᴇᴇᴅ, you must've just added it ;) haha. – Offhand 7/2, 2018 at 20:47

@cᴏʟᴅsᴘᴇᴇᴅ my testing also shows that between is MUCH faster – Offhand 7/2, 2018 at 21:0

But, I'm looking for a builtin method to pass an Intervall directly. If there is not, maybe I should wright according to your exemple. – Navada 8/2, 2018 at 21:16

@Navada There isn't any such builtin besides pd.Series.between, but that only handles one of the 4 possible cases that may arise. And nothing yet to handle interval objects directly, as-is. – Saltine 8/2, 2018 at 21:17

Anyidea on how to write this method clearly ? I'm trying but I failed. It should and Interval argument in input and return a mask – Navada 8/2, 2018 at 21:36

@Navada For any followup questions, you can write a new post with your new problem. – Saltine 8/2, 2018 at 21:38

Here is an example I have. We can use loc and I will walk you through it step by step:

import pandas as pd
inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')
# creating a dataframe of different dates ranging from 12/03 to 12/07
df3 = pd.DataFrame({'Dates':pd.date_range(pd.Timestamp('2017-12-03 16:36:17'), 
      pd.Timestamp('2017-12-07 22:00:00'), freq='H')})

# creating a column to see if the data is in between the interval you created.
df3['In?'] = df3['Dates'].apply(lambda x: x in inter)

#filtering that dataframe 
df3.loc[df3['In?'] ==True]

Now you can probably skip the creating of the In? column and just do the filtering straight away, but I wanted you to see the steps

df3.loc[df3['Dates'].apply(lambda x: x in inter) == True] is how to do it without creating in using the apply() method

Offhand answered 7/2, 2018 at 20:39 Comment(5)

Try this one: df3.query("@inter.left <= Dates < @inter.right") - it should be faster than .apply – Apocope 7/2, 2018 at 20:46

@MaxU apply() got 100 loops, best of 3: 1.33 ms per loop where query() got 100 loops, best of 3: 2.23 ms per loop. Interesting – Offhand 7/2, 2018 at 20:49

try to test it on bigger DF's: df3 = pd.concat([df3]*1000, ignore_index=True) ;-) – Apocope 7/2, 2018 at 20:52

@MaxU. Slightly... 524 ms per loop vs 4.16 ms per loop haha. cᴏʟᴅsᴘᴇᴇᴅ's solution results 2 ms per loop for the digger x1000 dataframe – Offhand 7/2, 2018 at 20:56

yeah, i've added timing for all three options... ;-) – Apocope 7/2, 2018 at 20:59

Borrowing sample data set from @MattR's answer:

In [114]: df3.query("@inter.left <= Dates < @inter.right")
Out[114]:
                 Dates
48 2017-12-05 16:36:17
49 2017-12-05 17:36:17
50 2017-12-05 18:36:17
51 2017-12-05 19:36:17
52 2017-12-05 20:36:17
53 2017-12-05 21:36:17

Timings for ~100K rows DF:

In [109]: df = pd.concat([df3]*1000, ignore_index=True)

In [110]: df.shape
Out[110]: (102000, 1)

In [111]: %timeit df.query("@inter.left <= Dates < @inter.right")
9.1 ms ± 20.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [112]: %timeit df.loc[df['Dates'].apply(lambda x: x in inter) == True]
1.54 s ± 48.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [113]: %timeit df[df['Dates'].between(inter.left, inter.right, inclusive=True)]
3.96 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Apocope answered 7/2, 2018 at 20:58 Comment(0)

Recommended topics

Hot tags