Select rows with interval
Asked Answered
N

3

5

I have a wide Pandas dataframe with TimeIndexed values and I wanted to select with an Interval object that I made:

inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')

I tried loc and iloc method but they don't accept Interval instance as argument.

I can test if a Timestamp is in that Interval that way:

pd.Timestamp('2017-12-05 22:00:00') in inter

But I'm not able to write a line to select the row of the dataframe.

Navada answered 7/2, 2018 at 20:18 Comment(1)
One approach df.loc[inter.left:inter.right], although this won't respect closed='left' or closed='right'.Siu
S
4

Setup

s = pd.Series(
      pd.date_range('2017-12-05 16:00:00', '2017-12-05 23:00:00', freq='H')
)
s

0   2017-12-05 16:00:00
1   2017-12-05 17:00:00
2   2017-12-05 18:00:00
3   2017-12-05 19:00:00
4   2017-12-05 20:00:00
5   2017-12-05 21:00:00
6   2017-12-05 22:00:00
7   2017-12-05 23:00:00
dtype: datetime64[ns]

Here's how to solve this problem, for all 4 cases of interval inclusion.

  1. closed='left'

    (inter.left <= s) & (s < inter.right)
    
    0    False
    1     True
    2     True
    3     True
    4     True
    5     True
    6    False
    7    False
    dtype: bool
    
  2. closed='right'

    (inter.left < s) & (s <= inter.right)
    
    0    False
    1     True
    2     True
    3     True
    4     True
    5     True
    6     True
    7    False
    dtype: bool
    
  3. closed='neither'

    (inter.left < s) & (s < inter.right)
    
    0    False
    1     True
    2     True
    3     True
    4     True
    5     True
    6    False
    7    False
    dtype: bool
    
  4. closed='both' (use pd.Series.between, should be slightly more efficient).

    s.between(inter.left, inter.right, inclusive=True) 
    
    0    False
    1     True
    2     True
    3     True
    4     True
    5     True
    6     True
    7    False
    dtype: bool
    

After computing these quantities using any of these methods, selecting rows is as easy s[mask] where mask is the boolean mask we've just computed above.

Saltine answered 7/2, 2018 at 20:40 Comment(9)
I really like this answer. This is a great explanation. +2 if you finish it up with loc or maskOffhand
this must be very fast!Apocope
@MaxU I think it should be. between should be even better (I'm hoping... haven't tested, haha).Saltine
@cᴏʟᴅsᴘᴇᴇᴅ, you must've just added it ;) haha.Offhand
@cᴏʟᴅsᴘᴇᴇᴅ my testing also shows that between is MUCH fasterOffhand
But, I'm looking for a builtin method to pass an Intervall directly. If there is not, maybe I should wright according to your exemple.Navada
@Navada There isn't any such builtin besides pd.Series.between, but that only handles one of the 4 possible cases that may arise. And nothing yet to handle interval objects directly, as-is.Saltine
Anyidea on how to write this method clearly ? I'm trying but I failed. It should and Interval argument in input and return a maskNavada
@Navada For any followup questions, you can write a new post with your new problem.Saltine
O
2

Here is an example I have. We can use loc and I will walk you through it step by step:

import pandas as pd
inter = pd.Interval(pd.Timestamp('2017-12-05 16:36:17'),
                    pd.Timestamp('2017-12-05 22:00:00'), closed='left')
# creating a dataframe of different dates ranging from 12/03 to 12/07
df3 = pd.DataFrame({'Dates':pd.date_range(pd.Timestamp('2017-12-03 16:36:17'), 
      pd.Timestamp('2017-12-07 22:00:00'), freq='H')})

# creating a column to see if the data is in between the interval you created.
df3['In?'] = df3['Dates'].apply(lambda x: x in inter)

#filtering that dataframe 
df3.loc[df3['In?'] ==True]

Now you can probably skip the creating of the In? column and just do the filtering straight away, but I wanted you to see the steps

df3.loc[df3['Dates'].apply(lambda x: x in inter) == True] is how to do it without creating in using the apply() method

Offhand answered 7/2, 2018 at 20:39 Comment(5)
Try this one: df3.query("@inter.left <= Dates < @inter.right") - it should be faster than .applyApocope
@MaxU apply() got 100 loops, best of 3: 1.33 ms per loop where query() got 100 loops, best of 3: 2.23 ms per loop. InterestingOffhand
try to test it on bigger DF's: df3 = pd.concat([df3]*1000, ignore_index=True) ;-)Apocope
@MaxU. Slightly... 524 ms per loop vs 4.16 ms per loop haha. cᴏʟᴅsᴘᴇᴇᴅ's solution results 2 ms per loop for the digger x1000 dataframeOffhand
yeah, i've added timing for all three options... ;-)Apocope
A
1

Borrowing sample data set from @MattR's answer:

In [114]: df3.query("@inter.left <= Dates < @inter.right")
Out[114]:
                 Dates
48 2017-12-05 16:36:17
49 2017-12-05 17:36:17
50 2017-12-05 18:36:17
51 2017-12-05 19:36:17
52 2017-12-05 20:36:17
53 2017-12-05 21:36:17

Timings for ~100K rows DF:

In [109]: df = pd.concat([df3]*1000, ignore_index=True)

In [110]: df.shape
Out[110]: (102000, 1)

In [111]: %timeit df.query("@inter.left <= Dates < @inter.right")
9.1 ms ± 20.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [112]: %timeit df.loc[df['Dates'].apply(lambda x: x in inter) == True]
1.54 s ± 48.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [113]: %timeit df[df['Dates'].between(inter.left, inter.right, inclusive=True)]
3.96 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Apocope answered 7/2, 2018 at 20:58 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.