Finding matching interval(s) in a pandas IntervalIndex
There's this interesting API called IntervalIndex, new in pandas 0.20, that lets you create an index of intervals.

Given some sample data:

import pandas as pd

data = [(893.1516130000001, 903.9187099999999),
 (882.384516, 893.1516130000001),
 (817.781935, 828.549032)]

You can create the index like this:

idx = pd.IntervalIndex.from_tuples(data)

print(idx)
IntervalIndex([(893.151613, 903.91871], (882.384516, 893.151613], (817.781935, 828.549032]]
              closed='right',
              dtype='interval[float64]')

An interesting property of Intervals is that you can perform containment checks with in:

print(idx[-1])
Interval(817.78193499999998, 828.54903200000001, closed='right')

print(820 in idx[-1])
True

print(1000 in idx[-1])
False

I would like to know how to apply this operation to the entire index. For example, given some number, say 900, how could I retrieve a boolean mask of the intervals that contain it?

I can think of:

m = [900 in y for y in idx]
print(m)
[True, False, False]

Are there better ways to do this?

Waligore answered 22/9, 2017 at 12:21 Comment(5)
I'm not aware of one; is something amiss with your way?Seleucid
@Seleucid It seems like a useful function to have, so I thought there'd be something of the sort. The only thing amiss with the list comprehension is the loop ;-/Waligore
@Bharathshetty I am a noob. I don't know what's good and what's bad!Waligore
@cᴏʟᴅsᴘᴇᴇᴅ why doesn't get_loc work with datetimes?Dissoluble
@Bharath It should... I'm not experienced enough with this API to tell you why :(Waligore

If you are interested in performance, an IntervalIndex is optimized for searching: using .get_loc or .get_indexer hits an internally built IntervalTree (similar to a binary tree), which is constructed on first use.

In [29]: idx = pd.IntervalIndex.from_tuples(data*10000)

# map, first run
In [30]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
92.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# map, subsequent run
In [40]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
42.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# construct tree and search
In [31]: %timeit -n 1 -r 1 idx.get_loc(900)
4.55 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# subsequently
In [32]: %timeit -n 1 -r 1 idx.get_loc(900)
137 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# for a single indexer you can do even better (note that this is
# dipping into the implementation a bit)
In [27]: %timeit np.arange(len(idx))[(900 > idx.left) & (900 <= idx.right)]
203 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that .get_loc() returns an indexer (which is actually more useful than a boolean array, but they are convertible to each other).

In [38]: idx.map(lambda x: 900 in x)
    ...: 
Out[38]: 
Index([ True, False, False,  True, False, False,  True, False, False,  True,
       ...
       False,  True, False, False,  True, False, False,  True, False, False], dtype='object', length=30000)

In [39]: idx.get_loc(900)
    ...: 
Out[39]: array([29997,  9987, 10008, ..., 19992, 19989,     0])

A boolean array can be converted to an array of indexers:

In [5]: np.arange(len(idx))[idx.map(lambda x: 900 in x).values.astype(bool)]
Out[5]: array([    0,     3,     6, ..., 29991, 29994, 29997])

This is what .get_loc() and .get_indexer() return:

In [6]: np.sort(idx.get_loc(900))
Out[6]: array([    0,     3,     6, ..., 29991, 29994, 29997])
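
For the reverse direction, here's a minimal sketch (assuming numpy is imported as np, and that .get_loc() returns integer positions as shown above) that turns the indexer back into a boolean mask:

indexer = idx.get_loc(900)             # positions of the intervals containing 900
mask = np.zeros(len(idx), dtype=bool)  # start from an all-False mask
mask[indexer] = True                   # True exactly at those positions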
Kb answered 22/9, 2017 at 19:46 Comment(4)
Thank you! This is just amazing. So, just so I’ve understood, get_loc builds a tree that is internally cached for faster searches later?Waligore
Also, you mentioned that they are convertible to each other. How exactly does the second output map to the first?Waligore
Yes, an IntervalTree is built internally; this is how the indexing works. We normally build a hashtable for other index types to do lookups, but this structure yields lots of benefits here. Updated with how to convert between indexers.Kb
In newer versions of Pandas (2.0+) it seems like get_loc now only accepts and returns scalars, and get_indexer now only accepts and returns sequences. get_loc([900]) is an error, as is get_indexer(900).Choiseul
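
A minimal sketch of that split, plus the vectorized .contains() check (assuming pandas >= 2.0; the outputs shown are for the original three-interval idx):

idx.get_loc(900)        # scalar in, position(s) out: 0
idx.get_indexer([900])  # sequence in, integer array out: array([0])
idx.contains(900)       # elementwise check: array([ True, False, False])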

If you're looking for speed, you can use the left and right attributes of idx, i.e. get the lower and upper bound of each interval, then check whether the number falls between those bounds:

list(lower <= 900 <= upper for (lower, upper) in zip(idx.left,idx.right))

Or

(900 > idx.left) & (900 <= idx.right)
array([ True, False, False])

For small data

%%timeit
list(lower <= 900 <= upper for (lower, upper) in zip(idx.left,idx.right))
100000 loops, best of 3: 11.26 µs per loop

%%timeit
[900 in y for y in idx]
100000 loops, best of 3: 9.26 µs per loop

For large data

idx = pd.IntervalIndex.from_tuples(data*10000)

%%timeit
list(lower <= 900 <= upper for (lower, upper) in zip(idx.left,idx.right))
10 loops, best of 3: 29.2 ms per loop

%%timeit
[900 in y for y in idx]
10 loops, best of 3: 64.6 ms per loop

This method beats your solution for large data.
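
As a minimal sketch, the bounds check can be wrapped in a small helper; note that the operator pair assumes the default closed='right' (other closed values would need the comparisons adjusted):

def contains_mask(idx, x):
    # strict on the left, inclusive on the right, matching closed='right'
    return (x > idx.left) & (x <= idx.right)

contains_mask(idx, 900)
# array([ True, False, False])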

Dissoluble answered 22/9, 2017 at 14:2 Comment(2)
I think they are not documented yet but you can see the code github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/indexes/…Dissoluble
Jeff is one of the core developers of pandas. His word is law ... literally... but thanks for understanding :-)Waligore

You can use map:

idx.map(lambda x: 900 in x)
#Index([True, False, False], dtype='object')

Timings:

%timeit [900 in y for y in idx]
#100000 loops, best of 3: 3.76 µs per loop

%timeit idx.map(lambda x: 900 in x)
#10000 loops, best of 3: 48.7 µs per loop

%timeit map(lambda x: 900 in x, idx)
#100000 loops, best of 3: 4.95 µs per loop

Obviously, the comprehension is the fastest, but the builtin map doesn't fall too far behind.
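
One caveat, assuming these timings come from Python 2, where the builtin map returns a list: on Python 3, map is lazy, so it has to be materialized for a fair comparison:

list(map(lambda x: 900 in x, idx))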

Results even out when we introduce more data; to be precise, 10K times more data:

%timeit [900 in y for y in idx]
#10 loops, best of 3: 26.8 ms per loop

%timeit idx.map(lambda x: 900 in x)
#10 loops, best of 3: 30 ms per loop

%timeit map(lambda x: 900 in x, idx)
#10 loops, best of 3: 29.5 ms per loop

As we see, the builtin map comes very close to .map(), so let's see what happens with 10 times more data:

%timeit [900 in y for y in idx]
#1 loop, best of 3: 270 ms per loop

%timeit idx.map(lambda x: 900 in x)
#1 loop, best of 3: 299 ms per loop

%timeit map(lambda x: 900 in x, idx)
#1 loop, best of 3: 291 ms per loop

Conclusion:

The comprehension is the winner, but its lead is far less distinct on larger amounts of data.

Uncomfortable answered 22/9, 2017 at 12:48 Comment(3)
Thanks for the answer, but can you prove to me that a map is better than the list comprehension? I'd love to see timings on this, if you don't mind! With some large data too.Waligore
comprehension is faster :)Uncomfortable
Very disappointing. I liked your answer. Still, I'd love to see the timings.Waligore

Using NumPy:

import numpy as np
data = [(893.1516130000001, 903.9187099999999),
        (882.384516, 893.1516130000001),
        (817.781935, 828.549032)]
q = 900
# The next line broadcasts q against the interval bounds and tells
# whether q falls within each interval in data
np.logical_xor(*(np.array(data) - q > 0).transpose())
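
For clarity, here's the same XOR trick unpacked step by step (a sketch; note that exact boundary hits are treated differently from pandas' closed='right' semantics, since both comparisons here are strict):

arr = np.array(data)                     # shape (3, 2): one (left, right) row per interval
left_gt_q, right_gt_q = (arr - q > 0).T  # left > q and right > q, elementwise
mask = np.logical_xor(left_gt_q, right_gt_q)  # exactly one holds iff left < q < right
print(mask)                              # [ True False False]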
Paedo answered 19/11, 2020 at 15:29 Comment(1)
Try to add some context to boost answer quality.Cari
