Find numbers within a range bisect python
Asked Answered
Z

8

6

I have a list of integer numbers and I want to write a function that returns a subset of numbers that are within a range. Something like NumbersWithinRange(list, interval) function name...

I.e.,

list = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
interval = [4,20]
results = NumbersWithinRange(list, interval)  # [4,4,6,8,7,8]

maybe i forgot to write one more number in results, but that's the idea...

The list can be as big as 10/20 million length, and the range is normally of a few 100.

Any suggestions on how to do it efficiently with python - I was thinking to use bisect.

Thanks.

Zoroaster answered 24/8, 2012 at 14:0 Comment(2)
You should not use list as a variable name. Python lets you (silently) reassign the built-in list constructor if you do...Swagsman
correct. it was just for the example, i wouldn't use that name in code. thanks for the correction.Zoroaster
D
7

I would use numpy for that, especially if the list is that long. For example:

In [101]: list = np.array([4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100])
In [102]: list
Out[102]: 
array([  4,   2,   1,   7,   9,   4,   3,   6,   8,  97,   7,  65,   3,
         2,   2,  78,  23,   1,   3,   4,   5,  67,   8, 100])
In [103]: good = np.where((list > 4) & (list < 20)) 
In [104]: list[good]
Out[104]: array([7, 9, 6, 8, 7, 5, 8])

# %timeit says that numpy is MUCH faster than any list comprehension: 
# create an array 10**6 random ints b/w 0 and 100
In [129]: arr = np.random.randint(0,100,1000000)
In [130]: interval = xrange(4,21)
In [126]: %timeit r = [x for x in arr if x in interval]
1 loops, best of 3: 14.2 s per loop

In [136]: %timeit good = np.where((list > 4) & (list < 20)) ; new_list = list[good]
100 loops, best of 3: 10.8 ms per loop

In [134]: %timeit r = [x for x in arr if 4 < x < 20]
1 loops, best of 3: 2.22 s per loop 

In [142]: %timeit filtered = [i for i in ifilter(lambda x: 4 < x < 20, arr)]
1 loops, best of 3: 2.56 s per loop
Distiller answered 24/8, 2012 at 14:5 Comment(8)
How does the list comprehension do when you use 4 <= x <= 20 instead of x in interval? Checking if the value is in an xrange iterator is going to slow things down.Hothouse
yeah, its faster by a factor of 6, but still about 4 orders of magnitude slower than numpyDistiller
Ah, yes, I see now. I wasn't paying close attention to the units (s vs ms). numpy is indeed very much faster.Hothouse
your timeit doesn't include arr[good]. Enumerating items one by one in numpy array is slow, Python list should be faster. Also OP says that the resulting array should have ~ 100 items; use larger limits for randint.Margaretmargareta
No, this is always going to be faster in numpy that with a Python list due to strong typing in a numpy array. Added arr[good], no performance hitDistiller
@chepner, yup, its 10 millisec with numpy vs 2 seconds for pure pythonDistiller
nice discussion created here, I'll look into both for my code. In my case I'll be calling this function million's of times so I really need it to be fast. I'll try them out. Thanks a lot!Zoroaster
@user1443118: I meant that [x for x in np_array if f(x)] might be slower than [x for x in py_list if f(x)]. Naturally vectorized operations with numpy arrays that exclude pure Python loop should be faster.Margaretmargareta
H
6

The pure-Python Python sortedcontainers module has a SortedList type that can help you. It maintains the list automatically in sorted order and has been tested passed tens of millions of elements. The sorted list type has a bisect function you can use.

from sortedcontainers import SortedList
data = SortedList(...)

def NumbersWithinRange(items, lower, upper):
    start = items.bisect(lower)
    end = items.bisect_right(upper)
    return items[start:end]

subset = NumbersWithinRange(data, 4, 20)

Bisecting and indexing will be much faster this way than scanning the entire list. The sorted containers module is very fast and has a performance comparison page with benchmarks against alternative implementations.

Headcloth answered 10/4, 2014 at 23:46 Comment(0)
H
5

If the list isn't sorted, you need to scan the entire list:

lst = [ 4,2,1,...]
interval=[4,20]
results = [ x for x in lst if interval[0] <= x <= interval[1] ]

If the list is sorted, you can use bisect to find the left and right indices that bound your range.

left = bisect.bisect_left(lst, interval[0])
right = bisect.bisect_right(lst, interval[1])

results = lst[left+1:right]

Since scanning the list is O(n) and sorting is O(n lg n), it probably is not worth sorting the list just to use bisect unless you plan on doing lots of range extractions.

Hothouse answered 24/8, 2012 at 14:8 Comment(2)
yes, I am planning on doing 100 or more million's of extractions, does it sound realistic or it'd take forever?Zoroaster
Hard to say, but use the numpy answer, as it is far faster than a pure-python solution.Hothouse
G
2

I think this should be sufficiently efficient:

>>> nums = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
>>> r = [x for x in nums if 4 <= x <21]
>>> r
[4, 7, 9, 4, 6, 8, 7, 4, 5, 8]

Edit:

After J.F. Sebastian's excellent observation, modified the code.

Gauze answered 24/8, 2012 at 14:6 Comment(3)
Thats going to work, but be slow if the list really is of order 10^6Distiller
Its not clarified if the list is sorted; if it is then bisect should be faster.Gauze
i in xrange is not optimized (unlike i in range on Python 3). It is the same as i in iterable i.e., it enumerates the values one by one. So 4 <= x < 21 should be used insteadMargaretmargareta
L
1

Using iterators

>>> from itertools import ifilter
>>> A = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]
>>> [i for i in ifilter(lambda x: 4 < x < 20, A)]
[7, 9, 6, 8, 7, 5, 8]
Latea answered 24/8, 2012 at 14:16 Comment(0)
S
1

Let's create a list similar to what you described:

import random  
l = [random.randint(-100000,100000) for i in xrange(1000000)]

Now test some possible solutions:

interval=range(400,800)

def v2():
    """ return a list """
    return [i for i in l if i in interval]

def v3():
    """ return a generator """
    return list((i for i in l if i in interval))

def v4():
    def te(x):
        return x in interval

    return filter(te,l)

def v5():
    return [i for i in ifilter(lambda x: x in interval, l)]    


print len(v2()),len(v3()), len(v4()), len(v5())
cmpthese.cmpthese([v2,v3,v4,v5],micro=True, c=2)

Prints this:

   rate/sec   usec/pass   v5    v4    v2    v3
v5        0 6929225.922   -- -0.4% -1.0% -1.6%
v4        0 6903028.488 0.4%    -- -0.6% -1.2%
v2        0 6861472.487 1.0%  0.6%    -- -0.6%
v3        0 6817855.477 1.6%  1.2%  0.6%    --

HOWEVER, watch what happens if interval is a set instead of a list:

interval=set(range(400,800))
cmpthese.cmpthese([v2,v3,v4,v5],micro=True, c=2)

  rate/sec  usec/pass     v5     v4     v3     v2
v5        5 201332.569     -- -20.6% -62.9% -64.6%
v4        6 159871.578  25.9%     -- -53.2% -55.4%
v3       13  74769.974 169.3% 113.8%     --  -4.7%
v2       14  71270.943 182.5% 124.3%   4.9%     --

Now comparing with numpy:

na=np.array(l)

def v7():
    """ assume you have to convert from list => numpy array and return a list """
    arr=np.array(l)
    tgt = np.where((arr >= 400) & (arr < 800)) 
    return [arr[x] for x in tgt][0].tolist()


def v8():
    """ start with a numpy list but return a python list """
    tgt = np.where((na >= 400) & (na < 800)) 
    return na[tgt].tolist()


def v9():
    """ numpy all the way through """
    tgt = np.where((na >= 400) & (na < 800)) 
    return [na[x] for x in tgt][0]  
    # or return na[tgt] if you prefer that syntax...    

cmpthese.cmpthese([v2,v3,v4,v5, v7, v8,v9],micro=True, c=2)  

   rate/sec  usec/pass      v5      v4      v7     v3     v2     v8     v9
v5        5 185431.957      --  -17.4%  -24.7% -63.3% -63.4% -93.6% -93.6%
v4        7 153095.007   21.1%      --   -8.8% -55.6% -55.7% -92.3% -92.3%
v7        7 139570.475   32.9%    9.7%      -- -51.3% -51.4% -91.5% -91.5%
v3       15  67983.985  172.8%  125.2%  105.3%     --  -0.2% -82.6% -82.6%
v2       15  67861.438  173.3%  125.6%  105.7%   0.2%     -- -82.5% -82.5%
v8       84  11850.476 1464.8% 1191.9% 1077.8% 473.7% 472.6%     --  -0.0%
v9       84  11847.973 1465.1% 1192.2% 1078.0% 473.8% 472.8%   0.0%     --   

Clearly numpy is faster than pure python as long as you can work with numpy all the way through. Otherwise, use a set for the interval to speed up a bit...

Swagsman answered 24/8, 2012 at 14:36 Comment(2)
That's not a fair comparison. v3 doesn't actually do any of the work. You would need to compare v2 with a version of v3 which actually builds the list.Favorable
@DSM: fixed v3 to be a fair comparisonSwagsman
A
0

I think you are looking for something like this..

b=[i for i in a if 4<=i<90]
print sorted(set(b))
[4, 5, 6, 7, 8, 9, 23, 65, 67, 78]
Ataliah answered 17/4, 2013 at 12:53 Comment(0)
F
0

If your data set isn't too sparse, you could use "bins" to store and retrieve the data. For example:

a = [4,2,1,7,9,4,3,6,8,97,7,65,3,2,2,78,23,1,3,4,5,67,8,100]

# Initalize a list of 0's [0, 0, ...]
# This is assuming that the minimum possible value is 0
bins = [0 for _ in range(max(a) + 1)]  

# Update the bins with the frequency of each number
for i in a:
    bins[i] += 1


def NumbersWithinRange(data, interval):
    result = []
    for i in range(interval[0], interval[1] + 1):
        freq = data[i]
        if freq > 0:
            result += [i] * freq
    return result

This works for this test case:

print(NumbersWithinRange(bins, [4, 20]))
# [4, 4, 4, 5, 6, 7, 7, 8, 8, 9]

For simplicity, I omitted some bounds checking in the function.

To reiterate, this may work better in terms of space and time usage, but it depends heavily on your particular data set. The less sparse the data set, the better it will do.

Farce answered 7/5, 2019 at 14:37 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.