Finding k closest numbers to a given number

Say I have a list [1,2,3,4,5,6,7]. I want to find the 3 closest numbers to, say, 6.5. Then the returned value would be [5,6,7].

Finding the single closest number is not that tricky in Python; it can be done with

min(myList, key=lambda x: abs(x - myNumber))

But I am trying to avoid wrapping this in a loop to find the k closest numbers. Is there a Pythonic way to achieve this?

Hodeida answered 9/6, 2014 at 0:29 Comment(0)

The short answer

The heapq.nsmallest() function will do this neatly and efficiently:

>>> from heapq import nsmallest
>>> s = [1,2,3,4,5,6,7]
>>> nsmallest(3, s, key=lambda x: abs(x - 6.5))
[6, 7, 5]

Essentially this says, "Give me the three input values that have the smallest absolute difference from the number 6.5".
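Note that nsmallest() orders the result by distance from the target, not numerically. If you want the [5, 6, 7] ordering from the question, a follow-up sort of the returned values is cheap, since there are only k of them:

>>> sorted(nsmallest(3, s, key=lambda x: abs(x - 6.5)))
[5, 6, 7]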

Optimizing for repeated lookups

In the comments, @Inutility asked how to optimize for repeated lookups with differing start points. One approach is to pre-sort the data and then use bisect() to locate the center of a small search segment:

from bisect import bisect
from heapq import nsmallest

def k_nearest(k, center, sorted_data):
    'Return *k* members of *sorted_data* nearest to *center*'
    i = bisect(sorted_data, center)            # insertion point for center
    segment = sorted_data[max(i-k, 0) : i+k]   # at most k candidates on each side
    return nsmallest(k, segment, key=lambda x: abs(x - center))

For example:

>>> s.sort()
>>> k_nearest(3, 6.5, s)
[6, 7, 5]
>>> k_nearest(3, 0.5, s)
[1, 2, 3]
>>> k_nearest(3, 4.5, s)    
[4, 5, 3]
>>> k_nearest(3, 5.0, s)
[5, 4, 6]
Doodlebug answered 9/6, 2014 at 0:45 Comment(11)
Then why does the documentation state: "The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the sorted() function"? To me this is equivalent to saying nsmallest is worse than O(n log n) in asymptotic complexity, but happens to be more efficient for a range of small n. Either the algorithm isn't as efficient as you state, or the documentation is wrong in some way. – Kolnos
@Kolnos: nsmallest() has a time complexity of O(n log k), where n is the size of the container and k is the number of smallest values you are looking for. The complexity of sorted() is O(n log n), but it has a better constant factor. So if k/n is small compared to 1, nsmallest() will win. If k becomes bigger compared to n, at some point sorted() will win due to the better constant. Neither Raymond nor the documentation is wrong, though the documentation could be clearer that "smaller" actually means small compared to the size of the collection. – Brigadier
@Brigadier: Linear-time selection is possible as well, which the documentation neglects. – Barytone
@Barytone: nsmallest() returns the k smallest elements in sorted order, and the fastest algorithm to do this is O(n log k). I believe it's still the fastest algorithm if you drop the requirement that the result is sorted. You can select the kth smallest element in O(n), but I fail to see how you could select the k smallest elements. – Brigadier
@Brigadier: No, the fastest algorithm for that is O(n). You can select the kth smallest element in linear time, and you can compare the elements in the original array against it in linear time. The total algorithm is therefore linear time (a sketch follows this comment thread). – Barytone
@RaymondHettinger: Yes, but as k grows, the selection algorithm will beat the heap algorithm. It is linear time. (I think the constant is 6, but I forget.) – Barytone
@RaymondHettinger: Please let's not get into a long discussion where we're talking past each other. We can both agree that selection is better when k is large compared with n, say n/4, and that the heap algorithm is better when k is small. There are multiple selection algorithms, including Rivest's, quickselect, and median-of-medians. – Barytone
I believe median-of-medians does about 10n total comparisons (with good locality of reference and parallelizability). Quickselect and Rivest's are fast on average. If you want the result in sorted order, then you need to pay k log k regardless of which algorithm you choose, but sorting has nothing to do with the posted question. – Barytone
And just to be clear, only quickselect has a "horrific worst case". The other two algorithms have a linear or n + root(n) worst case. None of these three selection algorithms has "catastrophic cache performance"; they all have excellent locality of reference. If anything, the heap is worst in this regard, as bubbling up an element to the top (which is rare in the average case, sure) touches O(log k) cache pages. – Barytone
If I need to do lots of queries (something like [nsmallest(3, s, key=lambda x: abs(x - v)) for v in arr]), is there a way to reuse the previous results to speed this up? I have about 1 million elements in s and 1 million elements in v. – Inutility
@Inutility I updated the answer to include this case as well. – Doodlebug
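
For the curious, here is a minimal sketch of the linear-time selection idea from the thread above, using quickselect with a random pivot (my illustration, not code from any answer; average O(n), but quadratic in the worst case, as noted in the comments):

import random

def quickselect_smallest(seq, k, key):
    # Return the k smallest items of seq by key, in arbitrary order.
    # Average O(n): each level discards a constant fraction of seq.
    seq = list(seq)
    if k <= 0:
        return []
    if k >= len(seq):
        return seq
    pivot = key(random.choice(seq))
    less = [x for x in seq if key(x) < pivot]
    if k <= len(less):
        return quickselect_smallest(less, k, key)
    equal = [x for x in seq if key(x) == pivot]
    if k <= len(less) + len(equal):
        return less + equal[:k - len(less)]
    greater = [x for x in seq if key(x) > pivot]
    return less + equal + quickselect_smallest(greater, k - len(less) - len(equal), key)

>>> sorted(quickselect_smallest([1,2,3,4,5,6,7], 3, key=lambda x: abs(x - 6.5)))
[5, 6, 7]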

You could compute the distances and sort:

[n for d, n in sorted((abs(x - myNumber), x) for x in myList)[:k]]

This does the following:

  1. Create a sequence of tuples (d, x), where d is each value's distance to the target
  2. Sort that sequence by distance and take the first k elements
  3. Extract just the number values from the result, discarding the distances
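
For example, with the question's data and k = 3:

>>> myList = [1, 2, 3, 4, 5, 6, 7]
>>> myNumber = 6.5
>>> k = 3
>>> [n for d, n in sorted((abs(x - myNumber), x) for x in myList)[:k]]
[6, 7, 5]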
Yarkand answered 9/6, 2014 at 0:31 Comment(2)
sorted accepts a key function, so you don't need the decorate-sort-undecorate pattern. – Palm
Oh, that's true, and indeed the OP was most of the way there! Anyway, Raymond's answer is better. – Yarkand
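
A quick sketch of the key-function variant Palm describes, reusing the question's names:

>>> sorted(myList, key=lambda x: abs(x - myNumber))[:k]
[6, 7, 5]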

Both answers were good, and Greg was right: Raymond's answer is higher-level and easier to implement, but I built upon Greg's answer because it was easier to adapt to my needs.

In case anyone is searching for a way to find the n closest values in a list of dicts:

My dict looks like this, where npi is just an identifier that I need along with the value:

mydict = {u'fnpi': u'1982650024',
 u'snpi': {u'npi': u'1932190360', u'value': 2672},
 u'snpis': [{u'npi': u'1831289255', u'value': 20},
  {u'npi': u'1831139799', u'value': 20},
  {u'npi': u'1386686137', u'value': 37},
  {u'npi': u'1457355257', u'value': 45},
  {u'npi': u'1427043645', u'value': 53},
  {u'npi': u'1477548675', u'value': 53},
  {u'npi': u'1851351514', u'value': 57},
  {u'npi': u'1366446171', u'value': 60},
  {u'npi': u'1568460640', u'value': 75},
  {u'npi': u'1326046673', u'value': 109},
  {u'npi': u'1548281124', u'value': 196},
  {u'npi': u'1912989989', u'value': 232},
  {u'npi': u'1336147685', u'value': 284},
  {u'npi': u'1801894142', u'value': 497},
  {u'npi': u'1538182779', u'value': 995},
  {u'npi': u'1932190360', u'value': 2672},
  {u'npi': u'1114020336', u'value': 3264}]}

value = mydict['snpi']['value'] #value i'm working with below
npi = mydict['snpi']['npi'] #npi (identifier) i'm working with below
snpis = mydict['snpis'] #dict i'm working with below

To get an [id, value] list (not just a list of values), I use this:

[[id,val] for diff, val, id in sorted((abs(x['value']-value), x['value'], x['npi']) for x in snpis)[:6]]

Which produces this:

[[u'1932190360', 2672],
 [u'1114020336', 3264],
 [u'1538182779', 995],
 [u'1801894142', 497],
 [u'1336147685', 284],
 [u'1912989989', 232]]

EDIT

I actually found it pretty easy to adapt Raymond's answer too, if you're dealing with a list of dicts (or a list of lists).

from heapq import nsmallest
[[i['npi'], i['value']] for i in nsmallest(6, snpis, key=lambda x: abs(x['value']-value))]

This will produce the same output as above.

And this:

nsmallest(6, snpis, key=lambda x: abs(x['value']-value))

will produce a list of dicts instead.
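
For the data above, that list of dicts would look like this (reconstructed from the output shown earlier):

[{u'npi': u'1932190360', u'value': 2672},
 {u'npi': u'1114020336', u'value': 3264},
 {u'npi': u'1538182779', u'value': 995},
 {u'npi': u'1801894142', u'value': 497},
 {u'npi': u'1336147685', u'value': 284},
 {u'npi': u'1912989989', u'value': 232}]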

Amplifier answered 8/12, 2015 at 17:49 Comment(0)

For those who want the indices instead:

import numpy as np

def find_nearest_index(array, value, k):
    array = np.array(array)
    # indices of the k smallest absolute differences
    return np.argsort(abs(array - value))[:k]

Example:

find_nearest_index([-3,0,1,2,4,5], 0.2, 4)

# array([1, 2, 3, 0], dtype=int64)
# distance = [3.20 0.20 0.80 1.80 3.80 4.80]
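
If k is small relative to the array, a variant using np.argpartition avoids the full O(n log n) sort. A sketch of mine, returning the k indices ordered by distance (assumes 0 < k < len(array)):

def find_nearest_index_fast(array, value, k):
    array = np.array(array)
    d = abs(array - value)
    # np.argpartition gathers the k smallest distances in O(n), unordered;
    # a final argsort over just those k puts them in distance order.
    idx = np.argpartition(d, k)[:k]
    return idx[np.argsort(d[idx])]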
Supraorbital answered 4/4, 2021 at 3:17 Comment(0)
