Finding k closest numbers to a given number

Say I have a list [1,2,3,4,5,6,7]. I want to find the 3 closest numbers to, say, 6.5. Then the returned value would be [5,6,7].

Finding the single closest number is not that tricky in Python; it can be done with

min(myList, key=lambda x: abs(x - myNumber))

But I am trying to avoid wrapping this in a loop to find the k closest numbers. Is there a Pythonic way to achieve this?

Hodeida answered 9/6, 2014 at 0:29 Comment(0)

The short answer

The heapq.nsmallest() function will do this neatly and efficiently:

>>> from heapq import nsmallest
>>> s = [1,2,3,4,5,6,7]
>>> nsmallest(3, s, key=lambda x: abs(x - 6.5))
[6, 7, 5]

Essentially this says, "Give me the three input values that have the smallest absolute difference from the number 6.5".
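Note that nsmallest() orders the result by distance from the target, not numerically. If you want the [5, 6, 7] ordering from the question, a follow-up sort of the returned values is cheap, since there are only k of them:

>>> sorted(nsmallest(3, s, key=lambda x: abs(x - 6.5)))
[5, 6, 7]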

Optimizing for repeated lookups

In the comments, @Inutility asked how to optimize for repeated lookups with differing start points. One approach is to pre-sort the data and then use bisect() to locate the center of a small search segment:

from bisect import bisect
from heapq import nsmallest

def k_nearest(k, center, sorted_data):
    'Return *k* members of *sorted_data* nearest to *center*'
    i = bisect(sorted_data, center)            # insertion point for center
    segment = sorted_data[max(i-k, 0) : i+k]   # at most k candidates on each side
    return nsmallest(k, segment, key=lambda x: abs(x - center))

For example:

>>> s.sort()
>>> k_nearest(3, 6.5, s)
[6, 7, 5]
>>> k_nearest(3, 0.5, s)
[1, 2, 3]
>>> k_nearest(3, 4.5, s)    
[4, 5, 3]
>>> k_nearest(3, 5.0, s)
[5, 4, 6]
Doodlebug answered 9/6, 2014 at 0:45 Comment(11)
Then why does the documentation state: "The latter two functions perform best for smaller values of n. For larger values, it is more efficient to use the sorted() function"? To me this is equivalent to saying nsmallest is worse than O(n log n) in asymptotic complexity, but happens to be more efficient for a range of small n. Either the algorithm isn't as efficient as you state, or the documentation is wrong in some way. – Kolnos
@Kolnos: nsmallest() has a time complexity of O(n log k), where n is the size of the container and k is the number of smallest values you are looking for. The complexity of sorted() is O(n log n), but it has a better constant factor. So if k/n is small compared to 1, nsmallest() will win. If k becomes bigger compared to n, at some point sorted() will win due to the better constant. Neither Raymond nor the documentation is wrong, though the documentation could be clearer that "smaller" actually means small compared to the size of the collection. – Brigadier
@Brigadier: Linear-time selection is possible as well, which the documentation neglects. – Barytone
@Barytone: nsmallest() returns the k smallest elements in sorted order, and the fastest algorithm to do this is O(n log k). I believe it's still the fastest algorithm if you drop the requirement that the result is sorted. You can select the kth smallest element in O(n), but I fail to see how you could select the k smallest elements. – Brigadier
@Brigadier: No, the fastest algorithm for that is O(n). You can select the kth smallest element in linear time, and you can compare the elements in the original array against it in linear time. The total algorithm is therefore linear time (a sketch follows this comment thread). – Barytone
@RaymondHettinger: Yes, but as k grows, the selection algorithm will beat the heap algorithm. It is linear time. (I think the constant is 6, but I forget.) – Barytone
@RaymondHettinger: Please let's not get into a long discussion where we're talking past each other. We can both agree that selection is better when k is large compared with n, say n/4, and that the heap algorithm is better when k is small. There are multiple selection algorithms, including Rivest's, quickselect, and median-of-medians. – Barytone
I believe median-of-medians does about 10n total comparisons (with good locality of reference and parallelizability). Quickselect and Rivest's are fast on average. If you want the result in sorted order, then you need to pay k log k regardless of which algorithm you choose, but sorting has nothing to do with the posted question. – Barytone
And just to be clear, only quickselect has a "horrific worst case". The other two algorithms have a linear or n + root(n) worst case. None of these three selection algorithms has "catastrophic cache performance"; they all have excellent locality of reference. If anything, the heap is worst in this regard, as bubbling up an element to the top (which is rare in the average case, sure) touches O(log k) cache pages. – Barytone
If I need to do lots of queries (something like [nsmallest(3, s, key=lambda x: abs(x - v)) for v in arr]), is there a way to reuse the previous results to speed this up? I have about 1 million elements in s and 1 million elements in v. – Inutility
@Inutility I updated the answer to include this case as well. – Doodlebug
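
For the curious, here is a minimal sketch of the linear-time selection idea from the thread above, using quickselect with a random pivot (my illustration, not code from any answer; average O(n), but quadratic in the worst case, as noted in the comments):

import random

def quickselect_smallest(seq, k, key):
    # Return the k smallest items of seq by key, in arbitrary order.
    # Average O(n): each level discards a constant fraction of seq.
    seq = list(seq)
    if k <= 0:
        return []
    if k >= len(seq):
        return seq
    pivot = key(random.choice(seq))
    less = [x for x in seq if key(x) < pivot]
    if k <= len(less):
        return quickselect_smallest(less, k, key)
    equal = [x for x in seq if key(x) == pivot]
    if k <= len(less) + len(equal):
        return less + equal[:k - len(less)]
    greater = [x for x in seq if key(x) > pivot]
    return less + equal + quickselect_smallest(greater, k - len(less) - len(equal), key)

>>> sorted(quickselect_smallest([1,2,3,4,5,6,7], 3, key=lambda x: abs(x - 6.5)))
[5, 6, 7]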

You could compute the distances and sort:

[n for d, n in sorted((abs(x - myNumber), x) for x in myList)[:k]]

This does the following:

  1. Create a sequence of tuples (d, x), where d is each value's distance to the target
  2. Sort that sequence by distance and take the first k elements
  3. Extract just the number values from the result, discarding the distances
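
For example, with the question's data and k = 3:

>>> myList = [1, 2, 3, 4, 5, 6, 7]
>>> myNumber = 6.5
>>> k = 3
>>> [n for d, n in sorted((abs(x - myNumber), x) for x in myList)[:k]]
[6, 7, 5]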
Yarkand answered 9/6, 2014 at 0:31 Comment(2)
sorted accepts a key function, so you don't need the decorate-sort-undecorate pattern. – Palm
Oh, that's true, and indeed the OP was most of the way there! Anyway, Raymond's answer is better. – Yarkand
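
A quick sketch of the key-function variant Palm describes, reusing the question's names:

>>> sorted(myList, key=lambda x: abs(x - myNumber))[:k]
[6, 7, 5]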

Both answers were good, and Greg was right: Raymond's answer is higher-level and easier to implement, but I built upon Greg's answer because it was easier to adapt to my needs.

In case anyone is searching for a way to find the n closest values in a list of dicts:

My dict looks like this, where npi is just an identifier that I need along with the value:

mydict = {u'fnpi': u'1982650024',
 u'snpi': {u'npi': u'1932190360', u'value': 2672},
 u'snpis': [{u'npi': u'1831289255', u'value': 20},
  {u'npi': u'1831139799', u'value': 20},
  {u'npi': u'1386686137', u'value': 37},
  {u'npi': u'1457355257', u'value': 45},
  {u'npi': u'1427043645', u'value': 53},
  {u'npi': u'1477548675', u'value': 53},
  {u'npi': u'1851351514', u'value': 57},
  {u'npi': u'1366446171', u'value': 60},
  {u'npi': u'1568460640', u'value': 75},
  {u'npi': u'1326046673', u'value': 109},
  {u'npi': u'1548281124', u'value': 196},
  {u'npi': u'1912989989', u'value': 232},
  {u'npi': u'1336147685', u'value': 284},
  {u'npi': u'1801894142', u'value': 497},
  {u'npi': u'1538182779', u'value': 995},
  {u'npi': u'1932190360', u'value': 2672},
  {u'npi': u'1114020336', u'value': 3264}]}

value = mydict['snpi']['value'] #value i'm working with below
npi = mydict['snpi']['npi'] #npi (identifier) i'm working with below
snpis = mydict['snpis'] #dict i'm working with below

To get an [id, value] list (not just a list of values), I use this:

[[id,val] for diff, val, id in sorted((abs(x['value']-value), x['value'], x['npi']) for x in snpis)[:6]]

Which produces this:

[[u'1932190360', 2672],
 [u'1114020336', 3264],
 [u'1538182779', 995],
 [u'1801894142', 497],
 [u'1336147685', 284],
 [u'1912989989', 232]]

EDIT

I actually found it pretty easy to adapt Raymond's answer too, if you're dealing with a list of dicts (or a list of lists).

from heapq import nsmallest
[[i['npi'], i['value']] for i in nsmallest(6, snpis, key=lambda x: abs(x['value']-value))]

This will produce the same output as above.

And this:

nsmallest(6, snpis, key=lambda x: abs(x['value']-value))

will produce a list of dicts instead.
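
For the data above, that list of dicts would look like this (reconstructed from the output shown earlier):

[{u'npi': u'1932190360', u'value': 2672},
 {u'npi': u'1114020336', u'value': 3264},
 {u'npi': u'1538182779', u'value': 995},
 {u'npi': u'1801894142', u'value': 497},
 {u'npi': u'1336147685', u'value': 284},
 {u'npi': u'1912989989', u'value': 232}]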

Amplifier answered 8/12, 2015 at 17:49 Comment(0)

For those who want the indices instead:

import numpy as np

def find_nearest_index(array, value, k):
    array = np.array(array)
    # indices of the k smallest absolute differences
    return np.argsort(abs(array - value))[:k]

Example:

find_nearest_index([-3,0,1,2,4,5], 0.2, 4)

# array([1, 2, 3, 0], dtype=int64)
# distance = [3.20 0.20 0.80 1.80 3.80 4.80]
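
If k is small relative to the array, a variant using np.argpartition avoids the full O(n log n) sort. A sketch of mine, returning the k indices ordered by distance (assumes 0 < k < len(array)):

def find_nearest_index_fast(array, value, k):
    array = np.array(array)
    d = abs(array - value)
    # np.argpartition gathers the k smallest distances in O(n), unordered;
    # a final argsort over just those k puts them in distance order.
    idx = np.argpartition(d, k)[:k]
    return idx[np.argsort(d[idx])]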
Supraorbital answered 4/4, 2021 at 3:17 Comment(0)
