Find integer nearest-neighbour in a dict

Asked 17/3, 2015 at 8:34 Answered 17/3, 2015 at 9:10

Solved python algorithm dictionary nearest-neighbor

I have a dict that takes integer keys:

a = {}
a[1] = 100
a[55] = 101
a[127] = 102

I would like to be able to take the nearest neighbour when asking :

a[20] # should return a[1] = 100
a[58] # should return a[55] = 101
a[167] # should return a[127] = 102

Is there a pythonic way of doing this? (I imagine this can be done by looping on all dict, but that's probably not the most elegant solution?)

Same question with double-index (integers as well) :

 b[90, 1] = 100, b[90, 55] = 101, b[90, 127] = 102
 b[70, 1] = 40, b[70, 45] = 41, b[70, 107] = 42

I would like to be able to get b[73, 40] = b[70, 45] = 41, i.e. nearest neighboor in a 2D plane.

Unaffected answered 17/3, 2015 at 8:34 Comment(16)

@wim Yes, the dict is static. – Unaffected 17/3, 2015 at 8:37

What do you want in the case of a tie for distance? – Autogiro 17/3, 2015 at 8:37

@Autogiro If equal distance, both neighbours would work for me – Unaffected 17/3, 2015 at 8:47

do you mean you want both? or either will do? note that for the 2-d case, there can be a 4-way tie – Autogiro 17/3, 2015 at 8:47

Let's say always take the upper neighbour (in both directions) – Unaffected 17/3, 2015 at 8:48

@Autogiro No, there can be an unbounded amount of ties if real keys are allowed. – Orchestrate 17/3, 2015 at 8:48

i wrote a dict in C that makes this easier. the way it works it has forward and reverse methods. look up a[20] and it returns a not found status but it remembers where at. the dict is ordered. just go forward and reverse and it gives a[55] an a[1] for those 2 methods. the caller can then decide which. – Burck 17/3, 2015 at 8:49

@Orchestrate not with integer indices, there can't – Autogiro 17/3, 2015 at 8:49

do you want to know when you are getting a neighbor? – Burck 17/3, 2015 at 8:52

@Autogiro not infinite, but all discrete points on a circle will be ties. – Zecchino 17/3, 2015 at 8:53

@Zecchino could be asking for the center makes it harder – Burck 17/3, 2015 at 8:55

Order of magnitude: I have approx ~ 2000 static elements in the dict. And the dict might be accessed 100 times per second with a[i, j], with 0<i<128, 0<j<128. This has to run on a Raspberry Pi, and be very light ;) – Unaffected 17/3, 2015 at 8:55

With a standard dictionary, you will not be able to find your nearest neighbors in less than linear time, due to the characteristics of the hash function. If you just want to avoid explicit for loops in your code, Timothée's answer is fine, but the loop is still there. If you want smaller complexity, you'll have to look at alternative containers (see space-partitioning methods) which are typically more costly to build once but are quicker to search into. – Schuller 17/3, 2015 at 8:57

Then you need to do some pre-calculation steps to create a datastructure that is more suited to your needs. Some sort of tree structure would seem to be the natural choice... – Zecchino 17/3, 2015 at 8:59

@Unaffected sounds like you are doing a map app. there are better ways to organize the objects. – Burck 17/3, 2015 at 9:0

If you want it efficiently, numpy/scipy is perhaps the best way e.g. #10819046 – Oliviero 17/3, 2015 at 9:7

Something similar to this:

class CustomDict(dict):
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            closest_key = min(self.keys(), key=lambda x: abs(x - key))
            return dict.__getitem__(self, closest_key)

Or this:

class CustomDict(dict):
    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        else:
            closest_key = min(self.keys(), key=lambda x: abs(x - key))
            return dict.__getitem__(self, closest_key)

Both gives this result:

a = CustomDict()
a[1] = 100
a[55] = 101
a[127] = 102

print a[20] # prints 100
print a[58] # prints 101
print a[167] # prints 102

And for the double index version:

class CustomDoubleDict(dict):
    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        else:
            closest_key = min(self.keys(), key=lambda c: (c[0] - key[0]) ** 2 + (c[1] - key[1]) ** 2)
            return dict.__getitem__(self, closest_key)


b = CustomDoubleDict()
b[90, 1] = 100
b[90, 55] = 101
b[90, 127] = 102
b[70, 1] = 40
b[70, 45] = 41
b[70, 107] = 42

print b[73, 40]  # prints 41
print b[70, 45]  # prints 41

Dora answered 17/3, 2015 at 8:45 Comment(5)

Thanks for the solution. Can the downvoters explain why it's not a good solution? (I don't understand yet exactly why it would be good or bad...) – Unaffected 17/3, 2015 at 8:50

Yes it would be great if the downvoters can explain in details why they think it might not be correct. Just trying to learn here. :) – Pegu 17/3, 2015 at 8:54

I downvoted because this loops over all keys every time there is not an exact match. See my answer for an O(log n) solution. – Orchestrate 17/3, 2015 at 9:1

It was difficult to choose an accepted answer. This one for simplicty, and a quick solution to the problem (but non optimal because loops over all keys). jedwards's one for N-dimensional generalization. And wim's answer for the application to my problem. – Unaffected 17/3, 2015 at 10:16

@Unaffected it's a sign that it's a good question when it attracts many varied answers and discussion, nice job. – Crossland 17/3, 2015 at 10:30

Update: After benchmarking the two approaches in this answer, the second approach is significantly better, to the point that it should almost be strictly preferred.

The following approach handles n-dimensions identically:

class NearestDict(dict):
    def __init__(self, ndims):
        super(NearestDict, self).__init__()
        self.ndims = ndims

    # Enforce dimensionality
    def __setitem__(self, key, val):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        super(NearestDict, self).__setitem__(key, val)

    @staticmethod
    def __dist(ka, kb):
        assert len(ka) == len(kb)
        return sum((ea-eb)**2 for (ea, eb) in zip(ka, kb))

    # Helper method and might be of use
    def nearest_key(self, key):
        if not isinstance(key, tuple): key = (key,)
        nk = min((k for k in self), key=lambda k: NearestDict.__dist(key, k))
        return nk

    def __missing__(self, key):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        return self[self.nearest_key(key)]

Demo:

a = NearestDict(1)
a[1] = 100
a[55] = 101
a[127] = 102
print a[20]    # 100
print a[58]    # 100
print a[167]   # 102
print a.nearest_key(20)     # (1,)
print a.nearest_key(58)     # (55,)
print a.nearest_key(127)    # (127,)

b = NearestDict(2)
b[90, 1]   = 100
b[90, 55]  = 101
b[90, 127] = 102
b[70, 1]   = 40
b[70, 45]  = 41
b[70, 107] = 42
print b[73, 40] # 41
print b.nearest_key((73,40)) # (70, 45)

Note that if the key exists, the lookup is no slower than a standard dictionary lookup. If the key does not exist, you compute the distance between every existing key. Nothing is cached, although you could tack that on I suppose.

Edit:

From the approach suggested by Kasra's answer the following approach implements the same class as above using scipy's cKDTree:

Note that there is an additional optional argument, regenOnAdd that will allow you to defer (re)-building the KDTree until after you have completed (the majority of) your inserts:

from scipy.spatial import cKDTree

class KDDict(dict):
    def __init__(self, ndims, regenOnAdd=False):
        super(KDDict, self).__init__()
        self.ndims = ndims
        self.regenOnAdd = regenOnAdd
        self.__keys = []
        self.__tree = None
        self.__stale = False

    # Enforce dimensionality
    def __setitem__(self, key, val):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        self.__keys.append(key)
        self.__stale = True
        if self.regenOnAdd: self.regenTree()
        super(KDDict, self).__setitem__(key, val)

    def regenTree(self):
        self.__tree = cKDTree(self.__keys)
        self.__stale = False

    # Helper method and might be of use
    def nearest_key(self, key):
        if not isinstance(key, tuple): key = (key,)
        if self.__stale: self.regenTree()
        _, idx = self.__tree.query(key, 1)
        return self.__keys[idx]

    def __missing__(self, key):
        if not isinstance(key, tuple): key = (key,)
        if len(key) != self.ndims: raise KeyError("key must be %d dimensions" % self.ndims)
        return self[self.nearest_key(key)]

Output is the same as the above approach.

Benchmark Results

To understand the performance of the three approaches (NearestDict, KDDict(True) (regen on insert), and KDDict(False) (defer regen)), I briefly benchmarked them.

I ran 3 different tests. The parameters that stayed the same across the tests were:

Number of Test Iterations: I did each test 5 times and took the minimum time. (Note timeit.repeat defaults to 3).
Point boundaries: I generated integer key points in the range 0 <= x < 1000
Number of lookups: I timed inserts and lookups separately. The three tests below all use 10,000 lookups.

The first test used keys of 4 dimensions, and 1,000 insertions.

{'NDIMS': 4, 'NITER': 5, 'NELEMS': 1000, 'NFINDS': 10000, 'DIM_LB': 0, 'DIM_UB': 1000, 'SCORE_MUL': 100}
insert::NearestDict       0.125
insert::KDDict(regen)    35.957
insert::KDDict(defer)     0.174
search::NearestDict    2636.965
search::KDDict(regen)    49.965
search::KDDict(defer)    51.880

The second test used keys of 4 dimensions and 100 insertions. I wanted to vary the number of insertions to see how well the two approaches performed as the density of the dictionary varied.

{'NDIMS': 4, 'NITER': 5, 'NELEMS': 100, 'NFINDS': 10000, 'DIM_LB': 0, 'DIM_UB': 1000, 'SCORE_MUL': 100}
insert::NearestDict       0.013
insert::KDDict(regen)     0.629
insert::KDDict(defer)     0.018
search::NearestDict     247.920
search::KDDict(regen)    44.523
search::KDDict(defer)    44.718

The third test used 100 insertions (like the second test) but 12 dimensions. I wanted to see how the approaches performed as key dimensionality increased.

{'NDIMS': 12, 'NITER': 5, 'NELEMS': 100, 'NFINDS': 10000, 'DIM_LB': 0, 'DIM_UB': 1000, 'SCORE_MUL': 100}
insert::NearestDict       0.013
insert::KDDict(regen)     0.722
insert::KDDict(defer)     0.017
search::NearestDict     405.092
search::KDDict(regen)    49.046
search::KDDict(defer)    50.601

Discussion

KDDict with continuous regeneration (KDDict(True)) is either fractionally faster (in lookup) or considerably slower (in insertion). Because of this, I'm leaving it out of the discussion and focusing on NearestDict and KDDict(False), now referred to as simply KDDict

The results were surprisingly in favor of KDDict with deferred regeneration.

For insertion, in all cases, KDDict performed slightly worse than NearestDict. This was to be expected because of the additional list append operation.

For search, in all cases, KDDict performed significantly better than NearestDict.

As sparsity of the dictionary decreased / density increased, NearestDict's performance decreased to a much greater extent than KDDict. When going from 100 keys to 1000 keys, NearestDict search time increased by 9.64x while KDDict search time increased by only 0.16x.

As the dimensionality of the dictionary was increased, NearestDict's performance decreased to a greater than extent than KDDict. When going from 4 to 12 dimensions, NearestDict search time increased by 0.64x while KDDict search time increased by only 0.13x.

In light of this, and the relatively equal complexity of the two classes, if you have access to the scipy toolkit, using the KDDict approach is strongly recommended.

Crossland answered 17/3, 2015 at 9:3 Comment(4)

This loops over every possible key for every non-exact index though. – Orchestrate 17/3, 2015 at 9:3

@Orchestrate I just edited that in. I agree. Some sort of filtering could be implemented (these keys are strictly worse than other candidates), or caching, or a maybe a more appropriate hashing scheme could improve this quite a bit for large dictionaries. – Crossland 17/3, 2015 at 9:6

Like the N-dimension possibility! – Connaught 17/3, 2015 at 9:13

Thanks for all these studies on this question! – Unaffected 20/3, 2015 at 6:35

Something similar to this:

class CustomDict(dict):
    def __getitem__(self, key):
        try:
            return dict.__getitem__(self, key)
        except KeyError:
            closest_key = min(self.keys(), key=lambda x: abs(x - key))
            return dict.__getitem__(self, closest_key)

Or this:

class CustomDict(dict):
    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        else:
            closest_key = min(self.keys(), key=lambda x: abs(x - key))
            return dict.__getitem__(self, closest_key)

Both gives this result:

a = CustomDict()
a[1] = 100
a[55] = 101
a[127] = 102

print a[20] # prints 100
print a[58] # prints 101
print a[167] # prints 102

And for the double index version:

class CustomDoubleDict(dict):
    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        else:
            closest_key = min(self.keys(), key=lambda c: (c[0] - key[0]) ** 2 + (c[1] - key[1]) ** 2)
            return dict.__getitem__(self, closest_key)


b = CustomDoubleDict()
b[90, 1] = 100
b[90, 55] = 101
b[90, 127] = 102
b[70, 1] = 40
b[70, 45] = 41
b[70, 107] = 42

print b[73, 40]  # prints 41
print b[70, 45]  # prints 41

Dora answered 17/3, 2015 at 8:45 Comment(5)

Thanks for the solution. Can the downvoters explain why it's not a good solution? (I don't understand yet exactly why it would be good or bad...) – Unaffected 17/3, 2015 at 8:50

Yes it would be great if the downvoters can explain in details why they think it might not be correct. Just trying to learn here. :) – Pegu 17/3, 2015 at 8:54

I downvoted because this loops over all keys every time there is not an exact match. See my answer for an O(log n) solution. – Orchestrate 17/3, 2015 at 9:1

@Unaffected it's a sign that it's a good question when it attracts many varied answers and discussion, nice job. – Crossland 17/3, 2015 at 10:30

Option 1:

Maintain a separate and ordered list of the keys (or use an OrderedDict). Find the nearest key using binary search. This should be O(log n).

Option 2: (If the data is not very large and sparse)

Since you mentioned the dict is static, make a pass through the dict to populate all the missing values once. Maintain the maximum and minimum keys, and override the __getitem__ so that keys above the maximum or below the minimum return the correct value. This should be O(1).

Option 3:

Just use the loop over the keys every time, it will be O(n). Try it in your application and you might find that the simple solution is quite fast and adequate anyway.

Autogiro answered 17/3, 2015 at 8:58 Comment(9)

I have approx 2000 elements in the dict, but they might be searched with a[i,j] with i and j in [1, 127]. So Option 2 would imply populating the dict with <16384 items (i,j). Probably this might be the best. It would take probably < 1 second and < 100kb in memory, right? What do you think? – Unaffected 17/3, 2015 at 9:5

Yes, that would be no problem. It would be better to use a numpy ndarray than a dict though. Since your data is numerical, you will surely have better performance and memory footprint this way. – Autogiro 17/3, 2015 at 9:6

In fact I simplified a bit, but the values are not numerical, they are Sound objects. My purpose is writing a sampler, with Samples[Midinote, Velocity] = Sound('mysound.wav'). Sometimes, you have only a few .wav files, and you want to be able to play for any MIDI note value, and for any velocity (on MIDI keyboard). That's why I need Samples[60, 43] to use MiddleC-60-Velocity50.wav if there is no sample available for velocity=43. – Unaffected 17/3, 2015 at 9:12

OK then perhaps a dict with tuple keys is fine (or ndarray of object). It will be important to make sure the duplicate values are not copies, but references to the same Sound object. – Autogiro 17/3, 2015 at 9:15

Also for search keys bounded in 1, 127 the simple (option 3) solution will likely be quite usable, possibly even the fastest! – Autogiro 17/3, 2015 at 9:23

About Option 3, do you mean for example Timothée Jeannin's answer, https://mcmap.net/q/1965321/-find-integer-nearest-neighbour-in-a-dict ? I still think looping on 2000 elements, and this 100 times per second on a RaspPi will be too much (moreover, this task should be very light : the CPU-needing task will be audio .wav mixing!) – Unaffected 17/3, 2015 at 9:29

Would you have an example code for Option 2, i.e. populating the dict in a single pass: for i in range(127): for j in range(127): a[i,j] = nearest... – Unaffected 17/3, 2015 at 9:35

you want to find the i, j which minimise sum((i - y)**2 + (j - x)**2 for x, y in zip(xs, ys)). xs, ys are lists of the 2000 vertices where you have data points. – Autogiro 17/3, 2015 at 9:41

Let us continue this discussion in chat. – Unaffected 17/3, 2015 at 11:36

What about using min with a proper key function :

>>> b ={(90, 55): 101, (90, 127): 102, (90, 1): 100}
>>> def nearest(x,y):
...   m=min(((i,j) for i,j in b ),key= lambda v:abs(v[0]-x)+abs(v[1]-y))
...   return b[m]
... 
>>> nearest(40,100)
102
>>> nearest(90,100)
102
>>> b
{(90, 55): 101, (90, 127): 102, (90, 1): 100}
>>> nearest(90,10)
100

The preceding answer is a pythonic answer that i suggested but if you look for a fast way you can use scipy.spatial.KDTree :

class scipy.spatial.KDTree(data, leafsize=10)

kd-tree for quick nearest-neighbor lookup

This class provides an index into a set of k-dimensional points which can be used to rapidly look up the nearest neighbors of any point.

Also have a look at http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html#scipy-spatial-ckdtree

and http://docs.scipy.org/doc/scipy/reference/spatial.distance.html#module-scipy.spatial.distance

Prophesy answered 17/3, 2015 at 9:1 Comment(7)

This loops over every key for every index. – Orchestrate 17/3, 2015 at 9:2

@Orchestrate i just wanted to suggest a pythoinc answer! – Prophesy 17/3, 2015 at 9:20

The KDTree approach looks promising. – Crossland 17/3, 2015 at 9:21

Thanks for this pythonic solution, and the other one, optimized for speed. – Unaffected 17/3, 2015 at 9:24

@Kasra, are you planning on implementing the KDTree approach? If not, I'll modify my class to use KDTree (and maybe benchmark it one day, if I find the time) – Crossland 17/3, 2015 at 9:40

@Crossland No, no problem feel free to do it ;) – Prophesy 17/3, 2015 at 9:45

@Kasra Thanks! Implementation was quite simple actually, great lead on the KDTree -- profiling to follow :) – Crossland 17/3, 2015 at 9:57

Untested O(log n) example for the one-dimensional case:

import collections
import bisect


class ProximityDict(collections.MutableMapping):
    def __init__(self, *args, **kwargs):
        self.store = dict()
        self.ordkeys = []
        self.update(dict(*args, **kwargs))

    def __getitem__(self, key):
        try: return self.store[key]
        except KeyError:
            cand = bisect.bisect_left(self.ordkeys, key)
            if cand == 0: return self.store[self.ordkeys[0]]

            return self.store[
                min(self.ordkeys[cand], self.ordkeys[cand-1],
                    key=lambda k: abs(k - key))
            ]

    def __setitem__(self, key, value):
        if not key in self.store: bisect.insort_left(self.ordkeys, key)
        self.store[key] = value

    def __delitem__(self, key):
        del self.store[key]
        del self.keys[bisect.bisect_left(self.ordkeys, key)]

    def __iter__(self):
        return iter(self.store)

    def __len__(self):
        return len(self.store)

The two-dimensional case is significantly more annoying (you'd need to store keys ordered in a quadtree), but is possible in a similar way.

I did not code deleting to also have 'proximity' behaviour, but you could do that as well.

Orchestrate answered 17/3, 2015 at 8:56 Comment(4)

Your ProximityDict is not a dict object and relies on ordered keys. The question is Is there a pythonic way of doing this? not Is there a fast way to do this?. Pythonic is simple and easy to understand, fast comes later. – Pegu 17/3, 2015 at 9:9

It is nonetheless pythonic code, and the question clearly stated they were aware of the obvious O(n) solution, so there is no need to provide an O(n) answer imo. +1 – Autogiro 17/3, 2015 at 9:13

@Autogiro the questioner asked for a pythonic way of doing it, imo that means one that takes advantage of various syntactic opportunities to produce easily-readable and maintainable code. His for loop note isn't about performance, but "elegance" (his words, not mine). That being said, orlp still got an upvote from me, despite not handling 2-dimensions (which was also asked). – Crossland 17/3, 2015 at 9:18

@TimothéeJeannin My ProximityDict inherits from collections.MutableMapping, which to my knowledge is the best way to go about this. And what do you mean that I rely on ordered keys? I explicitly keep track of ordered keys myself. – Orchestrate 17/3, 2015 at 9:25

Here come one pythonic solution that rely mainly on map/filter operations

class NeirestSearchDictionnary1D(dict):
    """ An extended dictionnary that returns the value that is the nearest to 
    the requested key. As it's key distance is defined for simple number 
    values, trying to add other keys will throw error. """
    def __init__(self):
        """ Constructor of the dictionnary.
        It only allow to initialze empty dict """
        dict.__init__(self)

    def keyDistance(self, key1, key2):
        """ returns a distance between 2 dic keys """
        return abs(key1-key2)

    def __setitem__(self, key, value):
        """ override  the addition of a couple in the dict."""
        #type checking
        if not (isinstance(key, int) or isinstance(key, float)):
            raise TypeError("The key of such a "+ type(self) + "must be a simple numerical value")
        else:
            dict.__setitem__(self, key, value)

    def __getitem__(self, key):
        """ Override the getting item operation """
        #compute the minial distance
        minimalDistance = min(map(lambda x : self.keyDistance(key, x), self.keys()))
        #get the list of key that minimize the distance
        resultSetKeys = filter(lambda x : self.keyDistance(key, x) <= minimalDistance, self.keys())
        #return the values binded to the keys minimizing the distances.
        return list(map(lambda x : dict.__getitem__(self, x), resultSetKeys))

if __name__ == "__main__":

    dic = NeirestSearchDictionnary1D()
    dic[1] = 100
    dic[55] = 101
    dic[57] = 102
    dic[127] = 103
    print("the entire dict :", dic)
    print("dic of '20'", dic[20])
    print("dic of '56'", dic[56])

Obviously, you can extends this to 2D dimension with little work.

Highup answered 17/3, 2015 at 9:10 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags