How should I find nearest neighbors for every element in a list?

Asked 12/10, 2011 at 15:50 Answered 14/10, 2016 at 2:47

Solved algorithm language-agnostic nearest-neighbor

I have two sets of integers A and B (size of A less than or equal to B), and I want to answer the question, "How close is A to B?". The way I want to answer this question is by producing a measure of how far you have to go from a given a in A to find a b in B.

The specific measure I want to produce does the following: for each a, find the closest b, the only catch being that once I match a b with an a, I can no longer use that b to match any other a's. (EDIT: the algorithm I'm trying to implement will always prefer a shorter match. So if b is the nearest neighbor to more than one a, pick the a nearest to b. I'm not sure what to do if more than one a has the same distance to b, right now I'm picking the a that precedes b, but that's quite arbitrary and not necessarily optimal.) The measure that I'll for make these sets, the final product, is a histogram showing the number of pairs in the vertical axis and the distance of the pairs in the x-axis.

So if A = {1, 3, 4} and B = {1, 5, 6, 7}, I will get the following a,b pairs: 1,1, 4,5, 3,6. For these data, the histogram should show one pair with distance zero, one pair with distance 1, and one pair with distance 3.

(The actual size of these sets has an upper bound around 100,000 elements, and I read them in from disk already sorted low to high. The integers range from 1 to ~20,000,000. EDIT: also, the elements of A and B are unique, i.e. no repeated elements.)

The solution I've come up with feels a bit clunky. I'm using Perl, but the problem is more or less language agnostic.

First I make a hash, with one key for each number that appears in the union of A and B and values indicating whether each number appears in A, B, or both, e.g. $hash{5} = {a=>1, b=>1} if the number 5 appears in both data-sets. (If it only appeared in A, you'd have $hash{5} = {a=>1}.)
Next, I iterate over A to find all the hash elements that appear in A and B, mark them in the measure, and remove them from the hash.
Then, I sort all the hash keys and make each element of the hash point to its nearest neighbors, like a linked list, where a given hash element now looks like $hash{6} = {b=>1, previous=>4, next=>8}. The linked list doesn't know whether the next and previous elements are in A or B.
Then I loop over pair distances starting at d=1, and find all pairs with distance d, mark them, remove them from the hash, until there are no more elements of A to match.

The loop looks like this:

for ($d=1; @a > 0; $d++) {
    @left = ();
    foreach $a in @a {
        $next = $a;
        # find closest b ahead of $a, stop searching if you pass $d
        while (exists $hash{$next}{next} && $next - $a < $d) {
            $next = $hash{$next}{next};
        }
        if ($next is in B && $next - $a == $d) {
            # found a pair at distance $d
            mark_in_measure($a, $next);
            remove_from_linked_list($next);
            remove_from_linked_list($a);
            next;
        }

        # do same thing looking behind $a
        $prev = $a;
        ...

        # you didn't find a match for $a
        push @left, $a;
    }
    @a = @left;
}

This loop obviously prefers pairs that match b's that appear later than a's; I can't tell whether there's a sensible way to decide whether later is better than prior (better in terms of getting closer pairs). The main optimization I'm interested in is processing time.

Mudstone answered 12/10, 2011 at 15:50 Comment(7)

You say that with A = {1, 3, 4}, B = {1, 5, 6, 7}, you get a,b pairs 1,1, 4,5, 3,6, i.e., one pair each at distances 0, 1, and 3. Isn't the matchup 1,1, 3,5, 4,6, with distances 0, 2, 2 better? – Millett 12/10, 2011 at 16:20

en.wikipedia.org/wiki/Levenshtein_distance – Yseult 12/10, 2011 at 16:45

What do you use to cost the matchings though? Minimum sum of distances? Minimum sum of the square of distances? – Maegan 12/10, 2011 at 16:53

@jwpat7 I'm not sure if it is better since the total distance between pairs remains constant. It could be that it's possible to construct A and B such that always preferring shorter pairs would result in a higher sum of pair-distances, though it's not obvious to me how you would go about it. Even so, this simple greedy algorithm is good enough for my application. If optimization isn't very complicated, then I'd be interested in how you'd do it. – Mudstone 12/10, 2011 at 16:58

@missingno a cost function approach would be interesting to see, but I'm ok with an algorithm that always prefer shorter matches. – Mudstone 12/10, 2011 at 16:59

@flies: The problem is that saying "prefers shorter matches" is completely ambiguous. If you don't give a way to rate your matching then how do you say one would be better than the other? – Maegan 12/10, 2011 at 17:36

@missingo if there's a way to shuffle pairs so as to improve the distance of one pair, going from D to d, then you should do that so long as the shuffle doesn't disturb any pairs with distance less than d. – Mudstone 12/10, 2011 at 18:42

Sounds like you have a particular case of the Assignment Problem (finding a minimum matching in a weighted bipartite graph).

The algorithm to solve the assignment problem is too slow for you at O(N^3) but I'm pretty sure there you can shave some of this complexity off by exploiting your particular weight function or how you only want a histogram instead of the exact matching.

Maegan answered 12/10, 2011 at 16:59 Comment(0)

#!/usr/bin/perl

use strict;
use warnings FATAL => 'all';
use diagnostics;  

# http://www.hungarianalgorithm.com/solve.php?c=3-2-6-22--7-2-2-18--13-8-4-12--23-18-14-2&random=1
# https://www.topcoder.com/community/data-science/data-science-tutorials/assignment-problem-and-hungarian-algorithm/
# http://www.cse.ust.hk/~golin/COMP572/Notes/Matching.pdf

my @mat;
my @out_mat;

my $spaces    = 6;
my $precision = 0;

my $N = 10;
my $M = 12;
my $r = 100;

my @array1; my @array2;

for my $i (1..$N) {
    push @array1, sprintf( "%.${precision}f",  rand($r)  );
}

for my $i (1..$M) {
    push @array2, sprintf( "%.${precision}f",  rand($r)  );
}

#@array1 = ( 1, 3, 4);      # $mat[i]->[j] = abs( array1[i] - array2[j] )
#@array2 = ( 1, 5, 6, 7);

#                1     5     6     7

#     1     [    0*    4     5     6 ]

#     3     [    2     2*    3     4 ]

#     4     [    3     1     2*    3 ]

my $min_size  = $#array1 < $#array2 ? $#array1 : $#array2;
my $max_size  = $#array1 > $#array2 ? $#array1 : $#array2;

for (my $i = 0; $i < @array1; $i++){
   my @weight_function;
   for (my $j = 0; $j < @array2; $j++){
      my $dif = sprintf( "%.${precision}f", abs ($array1[$i] - $array2[$j])  );
      #my $dif = sprintf( "%.${precision}f", ($array1[$i] - $array2[$j])**2  ); 
      push @weight_function, $dif;
   }
   push @mat, \@weight_function;
}


# http://cpansearch.perl.org/src/TPEDERSE/Algorithm-Munkres-0.08/lib/Algorithm/Munkres.pm

Algorithm::Munkres::assign(\@mat,\@out_mat);


print "\n\@out_mat index  = [";
for my $index (@out_mat) {
   printf("%${spaces}d", $index);
}
print " ]\n";

print "\@out_mat values = [";

my %hash;
for my $i (0 .. $max_size){
   my $j = $out_mat[$i];
   last if ( $i > $min_size and $#array1 < $#array2 );
   next if ( $j > $min_size and $#array1 > $#array2 );
   my $dif = $mat[$i]->[$j];
   printf( "%${spaces}.${precision}f", $dif );
   $hash{ $dif } { $i } { 'index_array1' } = $i;
   $hash{ $dif } { $i } { 'index_array2' } = $j;
   $hash{ $dif } { $i } { 'value_array1' } = $array1[$i];
   $hash{ $dif } { $i } { 'value_array2' } = $array2[$j]; 
}

print " ]\n\n";


my $soma_da_dif = 0;

foreach my $min_diferenca ( sort { $a <=> $b } keys %hash ){
   foreach my $k ( sort { $a <=> $b } keys %{$hash{$min_diferenca}} ){
      $soma_da_dif += $min_diferenca;
      my $index_array1 = $hash{ $min_diferenca } { $k } { 'index_array1' };
      my $index_array2 = $hash{ $min_diferenca } { $k } { 'index_array2' };
      my $value_array1 = $hash{ $min_diferenca } { $k } { 'value_array1' };
      my $value_array2 = $hash{ $min_diferenca } { $k } { 'value_array2' };
      printf( "   index (%${spaces}.0f,%${spaces}.0f), values (%${spaces}.${precision}f,%${spaces}.${precision}f), dif = %${spaces}.${precision}f\n", 
              $index_array1, $index_array2, $value_array1, $value_array2, $min_diferenca );

   }
}
print "\n\nSum = $soma_da_dif\n";





#-------------------------------------------------#
#------------------ New-Package ------------------# 

{ # start scope block

package Algorithm::Munkres;

use 5.006;
use strict;
use warnings;

require Exporter;
our @ISA = qw(Exporter);
our @EXPORT = qw( assign );
our $VERSION = '0.08';

...
... <---- copy all the 'package Algorithm::Munkres' here
...

return $minval;
}

1;  # don't forget to return a true value from the file

} # end scope block

Melamie answered 14/10, 2016 at 2:47 Comment(0)

Recommended topics

Hot tags