I have two sets of integers A
and B
(size of A
less than or equal to B
), and I want to answer the question, "How close is A
to B
?". The way I want to answer this question is by producing a measure of how far you have to go from a given a
in A
to find a b
in B
.
The specific measure I want to produce does the following: for each a
, find the closest b
, the only catch being that once I match a b
with an a
, I can no longer use that b
to match any other a
's. (EDIT: the algorithm I'm trying to implement will always prefer a shorter match. So if b
is the nearest neighbor to more than one a
, pick the a
nearest to b
. I'm not sure what to do if more than one a
has the same distance to b
, right now I'm picking the a
that precedes b
, but that's quite arbitrary and not necessarily optimal.) The measure that I'll for make these sets, the final product, is a histogram showing the number of pairs in the vertical axis and the distance of the pairs in the x-axis.
So if A = {1, 3, 4}
and B = {1, 5, 6, 7}
, I will get the following a,b
pairs: 1,1
, 4,5
, 3,6
. For these data, the histogram should show one pair with distance zero, one pair with distance 1, and one pair with distance 3.
(The actual size of these sets has an upper bound around 100,000 elements, and I read them in from disk already sorted low to high. The integers range from 1 to ~20,000,000. EDIT: also, the elements of A
and B
are unique, i.e. no repeated elements.)
The solution I've come up with feels a bit clunky. I'm using Perl, but the problem is more or less language agnostic.
First I make a hash, with one key for each number that appears in the union of
A
andB
and values indicating whether each number appears inA
,B
, or both, e.g.$hash{5} = {a=>1, b=>1}
if the number 5 appears in both data-sets. (If it only appeared inA
, you'd have$hash{5} = {a=>1}
.)Next, I iterate over
A
to find all the hash elements that appear inA
andB
, mark them in the measure, and remove them from the hash.Then, I sort all the hash keys and make each element of the hash point to its nearest neighbors, like a linked list, where a given hash element now looks like
$hash{6} = {b=>1, previous=>4, next=>8}
. The linked list doesn't know whether the next and previous elements are inA
orB
.Then I loop over pair distances starting at
d=1
, and find all pairs with distanced
, mark them, remove them from the hash, until there are no more elements ofA
to match.
The loop looks like this:
for ($d=1; @a > 0; $d++) {
@left = ();
foreach $a in @a {
$next = $a;
# find closest b ahead of $a, stop searching if you pass $d
while (exists $hash{$next}{next} && $next - $a < $d) {
$next = $hash{$next}{next};
}
if ($next is in B && $next - $a == $d) {
# found a pair at distance $d
mark_in_measure($a, $next);
remove_from_linked_list($next);
remove_from_linked_list($a);
next;
}
# do same thing looking behind $a
$prev = $a;
...
# you didn't find a match for $a
push @left, $a;
}
@a = @left;
}
This loop obviously prefers pairs that match b
's that appear later than a
's; I can't tell whether there's a sensible way to decide whether later is better than prior (better in terms of getting closer pairs). The main optimization I'm interested in is processing time.
A
andB
such that always preferring shorter pairs would result in a higher sum of pair-distances, though it's not obvious to me how you would go about it. Even so, this simple greedy algorithm is good enough for my application. If optimization isn't very complicated, then I'd be interested in how you'd do it. – Mudstone