SQL efficient nearest neighbour query

I'm having trouble coming up with an efficient SQL query to handle the following situation:

Assume we have a table with two columns

groupId : int 
value : float

The table is huge (several million rows). There is a varying number of values per groupId - say somewhere between 100 and 50,000. All float values are greater than or equal to zero but are otherwise unbounded.

For a given groupId the query should return all other groups sorted by decreasing similarity, where "similar" is defined as the minimum Euclidean distance over all possible pairs of 30-value vectors drawn from the two groups.

That definition of similarity is what kills me. I think the naive algorithm for calculating similarity as defined above is O(n^2). Now I'm looking for ideas to either redefine "similarity" or find an efficient implementation of the above. I could imagine a solution involving k-nearest neighbours, something like PostGIS geometrical nearest neighbours, or maybe a longest common subsequence algorithm (although I'd need a "fuzzy" implementation of the latter because values will hardly ever compare exactly equal).

We are currently on MySQL in case it matters.

cheers,

Sören
Soil answered 6/4, 2009 at 9:25 Comment(7)
"There are a varying amount of "values" per "groupId" - say something between 100 and 50.000" and "all possible pairs of 30 values in two groups" confuses me. Can you clarify, or maybe give an idea of how the naive approach would work?Inference
How many groups are you typically dealing with?Death
Danbruc (the first answer below) describes the problem much better than I did. Maybe his analysis will clarify the problem? We currently have ~500 groups and ~1,800,000 values. We do hope to scale this to several hundred thousand groups, though. The current setup is just a small test case.Soil
When you say the "minimum euclidian distance between all possible pairs of 30 values", do you mean {10, 100} is closer to {10, 9999} (because they are 0 apart), or closer to {20, 90} (because the minimum total distance is 10 + 10 = 20)?Directrix
dist((10,100),(10,9999)) = sqrt((10-10)^2 + (9999-100)^2) = large; dist((10,100),(20,90)) = sqrt((10-20)^2 + (100-90)^2) = smaller.Soil
Take a look at my answer and see if that gets you done.Astray

Could you verify that I got the question right?

Your table represents vectors identified by groupId. Every vector has a dimension of something between 100 and 50,000, but there is no order defined on the components. That is, a vector from the table is actually a representative of an equivalence class (all permutations of its components).

Now you define the similarity of two equivalence classes as the minimum Euclidean distance between the projections of any two representatives of the equivalence classes onto the subspace of the first 30 dimensions.

Example of projection onto two dimensions:

A = <1, 2, 3, 4>
B = <5, 6, 7, 8, 9, 10>

A represents the following equivalence class of vectors.

<1, 2, 3, 4>    <2, 1, 3, 4>    <3, 1, 2, 4>    <4, 1, 2, 3>
<1, 2, 4, 3>    <2, 1, 4, 3>    <3, 1, 4, 2>    <4, 1, 3, 2>
<1, 3, 2, 4>    <2, 3, 1, 4>    <3, 2, 1, 4>    <4, 2, 1, 3>
<1, 3, 4, 2>    <2, 3, 4, 1>    <3, 2, 4, 1>    <4, 2, 3, 1>
<1, 4, 2, 3>    <2, 4, 1, 3>    <3, 4, 1, 2>    <4, 3, 1, 2>
<1, 4, 3, 2>    <2, 4, 3, 1>    <3, 4, 2, 1>    <4, 3, 2, 1>

Projecting all representatives of this equivalence class onto the first two dimensions yields:

<1, 2>    <1, 3>    <1, 4>
<2, 1>    <2, 3>    <2, 4>
<3, 1>    <3, 2>    <3, 4>
<4, 1>    <4, 2>    <4, 3>

B represents an equivalence class with 720 elements. The projection onto the first two dimensions yields 30 elements.

< 5, 6>    < 5, 7>    < 5, 8>    < 5, 9>    < 5, 10>
< 6, 5>    < 6, 7>    < 6, 8>    < 6, 9>    < 6, 10>
< 7, 5>    < 7, 6>    < 7, 8>    < 7, 9>    < 7, 10>
< 8, 5>    < 8, 6>    < 8, 7>    < 8, 9>    < 8, 10>
< 9, 5>    < 9, 6>    < 9, 7>    < 9, 8>    < 9, 10>
<10, 5>    <10, 6>    <10, 7>    <10, 8>    <10,  9>

So the distance of A and B is the square root of 8, because this is the minimum distance of two vectors from the projections. For example <3, 4> and <5, 6> yield this distance.
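This worked example can be checked by brute force. A minimal Python sketch (the function name is mine), enumerating all ordered 2-dimensional projections of both representatives:

```python
import itertools
import math

A = [1, 2, 3, 4]
B = [5, 6, 7, 8, 9, 10]

def min_projection_distance(a, b, k):
    """Minimum Euclidean distance over all ordered k-dimensional
    projections (k-permutations of the components) of two vectors."""
    return min(
        math.dist(pa, pb)
        for pa in itertools.permutations(a, k)
        for pb in itertools.permutations(b, k)
    )

print(min_projection_distance(A, B, 2))  # sqrt(8), about 2.8284
```

This brute force is only feasible for toy inputs; the number of projections explodes with the vector size, which is exactly the problem with the naive approach.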

So, am I right with my understanding of the problem?

A really naive algorithm for n vectors with m components each would have to calculate (n - 1) group distances. For each group distance the algorithm would have to consider every ordered projection onto 30 dimensions, of which there are m! / (m - 30)! per vector. So for 100 dimensions (your lower bound) there are 100!/70!, roughly 7.8*10^57, possible projections per vector. This requires calculating about 6*10^115 distances between projections and finding the minimum just to obtain the distance between two vectors. And then repeat this n times.

I hope that I misunderstood you or made a mistake. Otherwise this sounds like something between really challenging and not feasible.

Something I thought about is ordering the vector components and trying to match them. Using Manhattan distance, if possible, may help to simplify the solution.
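As one building block for the ordering idea (this is my own sketch, not part of the original answer): once each group's values are sorted, the closest single pair of values between two groups can be found with one merge-style scan instead of comparing all pairs:

```python
def nearest_pair_gap(xs, ys):
    """Smallest |x - y| over all x in xs and y in ys, found by a
    two-pointer scan of the sorted lists: O(n log n) for the sorts
    plus O(n) for the scan, instead of O(n^2) pairwise comparisons."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    best = float("inf")
    while i < len(xs) and j < len(ys):
        best = min(best, abs(xs[i] - ys[j]))
        # Advance whichever side is smaller; the other combinations
        # with the current element can only be farther apart.
        if xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return best

print(nearest_pair_gap([1.0, 10.0, 20.0], [12.0, 30.0]))  # 2.0
```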

Flutterboard answered 6/4, 2009 at 19:14 Comment(1)
Yes, you understood the problem perfectly and explained it much better than I did. Ordering the vectors is what I have been thinking about as well - hence my mention of LCS (longest common subsequence). I'll look into whether Manhattan distance might be of use for us.Soil

Here are some nice approximations:

You could calculate the center of mass of each group (in one dimension this is just the mean of its values) and then compare groups based on the distance between their centers of mass.
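A minimal sketch of the center-of-mass idea, using hypothetical sample data (names and values are mine):

```python
from statistics import mean

groups = {  # hypothetical sample data: groupId -> list of values
    1: [0.5, 1.0, 1.5],
    2: [10.0, 11.0],
    3: [0.9, 1.1, 1.3],
}

def rank_by_center(groups, query_id):
    """Rank all other groups by distance between group means
    (the one-dimensional 'center of mass')."""
    c = mean(groups[query_id])
    return sorted(
        (gid for gid in groups if gid != query_id),
        key=lambda gid: abs(mean(groups[gid]) - c),
    )

print(rank_by_center(groups, 1))  # [3, 2]: group 3's mean is closest
```

This is only an approximation of the similarity defined in the question, but it reduces each group to one number, so ranking becomes cheap.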

Another way you could do it is by hashing the value of each row; rows that hash to the same bucket are considered similar, and the similarity score of the two groups involved is updated.
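A sketch of the hashing idea, assuming a simple fixed-width grid as the hash (function name and cell size are my own choices):

```python
from collections import defaultdict

def similarity_by_grid(rows, cell=1.0):
    """rows: iterable of (groupId, value) pairs. Snap each value to a
    grid cell; every cell shared by two groups bumps their similarity."""
    cells = defaultdict(set)              # cell index -> groupIds seen there
    for gid, value in rows:
        cells[int(value // cell)].add(gid)
    score = defaultdict(int)              # (gidA, gidB) -> shared-cell count
    for gids in cells.values():
        ordered = sorted(gids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                score[(a, b)] += 1
    return dict(score)

rows = [(1, 0.2), (1, 3.4), (2, 0.7), (2, 9.0)]
print(similarity_by_grid(rows))  # {(1, 2): 1}: one shared cell
```

The cell width controls the trade-off: coarser cells find more matches but are less precise.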

Some more information would be helpful such as:

Is the information constantly being updated, and if so, at what interval? How up to date and how accurate does it need to be?

Sensory answered 7/4, 2009 at 1:47 Comment(3)
Center of mass in 1 dimension? Wouldn't that just be the median or mean? Or do you mean the center of mass of all possible 30-value vector permutations? Hashing would basically mean quantizing all values, right? I.e. we'd snap all values to a grid?Soil
Existing information is never updated - only new groups added. Say 100 per day. Accuracy would be nice but is not critical. This whole setup is a preprocessing step. The idea is to get the most likely "matches" from the database and proceed to test those with a much more expensive stand-alone tool.Soil
I didn't read the first answer which clears things up. I am not sure my answer is any good given this.Sensory

All float values are greater or equal to zero but are otherwise unbounded.

If you want to do KNN on floats, use the btree_gist module for PostgreSQL and create a GiST index.

Also, for data types for which there is a natural distance metric, btree_gist defines a distance operator <->, and provides GiST index support for nearest-neighbor searches using this operator. Distance operators are provided for int2, int4, int8, float4, float8, timestamp with time zone, timestamp without time zone, time without time zone, date, interval, oid, and money.

float8 is double precision.
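A sketch of what that looks like in practice, assuming the table and column names from the question:

```sql
-- One-time setup
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX g_value_gist ON g USING gist (value);

-- Ten values nearest to 42.0, served directly by the GiST index
SELECT groupId, value
FROM g
ORDER BY value <-> 42.0
LIMIT 10;
```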

Astray answered 8/8, 2018 at 5:24 Comment(0)

The naive version would be something like this: (not run through query analyser)

select groupid, min(distance) as mindist
from
   (select other.groupid as groupid,
           abs(other.value - us.value) as distance
    from g us
    join g other on other.groupid != us.groupid
    where us.groupid = ?) d
group by groupid
order by mindist

Then, to take advantage of indices:

select groupid, min(abs(value - usvalue)) as mindist
from
   (select other.groupid as groupid,
           max(other.value) as value,
           us.value as usvalue
    from g us
    join g other on other.groupid != us.groupid and other.value <= us.value
    where us.groupid = ?
    group by other.groupid, us.value

    union

    select other.groupid as groupid,
           min(other.value) as value,
           us.value as usvalue
    from g us
    join g other on other.groupid != us.groupid and other.value >= us.value
    where us.groupid = ?
    group by other.groupid, us.value) d
group by groupid
order by mindist

This should hopefully allow MySQL to use an index to quickly find the nearest neighbours on the join.

There might be errors in this, but hopefully this line of thought will help.
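To illustrate the intended shape of the naive query (raw distances in the derived table, aggregation in the outer query, GROUP BY before ORDER BY, and the derived table aliased), here is a minimal runnable sketch using SQLite and invented sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE g (groupid INTEGER, value REAL);
    CREATE INDEX g_group_value ON g (groupid, value);
    INSERT INTO g VALUES (1, 1.0), (1, 5.0),
                         (2, 4.8), (2, 50.0),
                         (3, 30.0), (3, 31.0);
""")

# For query group 1, rank every other group by its single closest value.
rows = con.execute("""
    SELECT groupid, MIN(distance) AS mindist
    FROM (SELECT other.groupid AS groupid,
                 ABS(other.value - us.value) AS distance
          FROM g AS us
          JOIN g AS other ON other.groupid <> us.groupid
          WHERE us.groupid = ?) AS d
    GROUP BY groupid
    ORDER BY mindist
""", (1,)).fetchall()
print(rows)  # group 2 ranks first: |4.8 - 5.0| = 0.2
```

As the comments below note, this ranks by the single closest pair of values, which is not quite the 30-value similarity from the question.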

Directrix answered 7/4, 2009 at 2:21 Comment(5)
Thanks FryGuy. This is pretty much what we have tried but it doesn't scale at all. I'll experiment with variations of the above and post results.Soil
Do you have indices on both groupid and value?Directrix
Yes. MySQL "explain" (the query execution plan) looks as good as it'll get as far as I can tell.Soil
This SQL statement might use some coordinates multiple times. Is this allowed or desired? For <1, 2, 3, 4> and <5, 100, 100, 100> the distance calculation will use the first coordinate 5 four times.Hacksaw
Well, I misread the question. This query will not answer your question: it orders by the minimum distance among all distances from group1 to group2 (not by the sum of minimum distances). Should I remove this answer?Directrix

© 2022 - 2024 — McMap. All rights reserved.