How can I extend this SQL query to find the k nearest neighbors?
Asked Answered
I

2

9

I have a database full of two-dimensional data - points on a map. Each record has a field of the geometry type. What I need to be able to do is pass a point to a stored procedure which returns the k nearest points (k would also be passed to the sproc, but that's easy). I've found a query at http://blogs.msdn.com/isaac/archive/2008/10/23/nearest-neighbors.aspx which gets the single nearest neighbour, but I can't figure how to extend it to find the k nearest neighbours.

This is the current query - T is the table, g is the geometry field, @x is the point to search around, Numbers is a table with integers 1 to n:

DECLARE @start FLOAT = 1000; 
WITH NearestPoints AS
(
     SELECT TOP(1) WITH TIES *,  T.g.STDistance(@x) AS dist
     FROM Numbers JOIN T WITH(INDEX(spatial_index)) 
     ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)
     ORDER BY n
)
SELECT TOP(1) * FROM NearestPoints
ORDER BY n, dist

The inner query selects the nearest non-empty region and the outer query then selects the top result from that region; the outer query can easily be changed to (e.g.) SELECT TOP(20), but if the nearest region only contains one result, you're stuck with that.

I figure I probably need to recursively search for the first region containing k records, but without using a table variable (which would cause maintenance problems as you have to create the table structure and it's liable to change - there're lots of fields), I can't see how.

Isocyanide answered 26/3, 2010 at 11:43 Comment(3)
What affect does changing the INNER query to more than TOP(1) have on the results when finding k records? (when the nearest region only contains one result)Chemoprophylaxis
If you change the inner query to select more regions, you can get more results, but this doesn't guarantee more results: the other regions may just contain the same single result (they increase in size exponentially) - e.g. imagine searching around a point which has one point nearby, but no other points for hundreds of kilometres around - the first n regions will just contain the same 1 point.Isocyanide
Was a working solution to this ever found? I'm looking for the same solution.Thrown
A
2

What happens if you remove TOP (1) WITH TIES from the inner query, and set the outer query to return the top k rows?

I'd also be interested to know whether this amendment helps at all. It ought to be more efficient than using TOP:

DECLARE @start FLOAT = 1000
        ,@k INT = 20
        ,@p FLOAT = 2;

WITH NearestPoints AS
(
     SELECT *
            ,T.g.STDistance(@x) AS dist
            ,ROW_NUMBER() OVER (ORDER BY T.g.STDistance(@x)) AS rn
     FROM Numbers 
     JOIN T WITH(INDEX(spatial_index)) 
     ON   T.g.STDistance(@x) <  @start*POWER(@p,Numbers.n)
     AND (Numbers.n - 1 = 0 
          OR T.g.STDistance(@x) >= @start*POWER(@p,Numbers.n - 1)
         )
)
SELECT * 
FROM NearestPoints
WHERE rn <= @k;

NB - untested - I don't have access to SQL 2008 here.

Abdullah answered 26/3, 2010 at 12:27 Comment(12)
@Isocyanide - Try the amendment I've made - perhaps there's an implicit cast to int in there somewhere (although I can't see it)Abdullah
It's a fault in the source query - POWER returns the datatype of its first argument (the constant 2 is interpreted as INT). Amended my query to reflect this.Abdullah
@Ed Harper, this works insofar that, if you set k to e.g. 40, you get 40 points back. but it returns duplicates if the points are spread over multiple regions, because region n contains all the points in region n -1. of course these could be filtered out afterward (the records have an ID field), but then you end up with less than k points! It does appear to be faster than using TOP, though!Isocyanide
@Isocyanide - I don't think the original query works as advertised for any value of k other than 1 - it's a fundamental problem of this approach. To avoid it, the inner query would need to search excluding the contents of all the previous groups already searched. This should be possible, but I don't have time to look at it right now (essentially, the ON statement in the join needs to consider a lower bound and an upper bound, rather than just an upper bound).Abdullah
@Isocyanide - I've made another amendment to reflect my previous note.Abdullah
@Ed Harper - this amendment doesn't quite work because POWER(2, 0) is 1, so you never consider points that are less than the value @float away. The second clause needs to be AND (Numbers.n - 1 = 0 OR T.g.STDistance(@x) >= @start*POWER(@p,Numbers.n - 1)) to get around this. However, I have uncovered another problem: because the results of the inner query are not ordered by distance, you can end up with bad results. e.g. when searching around (0,0): if @start is 100, and the number points at (0,0) > @k, points at (1,1) may be erroneously returned because they're within @start range.Isocyanide
I tried modifiying the OVER clause to (ORDER BY n, location.eastNorthGeom.STDistance(@g)), but this gives arithmetic overflow errors - is it possible to sort by a float column?Isocyanide
@Isocyanide - there's no restriction on sorting by float - there must be another implicit cast to INT in there somewhere - again, it's not immediately obvious. n isn't needed in the ORDER BY of the ranking function, so I've removed it in the code above. I've also incorporated your fix for the first distance band.Abdullah
@Ed Harper I found the problem - POWER(@p,Numbers.n) actually overflows float. Just needed a WHERE numbers.n < [sensible number] on the end of the inner query.Isocyanide
@Isocyanide - glad you got there in the end!Abdullah
@Henry - the questioner accepted the final answer (although, reading his comments, it required further modification) - so yes, I guess you could say that it has been testedAbdullah
@Ed Maybe you should edit the answer to include the WHERE clause suggested by Sighs @ Mar 29 at 12:11?Commonalty
C
2

Quoted from Inside Microsoft® SQL Server® 2008: T-SQL Programming. Section 14.8.4.

The following query will return the 10 points of interest nearest to @input:

DECLARE @input GEOGRAPHY = 'POINT (-147 61)';
DECLARE @start FLOAT = 1000;
WITH NearestNeighbor AS(
  SELECT TOP 10 WITH TIES
    *, b.GEOG.STDistance(@input) AS dist
  FROM Nums n JOIN GeoNames b WITH(INDEX(geog_hhhh_16_sidx)) -- index hint
  ON b.GEOG.STDistance(@input) < @start*POWER(CAST(2 AS FLOAT),n.n)
  AND b.GEOG.STDistance(@input) >=
    CASE WHEN n = 1 THEN 0 ELSE @start*POWER(CAST(2 AS FLOAT),n.n-1) END
  WHERE n <= 20
  ORDER BY n
)
  SELECT TOP 10 geonameid, name, feature_code, admin1_code, dist
  FROM NearestNeighbor
  ORDER BY n, dist;

Note: Only part of this query’s WHERE clause is supported by the spatial index. However, the query optimizer correctly evaluates the supported part (the "<" comparison) using the index. This restricts the number of rows for which the ">=" part must be tested, and the query performs well. Changing the value of @start can sometimes speed up the query if it is slower than desired.

Listing 2-1. Creating and Populating Auxiliary Table of Numbers

SET NOCOUNT ON;
USE InsideTSQL2008;

IF OBJECT_ID('dbo.Nums', 'U') IS NOT NULL DROP TABLE dbo.Nums;

CREATE TABLE dbo.Nums(n INT NOT NULL PRIMARY KEY);
DECLARE @max AS INT, @rc AS INT;
SET @max = 1000000;
SET @rc = 1;

INSERT INTO Nums VALUES(1);
WHILE @rc * 2 <= @max
BEGIN
  INSERT INTO dbo.Nums SELECT n + @rc FROM dbo.Nums;
  SET @rc = @rc * 2;
END

INSERT INTO dbo.Nums
  SELECT n + @rc FROM dbo.Nums WHERE n + @rc <= @max;
Commonalty answered 24/8, 2010 at 21:38 Comment(1)
I don't have access to the tools to test this answer any more, so I'm hesitant to mark it as the accepted answer over Ed's -- sorry!Isocyanide

© 2022 - 2024 — McMap. All rights reserved.