optimize nearest neighbor query on 70 million extremely high density spatial point cloud on SQL Server 2008
I have about 75 million records in a SQL Server 2008 R2 Express database. Each is a lat/long corresponding to some value. The table has a geography column. I am trying to find the single nearest neighbor for a given latitude/longitude (point). I already have a query with a spatial index in place, but depending on where the record sits in the database, say in the first quarter or the last quarter, the query can take from 3 to 30 seconds to find the nearest neighbor. I feel this could be made much faster by optimizing the query or the spatial index. Right now I have applied a spatial index with default settings. Here is what my table and query look like.

CREATE TABLE lidar(
    [id] [bigint] IDENTITY(1,1) NOT NULL,
    [POINTID] [int] NOT NULL,
    [GRID_CODE] [numeric](17, 8) NULL,
    [geom] [geography] NULL,
 CONSTRAINT [PK_lidar_1] PRIMARY KEY CLUSTERED ([id] ASC)
 WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
 ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

The spatial index I am using:

CREATE SPATIAL INDEX [SPATIAL_lidar] ON [dbo].[lidar] ([geom]) USING  GEOGRAPHY_GRID 
WITH (
GRIDS =(LEVEL_1 = MEDIUM,LEVEL_2 = MEDIUM,LEVEL_3 = MEDIUM,LEVEL_4 = MEDIUM), 
CELLS_PER_OBJECT = 16, PAD_INDEX  = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF,  
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

Here is the Query I am using:

declare @ms_at geography = 'POINT (-95.66 30.04)';
select TOP(1) nearPoints.geom.STAsText() as latlon 
from
(
select r.geom
from lidar r With(Index(SPATIAL_lidar))
where r.geom.STIntersects(@ms_at.STBuffer(1000)) = 1
) nearPoints

Here is a sample of the lat/longs in my database, to give an idea of accuracy and density. All 70 million records are for one city (lidar data):

POINT (-95.669434934023087 30.049513838913736)

Now this query gives me results as described above, but I want to improve the performance as much as possible. My guess is that by tweaking the default values of the spatial index I might be able to improve the performance. Any clues on this?

I tried varying the buffer from 10 to 1000, but with almost the same results.

Also any other suggestions to improve the performance are welcome.

Here is the system I am using right now:

Windows 7 64bit Professional
Intel(R) Core(TM)2 Quad CPU    Q9650  @ 3.00GHz (4 CPUs), ~3.0GHz
Ram: 8 GB
NVIDIA GeForce 9500 GT
Kelbee answered 11/7, 2011 at 18:42 Comment(0)

Sorry, but this is not a SQL answer, but a way of getting predictable performance assuming certain constraints on your data.

How often is the data changing? If possible, could you pre-calculate a graph of every entity's 5 nearest neighbors, and use that to speed up your selection?

If this data is mostly read-only, then...

How evenly are these points distributed? If the distribution is fairly even and well known, you could roll your own spatial mapping by bucketing every coordinate and indexing it in a hash table.

If you don't need to have the data in the database, move it out to a memory-mapped file for fast hash lookups (70M records should easily fit in memory).

I have used this architecture to generate sub-millisecond lookups for display advertising and search-engine relevance.

==Elaboration==

You simply create a grid of fixed-size squares (like a chessboard), map each point into the grid, and create a list of the objects which belong in each grid box -- if you adjust the size of each box correctly, you should have 5-50 points in each square on average. This is in principle a quad-tree, but without the tree, for simplicity.

For each bucket which is empty after you have scattered all the data into buckets, you add information about which nearby buckets do contain data.

You can now number each bucket left-to-right, line-by-line, so that each bucket has a unique number which can be calculated from the coordinates -- and you insert each bucket into a hash table, or, if space permits, just a straight lookup table.

Now when you have your query, you simply calculate which bucket it maps to, and you will get either a list of objects in that bucket, or an 'empty' bucket which contains pointers to the nearest buckets that have content.

That gives you your first candidate list of objects, and now you simply run through it and see which one is the closest.

In 99% of cases that would be it -- but if you are worried that some candidates in the adjacent buckets are actually closer, just check the 8 surrounding buckets and see if you can find anything closer there.
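The bucketing scheme above can be sketched in Python (a minimal illustration, not production code: the cell size is a made-up value you would tune so each cell averages 5-50 points, the distance is planar rather than geodesic, and the empty-bucket pointers are omitted -- this version just checks the 3x3 neighborhood):

```python
import math
from collections import defaultdict

CELL = 0.001  # hypothetical cell size in degrees; tune so cells average 5-50 points

def cell_of(lon, lat):
    """Map a coordinate to its (col, row) grid cell."""
    return (int(math.floor(lon / CELL)), int(math.floor(lat / CELL)))

def build_grid(points):
    """Scatter all (lon, lat) points into a dict keyed by cell."""
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(*p)].append(p)
    return grid

def nearest(grid, lon, lat):
    """Check the query cell plus the 8 surrounding cells for the closest point.

    Returns None if the whole 3x3 neighborhood is empty -- the full scheme
    described above would instead follow a pointer to the nearest non-empty
    bucket.
    """
    cx, cy = cell_of(lon, lat)
    best, best_d2 = None, float("inf")
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for px, py in grid.get((cx + dx, cy + dy), ()):
                d2 = (px - lon) ** 2 + (py - lat) ** 2  # planar approximation
                if d2 < best_d2:
                    best, best_d2 = (px, py), d2
    return best
```

The lookup cost is independent of the total point count: one hash computation plus a scan of at most 9 small buckets.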

If you also want a list of all the objects which are closest, then also calculate a simple network of the 5 nearest neighbours for each object, so you end up with a data structure like A->{B,C,D,E,F}, B->{A,D,G,H,I}, C->{A,J,K,G,M}...

That forms a simple network which you can traverse with a variation of Dijkstra's algorithm to get all the points adjacent to your nearest point.
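That traversal can be sketched like this (a toy illustration; the coordinates, neighbour lists, and radius are all hypothetical). Starting from the nearest point, it expands outward over the k-nearest-neighbour graph, collecting every point within a given straight-line radius:

```python
import heapq
import math

def points_within(graph, coords, start, radius):
    """Dijkstra-style expansion over a k-nearest-neighbour graph:
    collect every point reachable from `start` whose straight-line
    distance from `start` is at most `radius`, in order of distance.
    """
    def dist(a, b):
        ax, ay = coords[a]
        bx, by = coords[b]
        return math.hypot(ax - bx, ay - by)

    found, seen = [], {start}
    heap = [(0.0, start)]          # frontier ordered by distance from start
    while heap:
        d, node = heapq.heappop(heap)
        found.append(node)
        for nxt in graph[node]:
            if nxt not in seen and dist(start, nxt) <= radius:
                seen.add(nxt)
                heapq.heappush(heap, (dist(start, nxt), nxt))
    return found
```

Because each node only links to its 5 nearest neighbours, the expansion touches a handful of nodes rather than the whole dataset.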

Building the data structures will take time, but once done, and done right, a lookup returning a dataset can be done in sub-milliseconds (not including any HTTP or off-box communication, of course).

Hope this helps.

Dyak answered 11/7, 2011 at 20:43 Comment(2)
The data is read-only. Can you elaborate a bit more on how I go about "moving it out to a memory-mapped file for fast hash lookups"? Do you predict sub-millisecond performance even when I have to calculate the nearest point? Because my input lat/long may not be in the list as it is, right?Kelbee
Added a longer description -- hope your coding skills are good :-)Dyak

For a regular nearest-neighbor query on SQL Server 2008, try the approach Isaac has documented on his blog, which uses a numbers table to increase the bounds of the lookup until enough candidates have been found. The other recommendation would be to try varying your grid densities - HHHH or HHMM would likely work better for dense points.
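A sketch of that expanding-search pattern, assuming a `Numbers` table with a column `n` (1, 2, 3, ...) already exists (this is an illustration of the idea, not Isaac's exact code):

```sql
DECLARE @ms_at geography = 'POINT (-95.66 30.04)';

-- Widen the search radius in powers of two until a candidate appears;
-- the STDistance() range predicate lets the spatial index act as the
-- primary filter at each step.
SELECT TOP(1) p.latlon
FROM Numbers n
CROSS APPLY (
    SELECT TOP(1) r.geom.STAsText() AS latlon,
                  r.geom.STDistance(@ms_at) AS d
    FROM lidar r WITH (INDEX(SPATIAL_lidar))
    WHERE r.geom.STDistance(@ms_at) < POWER(2, n.n)
    ORDER BY r.geom.STDistance(@ms_at)
) p
ORDER BY n.n, p.d;
```

The point is that most lookups succeed at a small radius, so the expensive wide scans only run for genuinely isolated query points.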

In SQL Server Denali, optimized nearest-neighbor queries will also be supported natively using the regular SELECT TOP ... ORDER BY syntax.
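That native form would look like the following (in Denali the optimizer recognizes this shape and drives it through the spatial index; the `IS NOT NULL` predicate is part of the recognized pattern):

```sql
DECLARE @ms_at geography = 'POINT (-95.66 30.04)';

SELECT TOP(1) r.geom.STAsText() AS latlon
FROM lidar r
WHERE r.geom.STDistance(@ms_at) IS NOT NULL
ORDER BY r.geom.STDistance(@ms_at);
```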

As a side note, it looks like your example is missing an ORDER BY distance clause to go along with the TOP. Your current query will return a point that is within 1000 meters of the target, but not necessarily the nearest one.
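With the ORDER BY added, the query from the question would become (still limited to candidates within the 1000 m buffer):

```sql
DECLARE @ms_at geography = 'POINT (-95.66 30.04)';

SELECT TOP(1) r.geom.STAsText() AS latlon
FROM lidar r WITH (INDEX(SPATIAL_lidar))
WHERE r.geom.STIntersects(@ms_at.STBuffer(1000)) = 1
ORDER BY r.geom.STDistance(@ms_at);
```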

Thury answered 12/7, 2011 at 14:14 Comment(0)

Have you tried playing around with the index settings? I generally use higher objects per cell and "high" for all the Grid levels. See below:

CREATE SPATIAL INDEX [SPATIAL_lidar] ON [dbo].[lidar] ([geom]) USING GEOGRAPHY_GRID 
WITH (
GRIDS =(LEVEL_1 = HIGH,LEVEL_2 = HIGH,LEVEL_3 = HIGH,LEVEL_4 = HIGH), 
CELLS_PER_OBJECT = 64, PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, 
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
Martelli answered 13/7, 2011 at 8:23 Comment(0)

Read this article on spatial indexing.

My guess is that your primary filter is very inefficient. You probably need your grid density to be HIGH for the first level, since your points are very dense. One problem I've been struggling with is how to make LEVEL_1 have a density greater than 256 (HIGH). I'm amazed Microsoft forces us to choose from only 3 options for grid density.

Allopathy answered 23/8, 2011 at 16:58 Comment(0)
