Selecting a good SQL Server 2008 spatial index with large polygons
I'm having some fun trying to pick a decent SQL Server 2008 spatial index setup for a data set I am dealing with.

The dataset is polygons, representing contours over the whole globe. There are 106,000 rows in the table, the polygons are stored in a geometry field.

The issue I have is that many of the polygons cover a large portion of the globe. This seems to make it very hard to get a spatial index that will eliminate many rows in the primary filter. For example, look at the following query:

SELECT "ID","CODE","geom".STAsBinary() as "geom" FROM "dbo"."ContA"
WHERE "geom".Filter(
  geometry::STGeomFromText('POLYGON ((-142.03193662573682 59.53396984952896,
    -142.03193662573682 59.88928136451884,
    -141.32743833481925 59.88928136451884,
    -141.32743833481925 59.53396984952896,
    -142.03193662573682 59.53396984952896))', 4326)
) = 1

This queries an area which intersects with only two of the polygons in the table. But no matter what combination of spatial index settings I choose, that Filter() always returns around 60,000 rows.

Replacing Filter() with STIntersects() returns just the two polygons I want, but takes twice as long (Filter() is 6 seconds, STIntersects() is 12 seconds).

Can anyone give me any hints on whether there is a spatial index setup that is likely to improve on 60,000 rows, or is my dataset just not a good match for SQL Server's spatial indexing?

More info:

As suggested, I split the polygons up, using a 4x4 grid across the globe. I couldn't see a way to do it with QGIS, so I wrote my own query to do it. First I defined 16 bounding boxes, the first looked like this:

declare  @box1 geometry = geometry::STGeomFromText('POLYGON ((
-180 90,
-90 90,
-90 45,
-180 45,
-180 90))', 4326)

Then I used each bounding box to select and truncate the polygons that intersected that box:

insert ContASplit
select CODE, geom.STIntersection(@box1), CODE_DESC from ContA
where geom.STIntersects(@box1) = 1

I obviously did this for all 16 bounding boxes in the 4x4 grid. The end result is that I have a new table with ~107,000 rows (which confirms that I didn't actually have many huge polygons).

I added a spatial index with 1024 cells per object and LOW, LOW, LOW, LOW for the grid density at each level.

However, very oddly, this new table with the split polygons still performs the same as the old one. Doing the .Filter() listed above still returns ~60,000 rows. I really don't understand this at all; clearly I don't understand how the spatial index actually works.

Paradoxically, while .Filter() still returns ~60,000 rows, performance has improved. The .Filter() now takes around 2 seconds rather than 6, and the .STIntersects() now takes 6 seconds rather than 12.

As requested here is an example of the SQL for the index:

CREATE SPATIAL INDEX [contasplit_sidx] ON [dbo].[ContASplit] 
(
    [geom]
)USING  GEOMETRY_GRID 
WITH (
BOUNDING_BOX =(-90, -180, 90, 180),
GRIDS =(LEVEL_1 = LOW,LEVEL_2 = LOW,LEVEL_3 = LOW,LEVEL_4 = LOW), 
CELLS_PER_OBJECT = 1024,
PAD_INDEX  = OFF,
SORT_IN_TEMPDB = OFF,
DROP_EXISTING = OFF,
ALLOW_ROW_LOCKS  = ON,
ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

Though remember, I have tried a whole range of different settings for the grids and cells per object, with the same results each time.

Here are the results of running sp_help_spatial_geometry_index, this is on my split dataset where no single polygon occupies more than 1/16th of the globe:

Base_Table_Rows 215138
Bounding_Box_xmin -90
Bounding_Box_ymin -180
Bounding_Box_xmax 90
Bounding_Box_ymax 180
Grid_Size_Level_1 64
Grid_Size_Level_2 64
Grid_Size_Level_3 64
Grid_Size_Level_4 64
Cells_Per_Object 16
Total_Primary_Index_Rows 378650
Total_Primary_Index_Pages 1129
Average_Number_Of_Index_Rows_Per_Base_Row 1
Total_Number_Of_ObjectCells_In_Level0_For_QuerySample 1
Total_Number_Of_ObjectCells_In_Level0_In_Index 60956
Total_Number_Of_ObjectCells_In_Level1_In_Index 361
Total_Number_Of_ObjectCells_In_Level2_In_Index 2935
Total_Number_Of_ObjectCells_In_Level3_In_Index 32420
Total_Number_Of_ObjectCells_In_Level4_In_Index 281978
Total_Number_Of_Interior_ObjectCells_In_Level2_In_Index 1
Total_Number_Of_Interior_ObjectCells_In_Level3_In_Index 49
Total_Number_Of_Interior_ObjectCells_In_Level4_In_Index 4236
Total_Number_Of_Intersecting_ObjectCells_In_Level1_In_Index 29
Total_Number_Of_Intersecting_ObjectCells_In_Level2_In_Index 1294
Total_Number_Of_Intersecting_ObjectCells_In_Level3_In_Index 29680
Total_Number_Of_Intersecting_ObjectCells_In_Level4_In_Index 251517
Total_Number_Of_Border_ObjectCells_In_Level0_For_QuerySample 1
Total_Number_Of_Border_ObjectCells_In_Level0_In_Index 60956
Total_Number_Of_Border_ObjectCells_In_Level1_In_Index 332
Total_Number_Of_Border_ObjectCells_In_Level2_In_Index 1640
Total_Number_Of_Border_ObjectCells_In_Level3_In_Index 2691
Total_Number_Of_Border_ObjectCells_In_Level4_In_Index 26225
Interior_To_Total_Cells_Normalized_To_Leaf_Grid_Percentage 0.004852925
Intersecting_To_Total_Cells_Normalized_To_Leaf_Grid_Percentage 0.288147586
Border_To_Total_Cells_Normalized_To_Leaf_Grid_Percentage 99.70699949
Average_Cells_Per_Object_Normalized_To_Leaf_Grid 405.7282349
Average_Objects_PerLeaf_GridCell 0.002464704
Number_Of_SRIDs_Found 1
Width_Of_Cell_In_Level1 2.8125
Width_Of_Cell_In_Level2 0.043945313
Width_Of_Cell_In_Level3 0.000686646
Width_Of_Cell_In_Level4 1.07E-05
Height_Of_Cell_In_Level1 5.625
Height_Of_Cell_In_Level2 0.087890625
Height_Of_Cell_In_Level3 0.001373291
Height_Of_Cell_In_Level4 2.15E-05
Area_Of_Cell_In_Level1 1012.5
Area_Of_Cell_In_Level2 15.8203125
Area_Of_Cell_In_Level3 0.247192383
Area_Of_Cell_In_Level4 0.003862381
CellArea_To_BoundingBoxArea_Percentage_In_Level1 1.5625
CellArea_To_BoundingBoxArea_Percentage_In_Level2 0.024414063
CellArea_To_BoundingBoxArea_Percentage_In_Level3 0.00038147
CellArea_To_BoundingBoxArea_Percentage_In_Level4 5.96E-06
Number_Of_Rows_Selected_By_Primary_Filter 60956
Number_Of_Rows_Selected_By_Internal_Filter 0
Number_Of_Times_Secondary_Filter_Is_Called 60956
Number_Of_Rows_Output 2
Percentage_Of_Rows_NotSelected_By_Primary_Filter 71.66655821
Percentage_Of_Primary_Filter_Rows_Selected_By_Internal_Filter 0
Internal_Filter_Efficiency 0
Primary_Filter_Efficiency 0.003281055

"Base_Table_Rows 215138" doesn't make much sense to me; there are 107,000 rows in the table, not 215,000.

When rendered the data set looks like this: [image of the rendered contour polygons]
(source: norman.cx)

Further research:

I continue to be puzzled by the poor performance of the primary filter with this data. So I did a test to see exactly how my data splits up. With my original unsplit features I added a "cells" column to the table. I then ran 16 queries to count how many cells in a 4x4 grid the feature spanned. So I ran a query like this for each cell:

declare  @box1 geometry = geometry::STGeomFromText('POLYGON ((
-180 90,
-90 90,
-90 45,
-180 45,
-180 90))', 4326)
update ContA set cells = cells + 1 where
geom.STIntersects(@box1) = 1

If I then look at the "cells" column in the table, there are only 672 features in the whole of my data set that intersect more than one cell of the 4x4 grid. So how on Earth, quite literally, can the primary filter return 60,000 features for a query looking at a small, 200-mile-wide rectangle?

At this point it looks like I could write my own indexing scheme that would perform better than SQL Server's does for these features.

Hackworth answered 27/5, 2010 at 12:14 Comment(14)
Is the spatial index definitely being used in the query? blogs.msdn.com/b/isaac/archive/2008/08/29/…Cnut
Yes, the spatial index is definitely being used.Hackworth
It sounds as though you have a wide range of polygon sizes. If you only need to display data then consider creating a tile cache as a one off process, or converting your data to a raster dataset.Cnut
Unfortunately in the situation I am using the dataset it is impractical to render the entire tile cache down to the level that the users will be able to zoom in to.Hackworth
I'd recommend getting in touch with the developers of SQL Server 2008 Spatial. They often post on geospatial blogs and seem open to comments etc. Try the "Email Blog Author" button at blogs.msdn.com/b/isaac/archive/2008/03/01/…Cnut
I already posted the same question over on the TechNet SQL Server Spatial, but haven't had any replies yet. I'll try the blog too, thanks.Hackworth
It looks like you have 60,956 messed up features. Total_Number_Of_Border_ObjectCells_In_Level0_In_Index 60956 indicates that all these features touch your top level bounding box. Can you try selecting these with an intersect to inspect the geometry for these features? Also is this dataset publically available?Cnut
Running select COUNT(*) from ContAsplit where geom.STEnvelope().STPointN((1)).STX = -180 or geom.STEnvelope().STPointN((3)).STX = 180 or geom.STEnvelope().STPointN((1)).STY = -90 or geom.STEnvelope().STPointN((3)).STY = 90 shows that 329 of my features touch the bbox. I have already run a query that shows none of them stray outside of the bbox.Hackworth
The data is the ContA contour shapefile from Collins Bartholomew World Regular data set. So no, it isn't publicly available without a licence. I've added an example of what it looks like when rendered.Hackworth
I'd definitely consider making a large TIFF of the rendered data, and serving this out as a WMS. It would be faster than any vector / index dataset. How did you import the data to SQL Server? A lot of the free tools are not great. Maybe try an import with FME, and also try a different projection - EPSG 900913 / 3785.Cnut
I can't see how I could practically do that. We are serving out tiles to Bing Maps. We are supporting zooming down to at least level 10. I make that a TIFF of 1,048,576 x 1,048,576 pixels in size !Hackworth
I've done a little bit more research and am more puzzled than ever, see the updated question.Hackworth
What are the results for your new query if you use Filter rather than intersects? What is the resolution of your dataset? You'd only need to create a raster with resolution equal to the accuracy of the original data. There's no need to create pixels representing 1m of land if the contour dataset is accurate to 5km.Cnut
Wish I'd paid more attention to Bounding_Box_xmin -90 Bounding_Box_ymin -180 Bounding_Box_xmax 90 Bounding_Box_ymax 180!Cnut

In your index query you use:

CREATE SPATIAL INDEX [contasplit_sidx] ON [dbo].[ContASplit] 
(
    [geom]
)USING  GEOMETRY_GRID 
WITH (
BOUNDING_BOX =(-90, -180, 90, 180),
...

The BOUNDING_BOX therefore maps to:

xmin = -90
ymin = -180
xmax = 90
ymax = 180
  • Longitude (-180 to 180, designating East / West of the Meridian) should map to X
  • Latitude (-90 to 90, designating how far North or South of the Equator) should map to Y

So to create the BOUNDING_BOX for the world you should use:

CREATE SPATIAL INDEX [contasplit_sidx] ON [dbo].[ContASplit] 
(
    [geom]
)USING  GEOMETRY_GRID 
WITH (
BOUNDING_BOX =(-180, -90, 180, 90),
...

This should create an index that fits your data and means all your features are covered by the index.
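As a sketch, the corrected index can be rebuilt in place by reusing the GRIDS and CELLS_PER_OBJECT settings already shown in the question and setting DROP_EXISTING = ON (adjust the settings to whatever your tuning shows works best):

```sql
CREATE SPATIAL INDEX [contasplit_sidx] ON [dbo].[ContASplit]
(
    [geom]
) USING GEOMETRY_GRID
WITH (
    -- BOUNDING_BOX is (xmin, ymin, xmax, ymax): longitude range first, then latitude
    BOUNDING_BOX = (-180, -90, 180, 90),
    GRIDS = (LEVEL_1 = LOW, LEVEL_2 = LOW, LEVEL_3 = LOW, LEVEL_4 = LOW),
    CELLS_PER_OBJECT = 1024,
    DROP_EXISTING = ON  -- replace the existing index with the wrong bounding box
);
```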

Cnut answered 7/6, 2010 at 10:21 Comment(1)
Thank you, thank you, thank you. It would appear that the tool I used to import the shape files created the indexes with the wrong bounding box initially and I have been slavishly copying the same error ever since.Hackworth

Splitting Data

If the query is for displaying data, you could split up your large polygons using a grid; the pieces would then be very quick to retrieve with an index. You could remove the outlines so the features still look contiguous.

Most commercial GIS packages will have tools to split one polygon dataset by another. Search for tools that do intersections.

If you are using OpenSource then have a look at QGIS and http://www.ftools.ca which "perform geoprocessing operations including intersections, differencing, unions, dissolves, and clipping." I've not used the latter myself.

Have a look at: http://postgis.refractions.net/docs/ch04.html#id2790790 for why large features are bad.

Filter and Intersects

There is more on the Filter clause here - Link

Spatial Indexes

Something else to check is that the spatial index is actually being used in the query plan. You may have to force the query to use the index with the WITH clause:

Link
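As a sketch, using the table, index name, and search polygon from the question (the name in the INDEX hint must match your actual spatial index):

```sql
SELECT "ID", "CODE", "geom".STAsBinary() AS "geom"
FROM "dbo"."ContASplit" WITH (INDEX(contasplit_sidx))  -- force the spatial index
WHERE "geom".Filter(
  geometry::STGeomFromText('POLYGON ((-142.03193662573682 59.53396984952896,
    -142.03193662573682 59.88928136451884,
    -141.32743833481925 59.88928136451884,
    -141.32743833481925 59.53396984952896,
    -142.03193662573682 59.53396984952896))', 4326)
) = 1
```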

More details on indexes below:

Link

Also try running sp_help_spatial_geometry_index for your data to see what settings to use for your spatial index

http://msdn.microsoft.com/en-us/library/cc627426.aspx

Running this SP with some test geometry produces all sorts of statistics to try and tailor your index to your data. A full list of properties is at http://msdn.microsoft.com/en-us/library/cc627425.aspx

These include values such as:

  • CellArea_To_BoundingBoxArea_Percentage_In_Level1
  • Number_Of_Rows_Selected_By_Primary_Filter
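For example, against the table and index from the question (a sketch; substitute a sample geometry that resembles a typical query window for your application):

```sql
-- Sample query window: the search rectangle from the question
DECLARE @qs geometry = geometry::STGeomFromText(
  'POLYGON ((-142.03193662573682 59.53396984952896,
    -142.03193662573682 59.88928136451884,
    -141.32743833481925 59.88928136451884,
    -141.32743833481925 59.53396984952896,
    -142.03193662573682 59.53396984952896))', 4326);

EXEC sp_help_spatial_geometry_index
    @tabname = 'dbo.ContASplit',
    @indexname = 'contasplit_sidx',
    @verboseoutput = 1,
    @query_sample = @qs;
```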

Messed Up Geometry

From the results of sp_help_spatial_geometry_index it looks like you may have issues with the geometry itself rather than the spatial index.

The Base_Table_Rows count looks to be a bug - http://connect.microsoft.com/SQLServer/feedback/details/475838/number-of-rows-in-base-table-incorrect-in-sp-help-spatial-geography-index-xml - so it may be worth recreating the table / database and trying the index from scratch.

Total_Number_Of_ObjectCells_In_Level0_In_Index 60956 is a lot of features to return at level 0. It is likely they are either outside the spatial index extent or NULL. The intersect is then run (Number_Of_Times_Secondary_Filter_Is_Called 60956) on all of these features, which would explain why it is slow. Even though the docs claim no performance hit for NULL features, I believe the engine still has to look up those records, even if no intersect is performed.

NULL and empty instances are counted at level 0 but will not impact performance. Level 0 will have as many cells as NULL and empty instances at the base table.

The Primary_Filter_Efficiency of 0.003281055 I believe indicates 0.03% efficiency!

A few things to try:

  1. Anything strange from SELECT * FROM sys.spatial_indexes?

  2. The MakeValid statement:

    UPDATE MyTable SET GeomFieldName = GeomFieldName.MakeValid()

  3. Reset / double check SRID:

    UPDATE MyTable SET GeomFieldName.STSrid = 4326

  4. Add in some fields to show the extents of your features. This may highlight issues / NULL geometries.

    ALTER TABLE MyTable ADD MinX AS (CONVERT(int,GeomFieldName.STEnvelope().STPointN((1)).STX,0)) PERSISTED
    ALTER TABLE MyTable ADD MinY AS (CONVERT(int,GeomFieldName.STEnvelope().STPointN((1)).STY,0)) PERSISTED
    ALTER TABLE MyTable ADD MaxX AS (CONVERT(int,GeomFieldName.STEnvelope().STPointN((3)).STX,0)) PERSISTED
    ALTER TABLE MyTable ADD MaxY AS (CONVERT(int,GeomFieldName.STEnvelope().STPointN((3)).STY,0)) PERSISTED

Cnut answered 30/5, 2010 at 19:17 Comment(16)
The query is being generated by GeoServer that I am using to generate tiles that are rendered using Bing maps. Splitting up the polygons was one thing I had already considered doing, but had put off so far as I know it will take me a day or two to perfect the query/build a tool to do it. The query time for Filter is 50% lower than the query time for STIntersects.Hackworth
I am still puzzled as to why the Filter returns quite so many records. There are lots of large polygons in the data set, but the description of how the SQL indexes work would lead me to expect a few thousand returned by the filter, not 60,000+ (60% of the whole data set).Hackworth
I just did a quick query to see exactly how many truly huge polygons I have. Looking at the area of the envelope of each polygon, only 3 of the 106,000 polygons occupy more than 50% of the globe and only 23 of them are bigger than one eight of the globe.Hackworth
I realised I misunderstood your question about the query time. Combining a Filter with a STIntersects takes the same time as an STIntersects by itself, whether you combine them with a sub query or adding both to the WHERE clause. This isn't a surprise as what I have read says that STIntersects does a Filter internally anyway when there is a spatial index available. Just doing a Filter take 6 seconds, anything involving adding STIntersects takes 12 seconds.Hackworth
I've added links to how the Filter works, and splitting the polygons - there should be no need to develop your own tools to do this.Cnut
Thanks for the links, I didn't know QGIS could do thinks like that (I've used it a bit).Hackworth
However I still don't understand why the index doesn't cut down the results from Filter more. If I select "Low" for the number of cells in the top level grid there will be 16 cells in the grid. My inspection of the data suggests only 23 of my polygons should appear in more than two of those cells. My target of the query will only appear in one of the cells. So why does the Filter return 60,000 rows ?Hackworth
The Filter may be using the bounding box of the polygon to check if it is in an index cell. If you drew a box round the extent of every feature would this account for the difference?Cnut
No, I don't think so. The way I calculated how many huge polygons I have is by doing .STEnvelope().STArea() on each one. That was how I got the result that only 23 cover more than 1/8th of the globe.Hackworth
I have done the splitting, ended up doing it myself, couldn't see how QGIS could do it (plus it is hugely slow on this data set). It didn't give me the results I expected, see the extra detail I added to the question.Hackworth
I have tried a huge range of settings for the index settings. Yes I have tried MEDIUM and 16 cells per object, as that is the default that SQL Management Studio uses.Hackworth
I have already tried running sp_help_spatial_geometry_index before and then tried tweaking the index settings. But whatever I did, even going from one extreme to the other still resulted in about 60,000 rows being returned by .Filter()Hackworth
Nothing strange looking in sys.spatial_indexes. MakeValid has already been used on this data. I checked the Srids, they were all correct, I reset them for good measure. I used your extra fields to check the geometries, none of them are null and they all fit within the bounds of my index.Hackworth
Thanks for your efforts, I can only assume SQL Server really doesn't like my data or I have hit some sort of bug.Hackworth
One final thing - have you tried with the GEOGRAPHY type? As you are using 4326 and world-wide data it may be more suitable. Just make sure no features cross the equator (you may need to splt again)Cnut
I just tried to give geography a go. Unfortunately one or more (probably lots) of my geometry rows aren't valid for geography. I haven't looked to see which of the various rules for geography that they break, that would be a big effort. Also, I don't think the GeoServer SQL Server plugin supports geography data types yet. None of my features cross the equator since I split them using the 4x4 grid.Hackworth

I too have found it very difficult to "GUESS" what an appropriate spatial index will be for a particular table of geometries. I tried making more educated guesses using the sp_help_spatial_geometry_index stored procedure, but all that did was tell me how poorly my spatial index was performing after each "GUESS". Even if I limited my options by only considering 2-8 CELLS_PER_OBJECT, that alone gives 567 permutations (3 grid densities chosen for 4 levels = 81, multiplied by 7 CELLS_PER_OBJECT options).

So I decided to let SQL Server do the experimenting for me and give me some empirical evidence. I created a stored procedure that spins through the permutations, rebuilding the spatial index on a spatial table for each one. It then tests the query performance of each permutation using two supplied geometry instances: one that covers the entire data set and another that covers a smaller portion of it. The proc runs STIntersects() 4 times on each instance and records the results in a table.

You can then query the results table to find out which spatial index performed best on your particular data set. Give it a try and let me know if you have any suggested improvements or observations.

Create the proc using this https://gist.github.com/anonymous/5322650. Then set up an execution statement using this example:

/* set up some strings to be used to create geometry instances when our test spatial queries run */ 
DECLARE @ada VARCHAR(MAX) 
SET @ada = 'GEOMETRY::STGeomFromText(''POLYGON ((2422068 527322, 2422068 781170, 2565405 781170, 2565405 527322, 2422068 527322))'', 0)'
DECLARE @mer VARCHAR(MAX) 
SET @mer = 'GEOMETRY::STGeomFromText(''POLYGON ((2451235 696087, 2451235 721632, 2473697 721632, 2473697 696087, 2451235 696087))'', 0)'
DECLARE @mer1 VARCHAR(MAX) 
SET @mer1 = 'GEOMETRY::STGeomFromText(''POLYGON ((244386 712283, 2443866 717980, 2454872 717980, 2454872 712283, 244386 712283))'', 0)'
DECLARE @mer2 VARCHAR(MAX) 
SET @mer2 = 'GEOMETRY::STGeomFromText(''POLYGON ((2434259 687278, 2434259 701994, 2449657 701994, 2449657 687278, 2434259 687278))'', 0)'


EXEC gis.sp_tune_spatial_index 'PARCEL_ADA', 'S104_idx', 2, 8, @ada, @mer1 
GO

NOTE: Obviously, rebuilding a spatial index 567 times will take a long time. Kick it off from the command line or just let it run while you do other things. If it is a dataset you are going to use often and the geometries remain similar, it will be worth the time it takes to run the proc. The results table shows performance in milliseconds.

Miyokomizar answered 5/4, 2013 at 20:13 Comment(3)
Thanks for the answer! Rather than use Dropbox (which is blocked by many employers), is it possible to provide the gist of the stored procedure here? Or, you might consider using Github Gist (gist.github.com) as an alternative hosting location.Somnambulation
Thanks for the suggestion. Just updated the hyperlink to that site.Miyokomizar
@GreenGeo brilliant! I hacked up your stored proc to include some more output: geonet.esri.com/blogs/HackingArcSDE/2015/07/07/…Haemophilic
