Spatial matching of big datasets
I have a dataset with about 100000 points and another dataset with roughly 3000 polygons. For each of the points I need to find the nearest polygon (spatial match). Points inside a polygon should match to that polygon.

Computing all-pairs distances is feasible, but takes a bit longer than necessary. Is there an R package that will make use of a spatial index for this kind of matching problem?

I am aware of the sp package and its over function, but the documentation doesn't say anything about spatial indexes.
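For concreteness, the brute-force approach I mean looks roughly like the sketch below (toy stand-ins for my real data; the object names are just illustrative):

```r
library( sp )
library( rgeos )

# Toy stand-ins: a small polygon grid and some random points
polys <- as.SpatialPolygons.GridTopology( GridTopology( c(1,1), c(1,1), c(5,5) ) )
pts   <- SpatialPoints( matrix( runif( 20, 1, 5 ), ncol = 2 ) )

# All-pairs distances: a dense polygons x points matrix, then a column-wise
# minimum. This is O(points * polygons) and uses no spatial index.
d <- gDistance( pts , polys , byid = TRUE )
nearest_poly <- apply( d , 2 , which.min )
```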

Owens answered 13/4, 2013 at 19:46 Comment(4)
What do you mean by "spatial index"?Calfee
@RomanLuštrik: I mean a data structure like a kd-tree, see e.g. en.wikipedia.org/wiki/Spatial_index#Spatial_index. This data structure would accelerate lookup in the 3000-polygon dataset.Owens
the rgeos package is usually your best bet for geometry operations. I'm pretty sure it uses spatial indexes when appropriate. Based on the GEOS C library.Renita
Related: gis.stackexchange.com/q/396/4630Elephus
You could try the gDistance function in the rgeos package for this. Have a look at the example below, which I reworked from this old thread. Hope it helps.

require( rgeos )
require( sp )

# Make some polygons
grd <- GridTopology(c(1,1), c(1,1), c(10,10))
polys <- as.SpatialPolygons.GridTopology(grd)

# Make some points and label with letter ID
set.seed( 1091 )
pts = matrix( runif( 20 , 1 , 10 ) , ncol = 2 )
sp_pts <- SpatialPoints( pts )
row.names(pts) <- letters[1:10]

# Plot
plot( polys )
text( pts , labels = row.names( pts ) , col = 2 , cex = 2 )
text( coordinates(polys) , labels = row.names( polys ) , col = "#313131" , cex = 0.75 )

[plot: the 10×10 polygon grid with the lettered points overlaid]

# Find which polygon each point is nearest to
cbind( row.names( pts ) , apply( gDistance( sp_pts , polys , byid = TRUE ) , 2 , which.min ) )
#   [,1] [,2]
#1  "a"  "86"
#2  "b"  "54"
#3  "c"  "12"
#4  "d"  "13"
#5  "e"  "78"
#6  "f"  "25"
#7  "g"  "36"
#8  "h"  "62"
#9  "i"  "40"
#10 "j"  "55"
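If many of your points fall inside a polygon, you can also skip the distance computation for those and only run gDistance on the leftovers. A sketch of that two-step variant, reusing the objects above (untested on your data; in this toy grid every point lands inside a cell, so the second step is a no-op here):

```r
# Step 1: containment test -- over() returns, for each point, the index of
# the polygon it falls in, or NA if it lies in no polygon
match_idx <- over( sp_pts , polys )

# Step 2: compute distances only for the points outside every polygon
todo <- is.na( match_idx )
if( any( todo ) ) {
  d <- gDistance( sp_pts[ todo ] , polys , byid = TRUE )
  match_idx[ todo ] <- apply( d , 2 , which.min )
}
match_idx  # containing (or nearest) polygon index per point
```

On real data only the outside points pay for the distance matrix, which is where the 16-minute run time goes.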
Zetland answered 13/4, 2013 at 22:20 Comment(4)
@Owens any help or is this too slow for your large datasets?Coast
Took a bit of effort to install rgeos on the most "recent" Debian, see github.com/rundel/rgeos/issues/1. Will try later tonight.Owens
Well, the method you suggested still computes all-pairs distances. Takes 16 minutes for my data -- not too slow, but still. A workaround is to use first gContains and then gDistance on the remaining (few) records.Owens
I think the OP tries to find something more clever than trying all combinations of distances. Maybe some optimized matching operation based on nearest distance.Elephus
I don't know anything about R, but I'll offer one possible solution using PostGIS. You may be able to load the data into PostGIS and process it there faster than in R alone.

Given two tables, planet_osm_point (80k rows) and planet_osm_polygon (30k rows), the following query executes in around 30 seconds:

create table knn as 
select 
    pt.osm_id point_osm_id, 
    poly.osm_id poly_osm_id
from planet_osm_point pt, planet_osm_polygon poly
where poly.osm_id = (
    select p2.osm_id 
    from planet_osm_polygon p2 
    order by pt.way <-> p2.way limit 1
);

The result is an approximation: in the PostGIS version I used, the `<->` operator orders by the distance between the point and the centre of each polygon's bounding box, not the polygon itself. (In PostGIS 2.2 and later, `<->` computes true distance for geometries.) With a bit more work, the query can be adapted to re-rank a small candidate set by exact distance, although it won't execute as quickly.

Vardar answered 20/11, 2013 at 13:10 Comment(1)
Thanks for the PostGIS code, but I'm really interested if R has similar capabilities (especially w.r.t. run time).Owens
