Improve speed/use of gDistance function by using parallel processing and/or plyr/dplyr?
I have a bunch of locations for each of about 1,000 individuals. The total dataset used to be around 2.5 million observations, and my processing script took about 20 hours to run. Now, however, I have 24 million observations, so I figure I need to clean up my code and perhaps use parallel processing if I can.

For each point, I want to find the closest polygon (most points are not in a polygon) and the distance to that polygon. The points are mostly observations at sea, and the polygons are the coastal (U.S.) counties nearest the points.

I have been doing this using the gDistance function in the rgeos package, running a series of loops (I know, I know) to break up the processing by individual. I have spent a lot of time trying to figure out how to move this into plyr/dplyr syntax but can't quite get it. Part of my issue, I assume, has to do with my objects being of class SpatialPointsDataFrame and SpatialPolygonsDataFrame.

library(sp)
library(rgeos)
library(plyr)
#  Create SpatialPointsDataFrame
#  My actual dataset has 24 million observations
my.pts <- data.frame(LONGITUDE=c(-85.4,-84.7,-82.7,-82.7,-86.5,-88.9,-94.8,-83.9,-87.8,-82.8),
             LATITUDE=c(30.0,29.9,27.5,28.5,30.4,26.1,29.3,28.0,29.4,27.8),
             MYID=c(1,1,2,2,2,2,3,4,4,4),
             INDEX=1:10)
coordinates(my.pts) <- c("LONGITUDE","LATITUDE")

#  Create two polygons in a SpatialPolygonsDataFrame
#  My actual dataset has 71 polygons (U.S. counties)
x1 <- data.frame(x=c(-92.3, -92.3, -90.7, -90.7, -92.3, -92.3),
                 y=c(27.6, 29.4, 29.4, 27.6, 27.6, 27.6))
x1 <- Polygon(rbind(x1, x1[1,]))

x2 <- data.frame(x=c(-85.2, -85.2, -83.3, -83.2, -85.2, -85.2),
                 y=c(26.4, 26.9, 26.9, 26.0, 26.4, 26.4))
x2 <- Polygon(rbind(x2, x2[1,]))

poly1 <- Polygons(list(x1), "poly1")
poly2 <- Polygons(list(x2), "poly2")
myShp <- SpatialPolygons(list(poly1, poly2), 1:2)
sdf <- data.frame(ID=c(1,2))
row.names(sdf) <- c("poly1", "poly2")
myShp <- SpatialPolygonsDataFrame(myShp, data=sdf)

#  I have been outputting my results to a list. With this small sample,
#  it's easy to just put everything into the object county.vec, but I
#  worry that a 24 million x 71 object would not be feasible. The
#  non-loop version shows the output I've been getting more easily.

COUNTY.LIST <- list()
county.vec <- gDistance(my.pts, myShp, byid=TRUE)
COUNTY.LIST[[1]] <- apply(county.vec, 2, min)
COUNTY.LIST[[2]] <- apply(county.vec, 2, which.min)
COUNTY.LIST[[3]] <- my.pts$INDEX

# I have been putting it into a loop so that county.vec gets discarded after each iteration.
# Seems like this could be done using dlply perhaps? And then I would have the power of parallel processing?
idx <- unique(my.pts$MYID)
COUNTY.LIST <- list()
for(i in 1:length(idx)){
    COUNTY.LIST[[i]] <- list()
    county.vec <- gDistance(my.pts[my.pts$MYID==idx[i],], myShp, byid=TRUE)
    COUNTY.LIST[[i]][[1]] = apply(county.vec, 2, min)
    COUNTY.LIST[[i]][[2]] = apply(county.vec, 2, which.min)
    COUNTY.LIST[[i]][[3]] <- my.pts$INDEX[my.pts$MYID==idx[i]]
    rm(county.vec)
}
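The per-ID loop above can also be parallelized directly with the base parallel package rather than plyr. This is only a sketch under the assumption that my.pts and myShp exist as constructed above; I have not benchmarked it at the 24-million-point scale.

```r
# Sketch: parallelize the per-MYID loop with the base 'parallel' package.
# Assumes my.pts and myShp exist as constructed above.
library(parallel)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(rgeos))           # load rgeos on every worker
clusterExport(cl, c("my.pts", "myShp"))    # ship the spatial objects over

idx <- unique(my.pts$MYID)
COUNTY.LIST <- parLapply(cl, idx, function(id) {
  pts <- my.pts[my.pts$MYID == id, ]
  d   <- gDistance(pts, myShp, byid = TRUE)
  # keep only the per-point minimum, not the full distance matrix
  list(min.dist  = apply(d, 2, min),
       which.cty = apply(d, 2, which.min),
       index     = pts$INDEX)
})

stopCluster(cl)
```

Note that clusterExport copies my.pts to every worker, which may itself be costly with 24 million points; splitting the data before exporting would avoid that.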

dlply(my.pts,.(MYID),gDistance(my.pts, myShp, byid=TRUE),.parallel=TRUE)
> dlply(my.pts,.(MYID),gDistance(my.pts, myShp, byid=TRUE))
Error in eval.quoted(.variables, data) : 
envir must be either NULL, a list, or an environment.

# I suspect this error is because my.pts is a SpatialPointsDataFrame rather than a plain data frame.  I also recognize that my function call probably isn't right, but first things first.

# I tried another way to reference the MYID field, more inline with treatment of S4 objects...
dlply(my.pts,my.pts@data$MYID,gDistance(my.pts, myShp, byid=TRUE),.parallel=TRUE)

# It yields the same error.
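One workaround I can imagine, since dlply() wants a plain data.frame: split on the @data slot and subset the spatial object inside the function. A sketch, assuming my.pts and myShp from above (and, for .parallel=TRUE, a registered doParallel backend):

```r
# Sketch: dlply() over the @data slot, subsetting the spatial object inside.
# Assumes my.pts and myShp from above; register doParallel for .parallel=TRUE.
library(plyr)

res <- dlply(my.pts@data, .(MYID), function(d) {
  gDistance(my.pts[my.pts$INDEX %in% d$INDEX, ], myShp, byid = TRUE)
})
```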

I'd be grateful for any suggestions folks might have.

Timtima answered 19/12, 2014 at 0:33 Comment(2)
Can dlply() handle a SpatialPointsDataFrame? I don't know. — Kozak
I'm guessing that it can't, but I shudder to make such an assertion on SO! People always come up with such clever responses, though; I wouldn't be surprised if there was a workaround or just a much more appropriate way to go about it. — Timtima
This is an old question, but maybe my simple method helps others.
It uses the parallel package (via doParallel). I'm writing a general example, so it will not run as-is on the data from the question above.

set.seed(1)
# Create the clusters
library(doParallel)
cl <- makeCluster(detectCores())
registerDoParallel(cl)
# Export the environment variables to each cluster
clusterExport(cl, ls())
# Load the library "rgeos" on each cluster
clusterEvalQ(cl, library(rgeos))
# Split the data
ID.Split <- clusterSplit(cl, unique(poly1$ID))
# Define a function that calculates the distance from one ID's geometry to poly2
a <- function(x) gDistance(spgeom1 = poly1[x, ], spgeom2 = poly2, byid = TRUE)
# Run the function on each cluster
system.time(m <- clusterApply(cl, x = ID.Split, fun = a))
# Close the cluster
stopCluster(cl)
# Merge the results
output <- do.call("cbind", m)

I hope this helps.
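Applied to the question's objects, the same pattern might look like the sketch below, splitting the points across workers (since the question has 24 million points but only 71 polygons) and keeping only the nearest county per point. This assumes my.pts and myShp as built in the question.

```r
# Sketch: the clusterApply pattern from above, applied to the question's
# my.pts / myShp, splitting point indices across workers and returning
# only the nearest county and its distance for each point.
library(doParallel)

cl <- makeCluster(detectCores())
registerDoParallel(cl)
clusterExport(cl, c("my.pts", "myShp"))
clusterEvalQ(cl, library(rgeos))

idx.split <- clusterSplit(cl, seq_along(my.pts$INDEX))
dist.fun  <- function(i) {
  d <- gDistance(my.pts[i, ], myShp, byid = TRUE)
  # one column per point: keep only the nearest polygon and its distance
  data.frame(INDEX  = my.pts$INDEX[i],
             COUNTY = apply(d, 2, which.min),
             DIST   = apply(d, 2, min))
}
m <- clusterApply(cl, x = idx.split, fun = dist.fun)
stopCluster(cl)
output <- do.call("rbind", m)
```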

Obsess answered 11/8, 2015 at 6:56 Comment(7)
Error in checkForRemoteErrors(val) : 8 nodes produced errors; first error: object 'poly1' not found — Arraign
poly1 and poly2 are the two geometries between which you want to calculate the distance. Are you sure you are replacing the variable names with yours? — Obsess
Thank you for the reply. Yes, I applied your methodology to my code. Basically I need to load the objects onto each of the clusters. The problem is that I am using a very big spatial object. How did you overcome the issue? — Arraign
If I were you... I would load each polygon and do some coarse clustering first, so you reduce the number of comparisons you have to do. I mean something like overlaying circles (benkeers.files.wordpress.com/2012/04/circles.jpg). Make sense? Then you just check which objects are inside the circles and only compare those. Because you are overlapping the circles, at the end you effectively compare all to all. Just an idea. — Obsess
I already did the size reduction of the SpatialLinesDataFrame. I am doing a slightly different task than you, which is finding the closest line to each point. This is more memory- and compute-intensive because for each point gDistance has to compute N distances (where N is the total number of lines in the SpatialLinesDataFrame) and then find the line that minimizes the distance to the point. — Arraign
Again, the only solution I can think of is to limit the search area for each point. So, first define a circle or radius of investigation, and then compare only the lines to the points inside those areas. What you do is overlap the radii, so you compare some points more than once... but it's far less than everything-to-everything. You just need sp::over, rgeos::gIntersection, or raster::intersect. — Obsess
@Obsess thanks for this. Lost some code parallelizing gIntersection and this solution is way more efficient than my previous one. Didn't even think to use clusterApply here. To return only the nearest spgeom I added d <- gDistance(spdf[x,], tiger, byid=T); a <- apply(d, 1, function(x) order(x, decreasing=F)[1]) to your function a(). — Binge
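Folding the last commenter's tweak into the answer's worker function might look like the sketch below. The objects spdf (points) and tiger (target geometries) are the commenter's and are assumptions here; note gDistance(spdf[x,], tiger, byid=TRUE) returns one column per point, so applying over columns gives the nearest geometry per point, matching the question's use of apply(county.vec, 2, ...).

```r
# Sketch: return only the nearest geometry per point, not the full matrix.
# 'spdf' and 'tiger' are the commenter's objects (assumed, not defined here).
a <- function(x) {
  d <- gDistance(spdf[x, ], tiger, byid = TRUE)
  # columns of d correspond to the points in spdf[x, ]
  data.frame(nearest = apply(d, 2, which.min),  # row index of closest geometry
             dist    = apply(d, 2, min))        # distance to it
}
```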
