gIntersection on very large spatial objects
Long-time reader, first-time poster.

I'm attempting to perform a gIntersection() on two very large SpatialPolygonsDataFrame objects. The first is all US counties; the second is a 240-row × 279-column grid, represented as a series of 66,960 polygons.

I successfully ran this by just using Pennsylvania and the piece of the grid that overlaps PA:

gIntersection(PA, grid, byid=TRUE)

I tried to run this overnight for the whole U.S. and it was still running this morning with a 10 GB(!) swap file on my hard drive and no evidence of progress. Am I doing something wrong, or is this normal behavior, and I should just do a state-by-state loop?

Thanks!

Find answered 4/6, 2013 at 13:11
You want to know which states overlap which polygons? – Faun
This operation is part of a calculation to get population in each grid square. I was thinking I would do the intersection to get a set of small polygons, get the % of each county that each polygon represents, and then multiply population by % of county to get population in each grid square. I'm looking into over(), since that may do the trick as well. – Find
It looks like over() won't do what I'm asking, since it can't do calculations weighted by area of overlap. I've got ArcGIS running the intersect now, and I'll start coding the intersection loop in R, and we'll see who wins! – Find
If anybody is interested, ArcGIS finished the intersection in about 2 minutes. – Find
I look forward to the R version. :) It's definitely better to do it on a per-object basis (i*j) in my experience, in terms of speed but also in terms of topology errors (I am still trying to track down something that doesn't work when intersecting whole layers). You might find that ArcGIS does the job by generalizing first (almost certainly it does some kind of high-level sort first to discard simple non-overlaps). I'm keen to compare with Manifold as well if you can share the data or recreate it. – Aeromedical
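The apportionment arithmetic described in the comments (intersect, take each piece's share of its county's area, then weight the county population by that share) can be sketched with plain data frames, leaving the spatial intersection itself aside. All IDs, areas, and populations below are invented for illustration:

```r
# counties: total population and total area (IDs and numbers invented)
counties <- data.frame(adminID = c("A", "B"),
                       pop = c(1000, 2000),
                       admin_area = c(10, 20))

# intersection pieces: one row per county x grid-cell overlap, with its area
pieces <- data.frame(adminID = c("A", "A", "B"),
                     cellID = c("c1", "c2", "c1"),
                     polyarea = c(4, 6, 10))

intdf <- merge(pieces, counties, by = "adminID")
intdf$popinpoly <- intdf$pop * (intdf$polyarea / intdf$admin_area)

# population in each grid cell, assuming uniform density within each county
cellpop <- aggregate(popinpoly ~ cellID, data = intdf, FUN = sum)
```

Here cell c1 gets 4/10 of county A's 1000 people plus 10/20 of county B's 2000, i.e. 1400, and c2 gets the remaining 600 from county A.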

A little later than I hoped, but here's the function I ended up using for my task related to this. It could probably be adapted to other applications.

@mdsumner was right that a high-level operation to discard non-intersects sped this up greatly. Hopefully this is useful!

library("sp")
library("rgeos")
library("plyr")

ApportionPopulation <- function(AdminBounds, poly, Admindf) {
  # I originally wrote this function to total the population that lies within
  # each polygon in a SpatialPolygons object. AdminBounds is a SpatialPolygons
  # object for whatever administrative area you're working with; poly is the
  # SpatialPolygons object you want to total population (or whatever variable
  # of your choice) across; and Admindf is a data frame that has data for each
  # polygon inside the AdminBounds SpatialPolygons object.

  # the AdminBounds have the administrative ID code as feature IDs. I set that up using spChFID()

  # start by trimming out areas that don't intersect

  AdminBounds.sub <- gIntersects(AdminBounds, poly, byid=TRUE) # test for areas that don't intersect
  AdminBounds.sub2 <- colSums(AdminBounds.sub) # count, for each administrative polygon, how many grid polygons it intersects
  AdminBounds.sub3 <- AdminBounds[AdminBounds.sub2 > 0] # keep only the ones that actually intersect

  # perform the intersection. This takes a while since it also calculates area and other things, which is why we trimmed out irrelevant areas first

  int <- gIntersection(AdminBounds.sub3, poly, byid=TRUE) # intersect the polygon and your administrative boundaries

  intdf <- data.frame(intname=names(int)) # make a data frame for the intersected SpatialPolygon, using names from the output list from int
  intdf$intname <- as.character(intdf$intname) # convert the name to character
  splitid <- strsplit(intdf$intname, " ", fixed=TRUE) # split the names
  splitid <- do.call("rbind", splitid) # rbind those back together
  colnames(splitid) <- c("adminID", "donutshpid") # now you have the administrative area ID and the polygonID as separate variables in a dataframe that correspond to the int SpatialPolygon.
  intdf <- data.frame(intdf, splitid) # make that into a dataframe
  intdf$adminID <- as.character(intdf$adminID) # convert to character
  intdf$donutshpid <- as.character(intdf$donutshpid) # convert to character. In my application the shape I'm using is a series of half-circles

  # now you have a dataframe corresponding to the intersected SpatialPolygon object

  intdf$polyarea <- sapply(int@polygons, function(x) {x@area}) # get area from the polygon SP object and put it in the df
  intdf2 <- join(intdf, Admindf, by="adminID") # join together the two dataframes by the administrative ID
  intdf2$popinpoly <- intdf2$pop * (intdf2$polyarea / intdf2$admin_area) # calculate the proportion of the population in the intersected area that is within the bounds of the polygon (assuming the population is evenly distributed within the administrative area)
  intpop <- ddply(intdf2, .(donutshpid), summarize, popinpoly=sum(popinpoly)) # sum population lying within each polygon

  # maybe do other final processing to get the output in the form you want

  return(intpop) # done!
}
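As a small illustration of the name-splitting step above: gIntersection(..., byid=TRUE) names each output polygon by pasting the two input feature IDs together with a space, so strsplit() recovers them. The IDs below are hypothetical:

```r
# hypothetical "adminID donutshpid" names as produced by gIntersection(..., byid = TRUE)
intnames <- c("42003 1017", "42003 1018", "42005 1017")
splitid <- do.call("rbind", strsplit(intnames, " ", fixed = TRUE))
colnames(splitid) <- c("adminID", "donutshpid")
splitid[, "adminID"]  # "42003" "42003" "42005"
```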
Find answered 17/7, 2013 at 15:25

I found the sf package to be superior for this:

library(sf)

out <- st_intersection(grid, polygons)

gIntersection was locking up my computer for hours and required trimming or cycling through individual polygons; st_intersection from the sf package runs my data in seconds.

st_intersection also automatically merges the data frames of both inputs.

Thanks to Grant Williamson at University of Tasmania for the vignette: https://atriplex.info/blog/index.php/2017/05/24/polygon-intersection-and-summary-with-sf/
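A minimal self-contained sketch of the sf workflow, using two toy squares (all coordinates and attribute names here are invented): st_intersection() carries the attributes of both inputs into the result, so the area weighting reduces to one line.

```r
library(sf)

# toy data: one 2x2 "county" with a population, and one 2x2 grid "cell" offset by (1,1)
county <- st_sf(adminID = "A", pop = 1000,
                geometry = st_sfc(st_polygon(list(rbind(
                  c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))))))
cell <- st_sf(cellID = "c1",
              geometry = st_sfc(st_polygon(list(rbind(
                  c(1, 1), c(3, 1), c(3, 3), c(1, 3), c(1, 1))))))

piece <- st_intersection(county, cell)  # attributes of both inputs are carried over

# population apportioned by the overlap's share of the county's area
piece$popinpiece <- piece$pop * as.numeric(st_area(piece)) / as.numeric(st_area(county))
```

The overlap is a 1×1 square, a quarter of the county, so piece$popinpiece comes out at 250.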

Pep answered 27/3, 2019 at 17:58
I just found the same. I ran a nearly identical operation (same data source and everything) – gIntersection took 18 hours and refused to rbind() or bind_rows() to the output; st_intersection() did it in about a minute. The sf package is a work of art. – Find

You could probably get your answer faster using rasterize in the raster package, with your grid as a raster. It has an argument for finding the amount of polygon overlap to a cell.

 ?rasterize
 getCover: logical. If 'TRUE', the fraction of each grid cell that is
      covered by the polygons is returned (and the values of
      'field', 'fun', 'mask', and 'update' are ignored). The fraction
      covered is estimated by dividing each cell into 100 subcells
      and determining presence/absence of the polygon in the center
      of each subcell.

It doesn't look like you get to control the number of subcells, though that probably wouldn't be hard to open up.
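The 100-subcell estimate described in the help text is easy to reproduce in base R for a single cell. Here the "polygon" is just a rectangle covering 37% of the unit cell, so the true cover is 0.37 and the 10 × 10 subcell-centre estimate comes out slightly high at 0.40:

```r
# centres of a 10 x 10 grid of subcells inside the unit cell [0,1] x [0,1]
centres <- expand.grid(x = (1:10 - 0.5) / 10, y = (1:10 - 0.5) / 10)

# "polygon": the rectangle [0, 0.37] x [0, 1], whose true cover of the cell is 0.37
inside <- centres$x <= 0.37

frac <- mean(inside)  # subcell estimate of the cover fraction: 0.40 (vs the true 0.37)
```

The gap between 0.40 and 0.37 is exactly the discretization error that a finer subcell grid would shrink.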

Aeromedical answered 4/6, 2013 at 20:22
Thanks! That looks like it would work if I was able to actually rasterize this dataset. The grid dataset here is not quite a perfect grid (it's output from an atmospheric chemistry model that does things volumetrically, so they're not quite perfect squares). The grid may have to stay polygons, but I'm not sure. I actually ended up generating this grid from a set of points using Voronoi polygons. As is probably typical for a first question, important information keeps trickling out... – Find
Ok, that's different then. Maybe it's enough to approximate it by creating a point grid (in an equal-area projection) and overlaying that with both the "grid" and the polygons, then tabulating the points by polygon/"grid" to give an area proxy. – Aeromedical
If the grid were smaller it would work, but the geographical area is so large that spherical geometry starts getting in the way. On your other point: in a week or two I will have to do a similar operation that I absolutely can't do in ArcGIS, since I need to loop it. When I work out my solution, I'll post it! – Find
