Spatial join of 100 million points using R
Asked Answered
C

0

6

I have a dataset that contains 100 million points. I need to spatially join them to the area they intersect with. I'm using sf and st_join, but unsurprisingly when I try to perform this operation my machine runs out of memory.

To get round this problem I have split my data into more manageable sizes and loop through them to get result I'm after. However, I wanted to see if there is a cleverer way to achieve my end goal (getting an Area ID appended to my original data).

require("data.table")
require("dplyr")
require("sf")

DT <- data.table(X=runif(100000000, min=300000, max=500000),
                 Y=runif(100000000, min=200000, max=400000),
                 UID = 1:100000000,
                 Group = rep(1:50, each=100000000/50))
                             
Polygon <- st_read("https://ons-inspire.esriuk.com/arcgis/services/Administrative_Boundaries/Regions_December_2019_Boundaries_EN_BFE/MapServer/WFSServer?request=GetCapabilities&service=WFS") %>%
        select(-gml_id) %>%
        st_cast(., "GEOMETRYCOLLECTION") %>%
        st_collection_extract(., "POLYGON") %>% 
        group_by(across(1:(ncol(.)-1))) %>%
        summarise() %>%
        st_cast("MULTIPOLYGON") %>%
        ungroup()

DT_Split <- split(DT, by=c("Group"))

DT_Joined <- data.table()
    for (i in 1:length(DT_Split)){
    DT_Join <- st_join(st_as_sf(DT_Split[[i]], coords = c("X","Y"),crs=27700), Polygon, join = st_intersects)
    DT_Split_Join <- DT_Join %>% st_drop_geometry %>% as.data.table
    DT_Joined <- rbindlist(list(DT_Joined, DT_Split_Join))
    }
Catboat answered 24/8, 2021 at 13:36 Comment(2)
If this is a one off problem you are likely set as you are; if this is a recurring problem you might be better off with offloading the problem to a PostGIS database - in doing so transform it from a memory constrained to disk space constrained.Escarole
Do you need the join here? Wouldn't it be faster to just run st_contains() (or st_intersects())? On my machine testing with 10 million points takes about 45 seconds and uses ~10GB memory. So if you have access to a server with +100GB memory for 7.5 minutes you should be good to go. Of course this only gives you a list with one entry per polygon listing the points that are contained, but that seems to be sufficient for your description of the problem.Miley

© 2022 - 2025 — McMap. All rights reserved.