I have a dataset that contains 100 million points that I need to spatially join to the areas they intersect. I'm using `sf` and `st_join()`, but unsurprisingly my machine runs out of memory when I try to perform this operation.
To get around this problem I have split my data into more manageable chunks and loop through them to get the result I'm after. However, I wanted to see if there is a cleverer way to achieve my end goal (getting an area ID appended to my original data).
```r
library(data.table)
library(dplyr)
library(sf)

# Simulated point data: 100 million points in 50 groups (British National Grid coords)
DT <- data.table(X = runif(100000000, min = 300000, max = 500000),
                 Y = runif(100000000, min = 200000, max = 400000),
                 UID = 1:100000000,
                 Group = rep(1:50, each = 100000000/50))

# Regions of England (December 2019) from the ONS WFS service,
# tidied into one MULTIPOLYGON per region
Polygon <- st_read("https://ons-inspire.esriuk.com/arcgis/services/Administrative_Boundaries/Regions_December_2019_Boundaries_EN_BFE/MapServer/WFSServer?request=GetCapabilities&service=WFS") %>%
  select(-gml_id) %>%
  st_cast("GEOMETRYCOLLECTION") %>%
  st_collection_extract("POLYGON") %>%
  group_by(across(1:(ncol(.) - 1))) %>%
  summarise() %>%
  st_cast("MULTIPOLYGON") %>%
  ungroup()
```
```r
# Join one group at a time to keep memory usage manageable
DT_Split <- split(DT, by = "Group")
DT_Joined <- data.table()
for (i in seq_along(DT_Split)) {
  DT_Join <- st_join(st_as_sf(DT_Split[[i]], coords = c("X", "Y"), crs = 27700),
                     Polygon, join = st_intersects)
  DT_Split_Join <- DT_Join %>% st_drop_geometry() %>% as.data.table()
  DT_Joined <- rbindlist(list(DT_Joined, DT_Split_Join))
}
```
`st_contains()` (or `st_intersects()`)? On my machine, testing with 10 million points takes about 45 seconds and uses ~10 GB of memory. So if you have access to a server with 100+ GB of memory for 7.5 minutes you should be good to go. Of course, this only gives you a list with one entry per polygon listing the points that are contained, but that seems to be sufficient for your description of the problem. – Miley
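The commenter's suggestion can be sketched as follows: calling `st_intersects()` directly returns a sparse index list (one integer vector of matching polygon indices per point), which is much lighter than the full attribute join that `st_join()` performs. This is a minimal sketch on a small sample, assuming `DT` and `Polygon` are defined as in the question; the `AreaIndex` column name is illustrative.

```r
library(data.table)
library(sf)

# Small sample of the points, converted to sf (British National Grid)
DT_sample <- DT[1:1000]
pts <- st_as_sf(DT_sample, coords = c("X", "Y"), crs = 27700)

# Sparse list: element i holds the indices of the polygons point i
# intersects (length 0 if it falls outside all of them)
idx <- st_intersects(pts, Polygon)

# Append the first matching polygon index (or NA) to the original data;
# the index can then be mapped to the region's ID column
DT_sample[, AreaIndex := vapply(idx,
                                function(i) if (length(i)) i[1] else NA_integer_,
                                integer(1))]
```

Running this per chunk and keeping only the integer index avoids carrying the polygon attributes and geometry through every join, which is where much of the memory goes.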