Generating the whole distance matrix at once would be very RAM consuming, while looping over each combination of unique zipcodes would be very time consuming, so let's find a compromise. I suggest chunking the zipcode data.frame into pieces of (for example) 100 rows (with the help of the chunk function from package bit), then calculating the distances between all 44336 zipcodes and those 100 points, filtering by the target distance threshold, and then moving on to the next chunk. In the example below I convert the zipcode data into a data.table to gain some speed and save RAM.
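For reference, bit::chunk() splits a row index range into a list of range objects; each element behaves like a two-element vector holding the first and last row of a chunk, which is why ridx[1]:ridx[2] works in the code below. A minimal illustration, using the same positional call (from, to, by) as in the main code; the exact print format, and possibly the signature, may differ between versions of bit:
library(bit)
# split rows 1..250 into 100-row pieces: expect ranges 1-100, 101-200, 201-250
chunks <- chunk(1, 250, 100)
length(chunks)   # 3
chunks[[1]][1]   # first row of the first chunk, should be 1
chunks[[1]][2]   # last row of the first chunk, should be 100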
library(zipcode)
library(data.table)
library(magrittr)
library(geosphere)
data(zipcode)
setDT(zipcode)
zipcode[, dum := NA] # constant dummy key, we'll need it below for the cartesian join
Just for information, this is the approximate size in RAM of each chunk's intermediate merge result:
merge(zipcode, zipcode[1:100], by = "dum", allow.cartesian = T) %>%
object.size() %>% print(unit = "Mb")
# 358.2 Mb
The code itself.
lapply(bit::chunk(1, nrow(zipcode), 1e2), function(ridx) {
merge(zipcode, zipcode[ridx[1]:ridx[2]], by = "dum", allow.cartesian = T)[
, dist := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2),
matrix(c(longitude.y, latitude.y), ncol = 2))/1609.34 # meters to miles
][dist <= 5 # required distance threshold (miles)
][, dum := NULL]
}) %>% rbindlist -> zip_nearby_dt
zip_nearby_dt # not the whole result! first 10 chunks only
zip.x city.x state.x latitude.x longitude.x zip.y city.y state.y latitude.y longitude.y dist
1: 00210 Portsmouth NH 43.00590 -71.01320 00210 Portsmouth NH 43.00590 -71.01320 0.000000
2: 00210 Portsmouth NH 43.00590 -71.01320 00211 Portsmouth NH 43.00590 -71.01320 0.000000
3: 00210 Portsmouth NH 43.00590 -71.01320 00212 Portsmouth NH 43.00590 -71.01320 0.000000
4: 00210 Portsmouth NH 43.00590 -71.01320 00213 Portsmouth NH 43.00590 -71.01320 0.000000
5: 00210 Portsmouth NH 43.00590 -71.01320 00214 Portsmouth NH 43.00590 -71.01320 0.000000
---
15252: 02906 Providence RI 41.83635 -71.39427 02771 Seekonk MA 41.84345 -71.32343 3.688747
15253: 02912 Providence RI 41.82674 -71.39770 02771 Seekonk MA 41.84345 -71.32343 4.003095
15254: 02914 East Providence RI 41.81240 -71.36834 02771 Seekonk MA 41.84345 -71.32343 3.156966
15255: 02916 Rumford RI 41.84325 -71.35391 02769 Rehoboth MA 41.83507 -71.26115 4.820599
15256: 02916 Rumford RI 41.84325 -71.35391 02771 Seekonk MA 41.84345 -71.32343 1.573050
On my machine it took 1.7 minutes to process 10 chunks, so processing the whole dataset may take 70-80 minutes; not fast, but it may be acceptable. We can increase the chunk size to 200 or 300 rows depending on the available RAM, which should shorten the processing time by a factor of 2 or 3 respectively.
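If you want to experiment with the chunk size, the same pipeline can be wrapped in a small helper; this is only a sketch (the function name and the chunk_size / max_miles arguments are mine, and it assumes the libraries and the dum column from the setup above):
zip_pairs_within <- function(dt, chunk_size = 100, max_miles = 5) {
  # dt must already contain the constant dum column added above
  lapply(bit::chunk(1, nrow(dt), chunk_size), function(ridx) {
    merge(dt, dt[ridx[1]:ridx[2]], by = "dum", allow.cartesian = TRUE)[
      , dist := distGeo(matrix(c(longitude.x, latitude.x), ncol = 2),
                        matrix(c(longitude.y, latitude.y), ncol = 2))/1609.34
    ][dist <= max_miles
    ][, dum := NULL]
  }) %>% rbindlist
}
# e.g. 300-row chunks: about 3x the RAM per chunk, about a third of the runtime
# zip_nearby_dt <- zip_pairs_within(zipcode, chunk_size = 300)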
The drawback of this solution is that the resulting data.table contains "duplicated" rows: both the distance from point A to point B and the distance from B to A are present. This may need some additional filtering.
data.tables should help you do that filtering very efficiently. – Bekelja
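A minimal sketch of that deduplication with data.table, assuming the full zip_nearby_dt computed above; ordering the two zip codes keeps each pair only once:
# zip.x < zip.y keeps one row per unordered pair and also drops the
# zero-distance self matches; use <= if you want to keep those
zip_nearby_unique <- zip_nearby_dt[zip.x < zip.y]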