R: Faster way of computing large distance matrix
I am computing a distance matrix between a large number of locations (5,000) on a sphere, using the Haversine distance function.

Here is my code:

library(geosphere)
x <- rnorm(5000)
y <- rnorm(5000)
xy1 <- cbind(x, y)

The distance matrix is computed (and timed) with:

system.time(
  outer(1:nrow(xy1), 1:nrow(xy1),
        function(i, j) distHaversine(xy1[i, 1:2], xy1[j, 1:2]))
)

This takes a long time to execute. Any suggestions on how to reduce the computation time? Thanks.

Lipolysis answered 24/5, 2016 at 7:30 Comment(3)
you can try an alternative implementation. see r-bloggers.com/… – Smilacaceous
@Leo in good conscience, and without meaning to offend, I have to point out that the linked article is terrible! The author uses a for loop to cycle through a vector to repeatedly call a function (distHaversine()) which is already vectorised!! They wrote more code whilst also slowing the speed of execution by about 300X!!! Do not heed this article! You don't call a function 10,000 times when once will do! – Banka
Hi @SimonO'Hanlon, thanks for the heads up. :-) – Smilacaceous
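To illustrate the vectorisation point from the comment above: distHaversine() accepts a matrix of points, so one call can return the distances from one point to all others, and the full matrix needs only one call per row rather than one per pair. A minimal sketch (smaller n, with valid lon/lat coordinates rather than rnorm values):

```r
library(geosphere)

set.seed(1)
xy1 <- cbind(lon = runif(100, -180, 180), lat = runif(100, -90, 90))

# One vectorised call: distances from point 1 to every point
row1 <- distHaversine(xy1[1, , drop = FALSE], xy1)

# Whole matrix with n vectorised calls instead of n^2 scalar calls
d <- t(sapply(seq_len(nrow(xy1)),
              function(i) distHaversine(xy1[i, , drop = FALSE], xy1)))
```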

Try the built-in distm() function in the geosphere package:

z <- distm( xy1 )

The default distance function for distm(), which computes a distance matrix between sets of points, is the Haversine formula ("distHaversine"), but you may specify another via the fun argument.

On my 2.6GHz Core i7 rMBP this takes about 5 seconds for 5,000 points.
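A small sketch of the fun argument mentioned above, swapping in geosphere's distCosine (spherical law of cosines); on a sphere of the same radius it agrees with Haversine up to floating-point error:

```r
library(geosphere)

set.seed(42)
pts <- cbind(lon = runif(10, -180, 180), lat = runif(10, -90, 90))

z_hav <- distm(pts)                    # default: distHaversine
z_cos <- distm(pts, fun = distCosine)  # alternative distance function
```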

Ivo answered 24/5, 2016 at 7:42 Comment(1)
Thanks for your suggestion. This executes in less than half the time of my code. – Lipolysis

Below is a solution using the spatialrisk package. The key functions in this package are written in C++ (via Rcpp), and are therefore very fast.

library(geosphere)
library(spatialrisk)
library(data.table)

x <- rnorm(5000)
y <- rnorm(5000)
xy1 <- data.table(x, y)

# Cross join two data tables
coordinates_dt <- optiRum::CJ.dt(xy1, xy1)

system.time({
  z <- distm( xy1 )
})
# user  system elapsed 
# 14.163   3.700  19.072 

system.time({
  distances_m <- coordinates_dt[, dist_m := spatialrisk::haversine(y, x, i.y, i.x)]
})
# user  system elapsed 
# 2.027   0.848   2.913 
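For intuition, the cross-join pattern above can be mimicked in base R. This is only a sketch: hav below is a hypothetical pure-R stand-in for spatialrisk::haversine (which does the same computation in C++), and expand.grid plays the role of optiRum::CJ.dt:

```r
# Hypothetical base-R Haversine, vectorised over its arguments (radius in metres)
hav <- function(lat1, lon1, lat2, lon2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

set.seed(1)
xy <- data.frame(lon = runif(100, -180, 180), lat = runif(100, -90, 90))

# Cross join: one row per ordered pair of points (long format)
pairs <- expand.grid(i = seq_len(nrow(xy)), j = seq_len(nrow(xy)))
d_long <- hav(xy$lat[pairs$i], xy$lon[pairs$i],
              xy$lat[pairs$j], xy$lon[pairs$j])

# Reshape the long-format result back into an n x n distance matrix
dist_matrix <- matrix(d_long, nrow = nrow(xy))
```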
Generalship answered 11/11, 2019 at 11:27 Comment(3)
Can confirm, this is a very fast solution. – Trudy
What about for a distance matrix of 100 million points? – Fortson
If that happens, a data.frame with 1e16 rows will be generated, which is commonly referred to as 10 quadrillion :-) – Generalship
