calculate distance between each pair of coordinates in wide dataframe
Asked Answered
P

2

5

I want to calculate the distance between two linked set of spatial coordinates (program and admin in my fake dataset). The data are in a wide format, so both pairs of coordinates are in the same row.

library(sp)
set.seed(1)
n <- 100
program.id <- seq(1, n)
c1 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
c2 <- cbind(runif(n, -90, 90), runif(n, -180, 180))
dat <- data.frame(cbind(program.id, c1, c2))
names(dat) <- c("program.id", "program.lat", "program.long", "admin.lat", "admin.long")
head(dat)
#       program.id program.lat program.long  admin.lat  admin.long
# 1              1   -42.20844     55.70061 -41.848523   62.536404
# 2              2   -23.01770    -52.84898 -50.643849 -145.851172
# 3              3    13.11361    -82.70635   3.023431   -2.665397
# 4              4    73.47740    177.36626 -41.588893  -13.841337
# 5              5   -53.69725     48.05758 -57.389701  -44.922049
# 6              6    71.71014   -103.24507   3.343705  176.795719

I know how to create a matrix of distances among program or admin using the sp package:

ll <- c("program.lat", "program.long")
coords <- dat[ll]
dist <- apply(coords, 1, 
              function(eachPoint) spDistsN1(as.matrix(coords),
                                            eachPoint, longlat=TRUE))

But what I want to do is create a nx1 vector of distances (dist.km) between each pair of coordinates and add it to dat.

#       program.id program.lat program.long  admin.lat  admin.long  dist.km
# 1              1   -42.20844     55.70061 -41.848523   62.536404   567.35
# 2              2   -23.01770    -52.84898 -50.643849 -145.851172  8267.86
# ...

Any suggestions? I've spent a while going through old SO questions, but nothing seems quite right. Happy to be proven wrong.

Update

@Amit's solution works for my toy dataset:

apply(dat,1,function(x) spDistsN1(matrix(x[2:3],nrow=1),x[3:4],longlat=TRUE))

But I think I need to swap the order of the lat, long the order of the lat long columns so long comes before lat. From ?spDistsN1:

pts: A matrix of 2D points, first column x/longitude, second column y/latitude, or a SpatialPoints or SpatialPointsDataFrame object

Also, unless I've misunderstood the logic, I think Amit's solution should grab cols [2:3] and [4:5], not [2:3] and [3:4].

My challenge now is applying this to my actual data. I've reproduced a portion below.

library(sp)
dat <- structure(list(ID = 1:4, 
                      subcounty = c("a", "b", "c", "d"), 
                      pro.long = c(33.47627919, 31.73605491, 31.54073482, 31.51748984), 
                      pro.lat = c(2.73996953, 3.26530095, 3.21327597, 3.17784981), 
                      sub.long = c(33.47552, 31.78307, 31.53083, 31.53083), 
                      sub.lat = c(2.740362, 3.391209, 3.208736, 3.208736)), 
                 .Names = c("ID", "subcounty", "pro.long", "pro.lat", "sub.long", "sub.lat"),     
                 row.names = c(NA, 4L), class = "data.frame")
head(dat) 
#     ID subcounty pro.long  pro.lat sub.long  sub.lat
#  1   1         a 33.47628 2.739970 33.47552 2.740362
#  2   2         b 31.73605 3.265301 31.78307 3.391209
#  3   3         c 31.54073 3.213276 31.53083 3.208736
#  4   4         d 31.51749 3.177850 31.53083 3.208736
apply(dat, 1, function(x) spDistsN1(matrix(x[3:4], nrow=1),
                                    x[5:6],
                                    longlat=TRUE)) 

I get the error: Error in spDistsN1(matrix(x[3:4], nrow = 1), x[5:6], longlat = TRUE) : pts must be numeric

I'm confused because these columns are numeric:

> is.numeric(dat$pro.long)
[1] TRUE
> is.numeric(dat$pro.lat)
[1] TRUE
> is.numeric(dat$sub.long)
[1] TRUE
> is.numeric(dat$sub.lat)
[1] TRUE
Paradis answered 4/3, 2014 at 22:3 Comment(5)
have you tried: apply(dat,1,function(x) spDistsN1(matrix(x[2:3],nrow=1),x[3:4],longlat=TRUE)) ?Piatt
@amit, I had not. I figured the answer would probably involve one of the apply functions, but I did not know the correct specification of the matrix. This looks to be the solution. I'd be happy to accept it if you want to add an answer.Paradis
as long as it works and helpful - I'm happy. I don't care much for the reputation thingy, but thanks for offering.Piatt
I believe I need to change the order of my lat and long columns as per the sp help: "pts A matrix of 2D points, first column x/longitude, second column y/latitude".Paradis
@amit, I think you meant to grab columns 2:3 and 4:5, not 3:4. Right?Paradis
H
5

The problem you're having is thatapply(...) coerces the first argument to a matrix. By definition, a matrix must have all elements of the same data type. Since one of the columns in dat (dat$subcounty) is char, apply(...) coerces everything to char. In your test dataset, everything was numeric, so you didn't have this problem.

This should work:

dat$dist.km <- sapply(1:nrow(dat),function(i)
                spDistsN1(as.matrix(dat[i,3:4]),as.matrix(dat[i,5:6]),longlat=T))
Hebephrenia answered 5/3, 2014 at 8:35 Comment(2)
I came across this solution today since I had a similar situation. I like this idea. I wonder if we can make it work even better. I have a large data set like 2GB and tried this code with data.table. The processing has been actually going for some time. For each row, we are asking R to create two matrices and handle the calculation. I rather think that creating SPDF and handle the same job. At least for each row, we do not have to convert a DF to matrix. Any thought? I also wonder if there is another function handling the same job faster.Barbbarba
@jazzurro, I believe there is a faster solution using data.table and geosphere #36817923Sachet
S
4

There is a much faster solution using data.table and geosphere.

library(data.table)
library(geosphere)

setDT(dat)[ , dist_km := distGeo(matrix(c(pro.long, pro.lat), ncol = 2), 
                                  matrix(c(sub.long, sub.lat), ncol = 2))/1000] 

Benchmark:

library(sp)

jlhoward <- function(dat) { dat$dist.km <- sapply(1:nrow(dat),function(i)
                             spDistsN1(as.matrix(dat[i,3:4]),as.matrix(dat[i,5:6]),longlat=T)) }

rafa.pereira <- function(dat2) { setDT(dat2)[ , dist_km := distGeo(matrix(c(pro.long, pro.lat), ncol = 2), 
                                                                 matrix(c(sub.long, sub.lat), ncol = 2))/1000] }


> system.time( jlhoward(dat) )
   user  system elapsed 
   8.94    0.00    8.94 

> system.time( rafa.pereira(dat) )
   user  system elapsed 
   0.07    0.00    0.08 

Data

dat <- structure(list(ID = 1:4, 
                      subcounty = c("a", "b", "c", "d"), 
                      pro.long = c(33.47627919, 31.73605491, 31.54073482, 31.51748984), 
                      pro.lat = c(2.73996953, 3.26530095, 3.21327597, 3.17784981), 
                      sub.long = c(33.47552, 31.78307, 31.53083, 31.53083), 
                      sub.lat = c(2.740362, 3.391209, 3.208736, 3.208736)), 
                 .Names = c("ID", "subcounty", "pro.long", "pro.lat", "sub.long", "sub.lat"),     
                 row.names = c(NA, 4L), class = "data.frame")

# enlarge dataset to 40,000 pairs
dat <- dat[rep(seq_len(nrow(dat)), 10000), ]
Sachet answered 10/6, 2016 at 11:50 Comment(1)
Rafa, thanks for your message and your answer. Your solution is surely faster!Barbbarba

© 2022 - 2024 — McMap. All rights reserved.