R - Finding closest neighboring point and number of neighbors within a given radius, coordinates lat-long
Asked Answered
A

6

41

I am trying to figure out how isolated certain points are within my data set. I am using two methods to determine isolation, the distance of the closest neighbor and the number of neighboring sites within a given radius. All my coordinates are in latitude and longitude

This is what my data looks like:

    pond            lat         long        area    canopy  avg.depth   neighbor    n.lat   n.long  n.distance  n.area  n.canopy    n.depth n.avg.depth radius1500
    A10             41.95928    -72.14605   1500    66      60.61538462                                 
    AA006           41.96431    -72.121     250     0       57.77777778                                 
    Blacksmith      41.95508    -72.123803  361     77      71.3125                                 
    Borrow.Pit.1    41.95601    -72.15419   0       0       41.44444444                                 
    Borrow.Pit.2    41.95571    -72.15413   0       0       37.7                                    
    Borrow.Pit.3    41.95546    -72.15375   0       0       29.22222222                                 
    Boulder         41.918223   -72.14978   1392    98      43.53333333                                 

I want to put the name of the nearest neighboring pond in the column neighbor, its lat and long in n.lat and n.long, the distance between the two ponds in n.distance, and the area, canopy and avg.depth in each of the appropriate columns.

Second, I want to put the number of ponds within 1500m of the target pond into radius1500.

Does anyone know of a function or package that will help me calculate the distances/numbers that I want? If it's an issue, it won't be hard to enter the other data I need, but the nearest neighbor's name and distance, plus the number of ponds within 1500m is what I really need help with.

Thank you.

Alliance answered 24/2, 2014 at 2:11 Comment(0)
L
50

Best option is to use libraries sp and rgeos, which enable you to construct spatial classes and perform geoprocessing.

library(sp)
library(rgeos)

Read the data and transform them to spatial objects:

mydata <- read.delim('d:/temp/testfile.txt', header=T)

sp.mydata <- mydata
coordinates(sp.mydata) <- ~long+lat

class(sp.mydata)
[1] "SpatialPointsDataFrame"
attr(,"package")
[1] "sp"

Now calculate pairwise distances between points

d <- gDistance(sp.mydata, byid=T)

Find second shortest distance (closest distance is of point to itself, therefore use second shortest)

min.d <- apply(d, 1, function(x) order(x, decreasing=F)[2])

Construct new data frame with desired variables

newdata <- cbind(mydata, mydata[min.d,], apply(d, 1, function(x) sort(x, decreasing=F)[2]))

colnames(newdata) <- c(colnames(mydata), 'neighbor', 'n.lat', 'n.long', 'n.area', 'n.canopy', 'n.avg.depth', 'distance')

newdata
            pond      lat      long area canopy avg.depth     neighbor    n.lat    n.long n.area n.canopy n.avg.depth
6            A10 41.95928 -72.14605 1500     66  60.61538 Borrow.Pit.3 41.95546 -72.15375      0        0    29.22222
3          AA006 41.96431 -72.12100  250      0  57.77778   Blacksmith 41.95508 -72.12380    361       77    71.31250
2     Blacksmith 41.95508 -72.12380  361     77  71.31250        AA006 41.96431 -72.12100    250        0    57.77778
5   Borrow.Pit.1 41.95601 -72.15419    0      0  41.44444 Borrow.Pit.2 41.95571 -72.15413      0        0    37.70000
4   Borrow.Pit.2 41.95571 -72.15413    0      0  37.70000 Borrow.Pit.1 41.95601 -72.15419      0        0    41.44444
5.1 Borrow.Pit.3 41.95546 -72.15375    0      0  29.22222 Borrow.Pit.2 41.95571 -72.15413      0        0    37.70000
6.1      Boulder 41.91822 -72.14978 1392     98  43.53333 Borrow.Pit.3 41.95546 -72.15375      0        0    29.22222
        distance
6   0.0085954872
3   0.0096462277
2   0.0096462277
5   0.0003059412
4   0.0003059412
5.1 0.0004548626
6.1 0.0374480316

Edit: if coordinates are in degrees and you would like to calculate distance in kilometers, use package geosphere

library(geosphere)

d <- distm(sp.mydata)

# rest is the same

This should provide better results, if the points are scattered across the globe and coordinates are in degrees

Livialivid answered 24/2, 2014 at 7:42 Comment(11)
Thank you very much. The libraries you suggested are exactly what I needed!Alliance
This is some very informative and readable code, thanks! I am, however, unable to tweak it to my slightly different usecase: I need to find the closest points between two different datasets (I have a dataset of tweets, and I need the closest city to each tweet). What should I change?Bud
function gDistance can take two parameters - in you case tweets and cities. The resulting matrix should have all pairwise distances between the two (but beware that if you have thousands of points, this might be too much for the pc)Livialivid
Anyone know of faster way than calculating all pairwise distances using a spatial index of some sort?Byre
first: this was insanely helpful, thank you. second, is there a way to make it only find non 0.000000 distances? I'm using it to find nearest households, but it breaks when people live together, so for example, it skips the first record by using the decreasing but could you do it by something where it's not the same coordinate set. @Zbynek, thanks for any additional help you could give.Sedgewick
my data set looks like: id printid areaname latitude longitude 1 7912048 233502729 073 36.06241 -80.44229 2 735253 171241999 area 12-06 35.54452 -78.75388 3 4325564 85564887 area 12-04 35.49328 -78.73756 4 4222241 85461255 area 12-06 35.53621 -78.75553 5 11997754 356053648 area 12-04 35.49328 -78.73756 (sorry for the terrible formatting, new line doesn't seem to happen here so it looks a mess, the " 1 2 3 4 5" each indicate a new line.Sedgewick
use following as the function: sort(x[x>0], decreasing=F)[1]Livialivid
@Zbynek; brilliant! If it's not too much trouble, can you tell us why the function works? I understand how the gDistance() function works, creating all the various distances but I'm not sure why the order/decreasing works.Sedgewick
@ike; sort orders values in row/column from smallest to highest or vice versa. but since you want to omit zero distances, first you must filter the data - x[x>0]. then you sort them and finally you just take the first value in sorted array ([1]). all clear?Livialivid
The distance is calculated in km ?Dyane
@NicoCoallier it depends on units of coordinate system (for WGS, as in example, it is just a number, it does not use haversine distance)Livialivid
A
16

I add below an alternative solution using the newer sf package, for those interested and coming to this page now (as I did).

First, load the data and create the sf object.

# Using sf
mydata <- structure(
  list(pond = c("A10", "AA006", "Blacksmith", "Borrow.Pit.1", 
                "Borrow.Pit.2", "Borrow.Pit.3", "Boulder"), 
       lat = c(41.95928, 41.96431, 41.95508, 41.95601, 41.95571, 41.95546, 
               41.918223), 
       long = c(-72.14605, -72.121, -72.123803, -72.15419, -72.15413, 
                -72.15375, -72.14978), 
       area = c(1500L, 250L, 361L, 0L, 0L, 0L, 1392L), 
       canopy = c(66L, 0L, 77L, 0L, 0L, 0L, 98L), 
       avg.depth = c(60.61538462, 57.77777778, 71.3125, 41.44444444, 
                     37.7, 29.22222222, 43.53333333)), 
  class = "data.frame", row.names = c(NA, -7L))


library(sf)
data_sf <- st_as_sf(mydata, coords = c("long", "lat"),
                    # Change to your CRS
                    crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
st_is_longlat(data_sf)

sf::st_distance calculates the distance matrix in meters using Great Circle distance when using lat/lon data.

dist.mat <- st_distance(data_sf) # Great Circle distance since in lat/lon
# Number within 1.5km: Subtract 1 to exclude the point itself
num.1500 <- apply(dist.mat, 1, function(x) {
  sum(x < 1500) - 1
})

# Calculate nearest distance
nn.dist <- apply(dist.mat, 1, function(x) {
  return(sort(x, partial = 2)[2])
})
# Get index for nearest distance
nn.index <- apply(dist.mat, 1, function(x) { order(x, decreasing=F)[2] })

n.data <- mydata
colnames(n.data)[1] <- "neighbor"
colnames(n.data)[2:ncol(n.data)] <- 
  paste0("n.", colnames(n.data)[2:ncol(n.data)])
mydata2 <- data.frame(mydata,
                      n.data[nn.index, ],
                      n.distance = nn.dist,
                      radius1500 = num.1500)
rownames(mydata2) <- seq(nrow(mydata2))
mydata2
          pond      lat      long area canopy avg.depth     neighbor    n.lat    n.long n.area n.canopy
1          A10 41.95928 -72.14605 1500     66  60.61538 Borrow.Pit.1 41.95601 -72.15419      0        0
2        AA006 41.96431 -72.12100  250      0  57.77778   Blacksmith 41.95508 -72.12380    361       77
3   Blacksmith 41.95508 -72.12380  361     77  71.31250        AA006 41.96431 -72.12100    250        0
4 Borrow.Pit.1 41.95601 -72.15419    0      0  41.44444 Borrow.Pit.2 41.95571 -72.15413      0        0
5 Borrow.Pit.2 41.95571 -72.15413    0      0  37.70000 Borrow.Pit.1 41.95601 -72.15419      0        0
6 Borrow.Pit.3 41.95546 -72.15375    0      0  29.22222 Borrow.Pit.2 41.95571 -72.15413      0        0
7      Boulder 41.91822 -72.14978 1392     98  43.53333 Borrow.Pit.3 41.95546 -72.15375      0        0
  n.avg.depth n.distance radius1500
1    41.44444  766.38426          3
2    71.31250 1051.20527          1
3    57.77778 1051.20527          1
4    37.70000   33.69099          3
5    41.44444   33.69099          3
6    37.70000   41.99576          3
7    29.22222 4149.07406          0

For obtaining the closest neighbor after calculating distance, you can use sort() with the partial = 2 argument. Depending on the amount of data, this could be much faster than using order as in the previous solution. The package Rfast is likely even faster but I avoid including additional packages here. See this related post for a discussion and benchmarking of various solutions: https://mcmap.net/q/128031/-fastest-way-to-find-second-third-highest-lowest-value-in-vector-or-column

Argonaut answered 23/10, 2019 at 21:54 Comment(0)
P
3

I add below a solution using the spatialrisk package. The key functions in this package are written in C++ (Rcpp), and are therefore very fast.

First, load the data:

df <- data.frame(pond = c("A10", "AA006", "Blacksmith", "Borrow.Pit.1", 
                          "Borrow.Pit.2", "Borrow.Pit.3", "Boulder"), 
                 lat = c(41.95928, 41.96431, 41.95508, 41.95601, 
                         41.95571, 41.95546, 41.918223), 
                 long = c(-72.14605, -72.121, -72.123803, -72.15419, 
                          -72.15413, -72.15375, -72.14978), 
                 area = c(1500, 250, 361, 0, 0, 0, 1392), 
                 canopy = c(66, 0, 77, 0, 0, 0, 98), 
                 avg.depth = c(60.61538462, 57.77777778, 71.3125, 41.44444444,
                               37.7, 29.22222222, 43.53333333))

The function spatialrisk::points_in_circle calculates the observations within radius from a center point. Note that distances are calculated using the Haversine formula. Since each element of the output is a data frame, purrr::map_dfr is used to row-bind them together:

ans1 <- purrr::map2_dfr(df$long, df$lat, 
                        ~spatialrisk::points_in_circle(df, .x, .y, 
                                                       lon = long, 
                                                       radius = 100000)[2,])

colnames(ans1) <- c("neighbor", "n.lat", "n.long", "n.area", 
                    "n.canopy", "n.avg.depth", "distance_m")

      neighbor    n.lat    n.long n.area n.canopy n.avg.depth distance_m
1 Borrow.Pit.1 41.95601 -72.15419      0        0    41.44444  765.87823
2   Blacksmith 41.95508 -72.12380    361       77    71.31250 1053.35200
3        AA006 41.96431 -72.12100    250        0    57.77778 1053.35200
4 Borrow.Pit.2 41.95571 -72.15413      0        0    37.70000   33.76321
5 Borrow.Pit.1 41.95601 -72.15419      0        0    41.44444   33.76321
6 Borrow.Pit.2 41.95571 -72.15413      0        0    37.70000   42.00128
7 Borrow.Pit.3 41.95546 -72.15375      0        0    29.22222 4158.21978

Now calculate the number of ponds within 1500m of the target pond. The function spatialrisk::concentration sums the number of observations within a radius from center points. 1 is substracted from the number of ponds to exclude the pond itself.

df$npond <- 1  
radius1500 <- spatialrisk::concentration(df, df, npond, lon_sub = long, 
                                         lon_full = long, radius = 1500, 
                                         display_progress = FALSE)$concentration - 1

Column-bind the data frames together:

cbind(df, ans1, radius1500)

          pond      lat      long area canopy avg.depth     neighbor    n.lat    n.long n.area n.canopy n.avg.depth distance_m radius1500
1          A10 41.95928 -72.14605 1500     66  60.61538 Borrow.Pit.1 41.95601 -72.15419      0        0    41.44444  765.87823          3
2        AA006 41.96431 -72.12100  250      0  57.77778   Blacksmith 41.95508 -72.12380    361       77    71.31250 1053.35200          1
3   Blacksmith 41.95508 -72.12380  361     77  71.31250        AA006 41.96431 -72.12100    250        0    57.77778 1053.35200          1
4 Borrow.Pit.1 41.95601 -72.15419    0      0  41.44444 Borrow.Pit.2 41.95571 -72.15413      0        0    37.70000   33.76321          3
5 Borrow.Pit.2 41.95571 -72.15413    0      0  37.70000 Borrow.Pit.1 41.95601 -72.15419      0        0    41.44444   33.76321          3
6 Borrow.Pit.3 41.95546 -72.15375    0      0  29.22222 Borrow.Pit.2 41.95571 -72.15413      0        0    37.70000   42.00128          3
7      Boulder 41.91822 -72.14978 1392     98  43.53333 Borrow.Pit.3 41.95546 -72.15375      0        0    29.22222 4158.21978          0
Purvey answered 23/10, 2019 at 21:26 Comment(0)
E
3

Another answer which while probably slower might have an intuitive appeal for dplyr addicts.

You create a mega-grid of every single possible combination of lat/lons, then you can find the one that has the smallest distance using geosphere.

The example is where you have two datasets with different points to compare - but you can adjust it easily by duplicating the first dataset.

    library(tidyverse)
    library(geosphere)
    library(data.table)

    #This function creates a big dataframe with every possible combination
    expand.grid.df <- function(...) Reduce(function(...) merge(..., by=NULL), list(...))

shortest_distance <- expand.grid.df(df1,df2) %>%
      mutate(distance = distHaversine(p1 = cbind(lon_2,lat_2),
                                      p2 = cbind(lon,lat))) %>% 
      group_by(ACCIDENT_NO) %>% 
      slice(which.min(distance))
Esquimau answered 31/12, 2019 at 3:16 Comment(0)
B
2

In Rfast, there is a function called "dista" and computes the Euclidean or the Manhattan distances only (at the moment). It gives the option to compute the k-smallest distances. Alternatively it can return the indexes of the observations with the smallest distances. The cosinus distance is basically almost the same as the Eucledean distance (exluding a constant, 2 I think).

Bickel answered 13/11, 2019 at 7:53 Comment(1)
Any chance of an example for this?Entoderm
D
1

The Solution propose by @Zbynek is quite nice but if you are looking for a distance between two neighboor in km like I am , I am proposing this solution.

   earth.dist<-function(lat1,long1,lat2,long2){

           rad <- pi/180
           a1 <- lat1 * rad
           a2 <- long1 * rad
           b1 <- lat2 * rad
           b2 <- long2 * rad
           dlat <- b1-a1
           dlon<- b2-a2
           a <- (sin(dlat/2))^2 +cos(a1)*cos(b1)*(sin(dlon/2))^2
           c <- 2*atan2(sqrt(a),sqrt(1-a))
           R <- 6378.145
           dist <- R *c
           return(dist)
           }


    Dist <- matrix(0,ncol=length(mydata),nrow=length(mydata.sp))

  for (i in 1:length(mydata)){
      for(j in 1:length(mydata.sp)){
          Dist[i,j] <- earth.dist(mydata$lat[i],mydata$long[i],mydata.sp$lat[j],mydata.sp$long[j])
 }}



     DDD <- matrix(0, ncol=5,nrow=ncol(Dist))   ### RECTIFY the nb of col by the number of variable you want

   for(i in 1:ncol(Dist)){
       sub<- sort(Dist[,i])[2]
       DDD[i,1] <- names(sub) 
       DDD[i,2] <- sub
       DDD[i,3] <- rownames(Dist)[i]
       sub_neig_atr <- Coord[Coord$ID==names(sub),]
       DDD[i,4] <- sub_neig_atr$area
       DDD[i,5] <- sub_neig_atr$canopy
       ### Your can add any variable you want here 

   }

    DDD <- as.data.frame(DDD)

    names(DDD)<-c("neigboor_ID","distance","pond","n.area","n.canopy")
   data <- merge(mydata,DDD, by="pond")

You end up getting a distance in km if your coordinates are long and lat.

Any suggestions to make it better ?

Dyane answered 12/4, 2017 at 19:58 Comment(3)
No need to wtire it on your own, there is already geosphere package - cran.r-project.org/web/packages/geosphere/geosphere.pdfLivialivid
Which function in that package would calculate an euclidian distance in kilometers ?Dyane
I think distm and you can choose exact formula - default is Haversine, but there are more options (see the manual)Livialivid

© 2022 - 2024 — McMap. All rights reserved.