Matching data frames based on shortest geographic distance
Asked Answered
I

2

6

I have two dataframes, both of which contain latitude and longitude coordinates. The first dataframe is observations of events, where the location and time was recorded. The second dataframe is geographic features, where the location and info about the feature is recorded.

my_df_1 <- structure(list(START_LAT = c(-33.15, -35.6, -34.08333, -34.13333, 
-34.31667, -47.38333, -47.53333, -34.08333, -47.38333, -47.15
), START_LONG = c(163, 165.18333, 162.88333, 162.58333, 162.76667, 
148.98333, 148.66667, 162.9, 148.98333, 148.71667)), row.names = c(1175L, 
528L, 1328L, 870L, 672L, 707L, 506L, 981L, 756L, 210L), class = "data.frame", .Names = c("START_LAT", 
"START_LONG"))

my_df_2 <- structure(list(latitude = c(-42.7984, -34.195, -49.81, -35.417, 
-28.1487, -44.657, -42.7898, -36.245, -39.1335, -31.8482), longitude = c(179.9874, 
179.526, -176.68, 178.765, -168.0314, 174.695, -179.9873, 177.7873, 
-170.0583, 173.2424), depth_top = c(935L, 2204L, 869L, 1973L, 
4750L, 555L, 894L, 1500L, 4299L, 1303L)), row.names = c(580L, 
1306L, 926L, 1102L, 60L, 1481L, 574L, 454L, 1168L, 144L), class = "data.frame", .Names = c("latitude", 
"longitude", "depth_top"))

What I need to do, is for every observation in df1, I need to find out which feature in df2 is geographically closest. Ideally, I'd get a new column appended to df1 which every row is the closest feature from df2.

I worked through this question How to assign several names to lat-lon observations, but was unable to figure out how to match it to my data

The real dataframes have 1000s of rows, which is why I cant do this by hand

Intermediate answered 10/8, 2017 at 23:52 Comment(0)
N
4

A solution using st_distance from the sf package. my_df_final is the final output.

# Load packages
library(tidyverse)
library(sp)
library(sf)

# Create ID for my_df_1 and my_df_2 based on row id
# This step is not required, just help me to better distinguish each point
my_df_1 <- my_df_1 %>% mutate(ID1 = row.names(.))
my_df_2 <- my_df_2 %>% mutate(ID2 = row.names(.))

# Create spatial point data frame
my_df_1_sp <- my_df_1
coordinates(my_df_1_sp) <- ~START_LONG + START_LAT

my_df_2_sp <- my_df_2
coordinates(my_df_2_sp) <- ~longitude + latitude

# Convert to simple feature
my_df_1_sf <- st_as_sf(my_df_1_sp)
my_df_2_sf <- st_as_sf(my_df_2_sp)

# Set projection based on the epsg code
st_crs(my_df_1_sf) <- 4326
st_crs(my_df_2_sf) <- 4326

# Calculate the distance
m_dist <- st_distance(my_df_1_sf, my_df_2_sf)

# Filter for the nearest
near_index <- apply(m_dist, 1, order)[1, ]

# Based on the index in near_index to select the rows in my_df_2
# Combine with my_df_1
my_df_final <- cbind(my_df_1, my_df_2[near_index, ])
Nela answered 11/8, 2017 at 0:11 Comment(2)
Didn't know about sf and st_distance(). Works great. For anyone else reading this solution using Ubuntu 16.04, note that sf requires GDAL 2.x. You can follow the instructions here to install.Oneida
@StevenBeaupré Thanks for your comments and notes. The documentation of st_distance says that if unprojected long/lat data are provided, st_distance uses distGeo from the geosphere package as the default method to calculate distance. Users can specify other methods from geosphere in the dist_fun argument.Nela
W
2

Based on this answer you could do

library(geosphere)

mat <- distm(my_df_1[2:1], my_df_2[2:1], fun = distVincentyEllipsoid)
cbind(my_df_1, my_df_2[max.col(-mat),])

Which gives:

#     START_LAT START_LONG  ID1 latitude longitude depth_top  ID2
#10   -33.15000   163.0000 1175 -31.8482  173.2424      1303  144
#10.1 -35.60000   165.1833  528 -31.8482  173.2424      1303  144
#10.2 -34.08333   162.8833 1328 -31.8482  173.2424      1303  144
#10.3 -34.13333   162.5833  870 -31.8482  173.2424      1303  144
#10.4 -34.31667   162.7667  672 -31.8482  173.2424      1303  144
#6    -47.38333   148.9833  707 -44.6570  174.6950       555 1481
#6.1  -47.53333   148.6667  506 -44.6570  174.6950       555 1481
#10.5 -34.08333   162.9000  981 -31.8482  173.2424      1303  144
#6.2  -47.38333   148.9833  756 -44.6570  174.6950       555 1481
#6.3  -47.15000   148.7167  210 -44.6570  174.6950       555 1481
Workroom answered 11/8, 2017 at 0:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.