Calculate the distance between two points of two datasets (nearest neighbor)
Asked Answered
D

2

6

I want to calculate the distance between two points in two different datasets. I don't want to calculate the distance between all points - just to the nearest point of datasetB.
Some examples:

Dataset A - Persons
http://pastebin.com/HbaeqACi

Dataset B - Waterfeatures:
http://pastebin.com/UdDvNtHs

Dataset C - City:
http://pastebin.com/nATnkMRk

So...I want to calculate the distance of each person to the nearest waterfeature point.
I've already tried to work with the rgeos package and after struggling with some projections errors, I've got it to work. But this calculate (at least I assume it) all distances to every point, but, as already said, I've only interested in the distance to the nearest waterfeature point.

# load csv files
persons = read.csv("persons.csv", header = TRUE)
water = read.csv("water.csv", header = TRUE)
# change dataframes to SpatialPointDataFrame and assign a projection
library(sp)
library(rgeos)
coordinates(persons) <- c("POINT_X", "POINT_Y")
proj4string(persons) <- CRS("+proj=utm +datum=WGS84")
coordinates(water) <- c("POINT_X", "POINT_Y")
proj4string(water) <- CRS("+proj=utm +datum=WGS84")

# use rgoes package to calculate the distance
distance <- gDistance(persons, water, byid=TRUE)
# works, but calculates a huge number of distances

Is there any parameter, which I've missed. Or do I need to use another package or function? I've also looked at spatstat, which is able to calculate the distance to the nearest neighbor, but not of two different datasets: http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/spatstat/html/nndist.html


Edit:
The complete R-Script including plotting of the datasets:

library(RgoogleMaps)
library(ggplot2)
library(ggmap)
library(sp)
library(fossil)

#load data
persons = read.csv("person.csv", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("water.csv", header =TRUE, stringsAsFactors=FALSE)
city = read.csv("city.csv", header =TRUE)

# plot data
persons_ggplot2 <- persons
city_ggplot2 <- city
water_ggplot2 <- water
gc <- geocode('new york, usa')
center <- as.numeric(gc)  
G <- ggmap(get_googlemap(center = center, color = 'bw', scale = 1, zoom = 11, maptype = "terrain", frame=T), extent="device")
G1 <- G + geom_point(aes(x=POINT_X, y=POINT_Y ),data=city, shape = 22, color="black", fill = "yellow", size = 4) + geom_point(aes(x=POINT_X, y=POINT_Y ),data=persons, shape = 8, color="red", size=2.5) + geom_point(aes(x=POINT_X, y=POINT_Y ),data=water_ggplot2, color="blue", size=1)
plot(G1)

#### calculate distance
# Generate unique coordinates dataframe
UniqueCoordinates <- data.frame(unique(persons[,4:5]))
UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)

# Generate a function that looks for the closest waterfeature for each id coordinates
NearestW <- function(id){
tmp <- UniqueCoordinates[UniqueCoordinates$Id==id, 1:2]
WaterFeatures <- rbind(tmp,water[,2:3])
tmp1 <- earth.dist(WaterFeatures, dist=TRUE)[1:(nrow(WaterFeatures)-1)]
tmp1 <- which.min(tmp1)
tmp1 <- water[tmp1,1]
tmp1 <- data.frame(tmp1, WaterFeature=tmp)
return(tmp1)
}

#apply to each id and the merge
CoordinatesWaterFeature <- ldply(UniqueCoordinates$Id, NearestW)
persons <- merge(persons, CoordinatesWaterFeature, by.x=c(4,5), by.y=c(2,3))

enter image description here

Dehumidify answered 1/3, 2014 at 23:21 Comment(0)
G
5

What about writing a function that looks for the nearest waterfeature for every person?

#requires function earth.dist from "fossil" package
require(fossil)

#load data
persons = read.csv("person.csv", header = TRUE, stringsAsFactors=FALSE)
water = read.csv("water.csv", header =TRUE, stringsAsFactors=FALSE)

#Generate unique coordinates dataframe
UniqueCoordinates <- data.frame(unique(persons[,4:5]))
UniqueCoordinates$Id <- formatC((1:nrow(UniqueCoordinates)), width=3,flag=0)


#Generate a function that looks for the closest waterfeature for each id coordinates
NearestW <- function(id){
   tmp <- UniqueCoordinates[UniqueCoordinates$Id==id, 1:2]
   WaterFeatures <- rbind(tmp,water[,2:3])
   tmp1 <- earth.dist(WaterFeatures, dist=TRUE)[1:(nrow(WaterFeatures)-1)]
   tmp1 <- min(tmp1)
   tmp1 <- data.frame(tmp1, WaterFeature=tmp)
   return(tmp1)
 }

#apply to each id and the merge
CoordinatesWaterFeature <- ldply(UniqueCoordinates$Id, NearestW)
persons <- merge(persons, CoordinatesWaterFeature, by.x=c(4,5), by.y=c(2,3))

NOTE: I've added a stringsAsFactors parameter to the original read.csv , it make the merging easier at the end

NOTE:Column tmp1 notes the number of METERS to the nearest water feature

Grolier answered 2/3, 2014 at 0:38 Comment(13)
Thanks for this tip. Unfortunately I get the following error aftetr executing persons$nearest <- sapply(persons$Id, NearestW): Error in $<-.data.frame(*tmp*, "nearest", value = list()) : replacement has 0 rows, data has 164. I've save it to a new data.frame with persons_nearest <- sapply(persons$Id, NearestW), but the output is a empty list.Dehumidify
What output do you get if you run NearestW("001") ?Grolier
By running NearestW("001") I get the following output: > [1] 0Dehumidify
I'm not sure what the problem is. I loaded the data from the links you provided and I've gotten it to work in my machine but maybe it imported differently from the one you got. Just to check... after doing the read.csv is persons 5 columns wide and water 3 columns wide?Grolier
Argh...I was even too bad to paste your code properly into my R-Script (I've missed the line, where a person ID is generated). Everything works now :) Btw: You've said, the computation could be more faster by calculating the neareast water feature for each unique set of coordinates and then merge these. How does this look? Because...I've transferred your approach to my real dataset (about 2500 persons and 40 cities (so these 2500 persons are located over 40 locations) and the calculation needs already 20min with an i5 @ 4*4Ghz oO.Dehumidify
Don't worry. I've edited the code to reflect the unique coordinate thing and then merge with persons.dfGrolier
eclark: You're my hero! The performance difference is incredible! I think, tomorrow there will occur some more questions, so I'm very glad that you've helped me (also thanks for yesterday :)). One questions left: The tmp1 is the distance right? It the distance in meters?Dehumidify
No problem. Glad to help. Yeah tmp1 is the distance and its in kmsGrolier
Thanks. Sorry for this (maybe stupid) question: kms is = meters?Dehumidify
Its kilometers so its 1000 metersGrolier
Mhh...this is crazy, because no of the persons is more than a few hundreds meters away from a waterfeature point?! I've added the plotting and the plot itself to my first post to visualize the fact. Is there an error in the calculation or did I miss anything else?Dehumidify
Sorry I've miss read, tmp1 is NOT the distance, tmp1 is the id of the nearest water feature as per the first column of the water data.frame. Does that make sense? Give me a second to make tmp1 the distance. I'll edit it in the answerGrolier
Awesome! Works like a charm. I've checked the distance with ArcGIS' measurement tool and it's the same. Thanks! <3Dehumidify
I
2

Maybe I'm a little too late, but you can use spatstat to compute distances between two different datasets. The command is nncross. The arguments you have to use are two objects of type ppp, which you can create using the as.ppp() function.

Iamb answered 22/11, 2015 at 20:39 Comment(2)
Maybe to late for the reason why I've asked the question, but definitely helpful for the future! Thanks! :)Dehumidify
However, if you use spatstat it is crucial that you project to planar coordinates first. It does not automagically recognize longitude,latitude data. However, the C code for nncross is quite efficient, so you may experience a significant speed gain for large datasets.Villous

© 2022 - 2024 — McMap. All rights reserved.