Greatest distance between set of longitude/latitude points

I have a set of lng/lat coordinates. What would be an efficient method of calculating the greatest distance between any two points in the set (the "maximum diameter" if you will)?

A naive way is to use the Haversine formula to calculate the distance between every pair of points and take the maximum, but obviously this doesn't scale well.

Edit: the points are located on a sufficiently small area, measuring the area in which a person carrying a mobile device was active in the course of a single day.

Tamarau asked 31/5, 2013 at 20:16 Comment(7)
if distances are "small" (e.g., tens of mi/km), a simpler formula will provide a much better constant factor for the solutionTensimeter
Could you give an example?Tamarau
nearly identical to #7129982 Closest vs. Furthest should be a trivial difference between your problem and that problem.Nkrumah
it's in my answer, I hope it's clear enoughTensimeter
also #9589630Nkrumah
See sp package for ?spDists and geosphere package for other options on the distance calculation.Mg
@hatchet: closest vs. furthest is not a trivial differenceTensimeter

I think that the following could be a useful approximation, which scales linearly instead of quadratically with the number of points, and is quite easy to implement:

  1. calculate the center of mass M of the points
  2. find the point P0 that has the maximum distance to M
  3. find the point P1 that has the maximum distance to P0
  4. approximate the maximum diameter with the distance between P0 and P1

This can be generalized by repeating step 3 N times, and taking the distance between PN-1 and PN.

Step 1 can be carried out efficiently by approximating M as the average of the longitudes and latitudes, which is OK when distances are "small" and the poles are sufficiently far away. The other steps could be carried out using the exact distance formula, but they are much faster if the points' coordinates can be approximated as lying on a plane. Once the "distant pair" (hopefully the pair with the maximum distance) has been found, its distance can be re-calculated with the exact formula.

An example of approximation could be the following: if φ(M) and λ(M) are latitude and longitude of the center of mass calculated as Σφ(P)/n and Σλ(P)/n,

  • x(P) = (λ(P) - λ(M) + C) cos(φ(P))
  • y(P) = φ(P) - φ(M) [ this is only for clarity, it can also simply be y(P) = φ(P) ]

where C is usually 0, but can be ± 360° if the set of points crosses the λ=±180° line. To find the maximum distance you simply have to find

  • max((x(PN) - x(PN-1))² + (y(PN) - y(PN-1))²)

(you don't need the square root because it is monotonic)

The same coordinate transformation could be used to repeat step 1 (in the new coordinate system) in order to have a better starting point. I suspect that if some conditions are met, the above steps (without repeating step 3) always lead to the "true distant pair" (my terminology). If I only knew which conditions...
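
As an illustration only (not part of the original answer), one way to realize the "+C" term is to rebase the longitudes around a reference meridian before applying the transformation; rebase_lng below is a hypothetical helper, and lng, lat are assumed to be the input vectors in degrees:

# Hypothetical helper (my addition): shift longitudes into a 360° window centred
# on a reference meridian, so that a point set straddling the λ = ±180° line
# becomes contiguous; this plays the role of the "+C" term (C = 0 or ±360°).
rebase_lng = function(lng, ref = lng[1]) ((lng - ref + 180) %% 360) - 180 + ref

# e.g. rebase_lng(c(179.5, -179.8, 178.9)) gives 179.5, 180.2, 178.9
lng2 = rebase_lng(lng)                  # lng, lat: input vectors in degrees
rad  = pi / 180
x = (lng2 - mean(lng2)) * cos(lat * rad)
y = lat - mean(lat)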

EDIT:

I hate building on others' solutions, but someone will have to.

Still keeping the above 4 steps, with the optional (but probably beneficial, depending on the typical distribution of points) repetition of step 3, and following Spacedman's solution, doing the calculations in 3D removes the restrictions that the points be close together and far from the poles:

  • x(P) = sin(φ(P))
  • y(P) = cos(φ(P)) sin(λ(P))
  • z(P) = cos(φ(P)) cos(λ(P))

(the only approximation is that this holds only for a perfect sphere)

The center of mass is given by x(M) = Σx(P)/n, etc., and the maximum one has to look for is

  • max((x(PN) - x(PN-1))² + (y(PN) - y(PN-1))² + (z(PN) - z(PN-1))²)

So: you first transform spherical to cartesian coordinates, then start from the center of mass, to find, in at least two steps (steps 2 and 3), the farthest point from the preceding point. You could repeat step 3 as long as the distance increases, perhaps with a maximum number of repetitions, but this won't take you away from a local maximum. Starting from the center of mass is not of much help, either, if the points are spread all over the Earth.

EDIT 2:

I learned enough R to write down the core of the algorithm (a nice language for data analysis!).

For the plane approximation, ignoring the problem around the λ=±180° line:

# input: lng, lat (vectors)
rad = pi / 180;
x = (lng - mean(lng)) * cos(lat * rad)
y = (lat - mean(lat))
i = which.max((x - mean(x))^2 + (y       )^2)
j = which.max((x - x[i]   )^2 + (y - y[i])^2)
# output: i, j (indices)

On my PC it takes less than a second to find the indices i and j for 1000000 points.
The following 3D version is a bit slower, but works for any distribution of points (and does not need to be amended when the λ=±180° line is crossed):

# input: lng, lat
rad = pi / 180
x = sin(lat * rad)
f = cos(lat * rad)
y = sin(lng * rad) * f
z = cos(lng * rad) * f
i = which.max((x - mean(x))^2 + (y - mean(y))^2 + (z - mean(z))^2)
j = which.max((x - x[i]   )^2 + (y - y[i]   )^2 + (z - z[i]   )^2)
k = which.max((x - x[j]   )^2 + (y - y[j]   )^2 + (z - z[j]   )^2) # optional
# output: j, k (or i, j)

The calculation of k can be left out (i.e., the result could be given by i and j), depending on the data and on the requirements. On the other hand, my experiments have shown that calculating further indices beyond k brings no improvement.

It should be remembered that, in any case, the distance between the resulting points is only a lower bound on the "diameter" of the set, although it will very often be the diameter itself (how often depends on the data).
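
As a hedged addition (not in the original answer): once a pair (i, j) has been selected by either snippet above, its exact great-circle distance can be recomputed with the haversine formula, assuming a spherical Earth with a mean radius of 6371 km and reusing rad, lng, lat, i and j from above:

R.earth  = 6371000                     # assumed mean Earth radius, metres
dphi     = (lat[j] - lat[i]) * rad
dlam     = (lng[j] - lng[i]) * rad
a        = sin(dphi / 2)^2 + cos(lat[i] * rad) * cos(lat[j] * rad) * sin(dlam / 2)^2
diameter = 2 * R.earth * asin(sqrt(a)) # haversine distance of the selected pair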

EDIT 3:

Unfortunately the relative error of the plane approximation can, in extreme cases, be as much as 1-1/√3 ≅ 42.3%, which may be unacceptable, even if very rare. The algorithm can be modified to have an upper bound of approximately 20%, which I derived by compass and straight-edge (the analytic solution is cumbersome). The modified algorithm finds a pair of points with a locally maximal distance, then repeats the same steps starting from the midpoint of the first pair, possibly finding a different pair:

# input: lng, lat
rad = pi / 180
x = (lng - mean(lng)) * cos(lat * rad)
y = (lat - mean(lat))
i.n_1 = 1 # n_1: n-1
x.n_1 = mean(x)
y.n_1 = 0 # = mean(y)
s.n_1 = 0 # s: square of distance
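# first pass: starting from the center of mass (mean(x), 0), repeatedly jump
# to the point farthest from the current one, until the squared distance
# stops increasing (a locally maximal pair)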
repeat {
   s = (x - x.n_1)^2 + (y - y.n_1)^2
   i.n = which.max(s)
   x.n = x[i.n]
   y.n = y[i.n]
   s.n = s[i.n]
   if (s.n <= s.n_1) break
   i.n_1 = i.n
   x.n_1 = x.n
   y.n_1 = y.n
   s.n_1 = s.n
}
i.m_1 = 1
x.m_1 = (x.n + x.n_1) / 2
y.m_1 = (y.n + y.n_1) / 2
s.m_1 = 0
m_ok  = TRUE
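# second pass: restart the same search from the midpoint of the pair found
# above; m_ok is set to FALSE if the search just lands back on that pair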
repeat {
   s = (x - x.m_1)^2 + (y - y.m_1)^2
   i.m = which.max(s)
   if (i.m == i.n || i.m == i.n_1) { m_ok = FALSE; break }
   x.m = x[i.m]
   y.m = y[i.m]
   s.m = s[i.m]
   if (s.m <= s.m_1) break
   i.m_1 = i.m
   x.m_1 = x.m
   y.m_1 = y.m
   s.m_1 = s.m
}
if (m_ok && s.m > s.n) {
   i = i.m
   j = i.m_1
} else {
   i = i.n
   j = i.n_1
}
# output: i, j

The 3D algorithm can be modified in a similar way. It is possible (both in the 2D and in the 3D case) to start over once again from the midpoint of the second pair of points (if found). The upper bound in this case is "left as an exercise for the reader" :-).

Comparison of the modified algorithm with the (too) simple algorithm has shown, for normal and for square uniform distributions, a near doubling of processing time and an order-of-magnitude reduction of the average error, from 0.6% to 0.03%. A further restart from the midpoint gives a slightly better average error, but an almost identical maximum error.

EDIT 4:

I have yet to study this article, but it looks like the 20% I found with compass and straight-edge is in fact 1-1/√(5-2√3) ≅ 19.3%.

Tensimeter answered 31/5, 2013 at 20:17 Comment(6)
Perhaps you can provide a small practical example of how this would work in r? (the language the OP is trying to achieve this in).Goshawk
@SimonO101: sorry, I don't know r :-(Tensimeter
@SimonO101: now I know a bit of R :-)Tensimeter
Great! Glad to hear it and nice solution and algorithm. +1 from me.Goshawk
Thanks. Fast and simple approximation, works great for my problem. The final distance can be calculated using geosphere::distHaversine(c(lng[i], lat[i]), c(lng[j], lat[j]))Tamarau
@Jeroen: never accept code from strangers, if it is not thoroughly tested and reviewed ;-) After you accepted my answer, I came to fear that you or someone else might use my code in production. So I did some real tests and also some real work with compass and straight-edge. To fix what I found, I had to modify the algorithm, which now takes nearly twice as long on average, but lets me sleep much better (and maybe you too).Tensimeter

Theorem #1: The ordering of any two great-circle distances along the surface of the earth is the same as the ordering of the straight-line distances between the points when you tunnel through the earth.

Hence turn your lat-long into x,y,z based either on a spherical earth of arbitrary radius or an ellipsoid of given shape parameters. That's a couple of sines/cosines per point (not per pair of points).

Now you have a standard 3-d problem that doesn't rely on computing Haversine distances. The distance between points is just Euclidean (Pythagoras in 3d). Needs a square-root and some squares, and you can leave out the square root if you only care about comparisons.

There may be fancy spatial tree data structures to help with this. Or algorithms such as http://www.tcs.fudan.edu.cn/rudolf/Courses/Algorithms/Alg_ss_07w/Webprojects/Qinbo_diameter/2d_alg.htm (click 'Next' for 3d methods). Or C++ code here: http://valis.cs.uiuc.edu/~sariel/papers/00/diameter/diam_prog.html

Once you've found your maximum distance pair, you can use the Haversine formula to get the distance along the surface for that pair.
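
A rough R sketch of this pipeline (my illustration, not Spacedman's code): it converts lng/lat vectors (in degrees) to tunnel coordinates on a unit sphere, finds the maximum chord by brute force (fine for small n; for large n, substitute one of the diameter algorithms linked above), and converts the winning chord to a surface distance with the chord-to-arc relation, which on a sphere is equivalent to the haversine step:

rad <- pi / 180
p <- cbind(cos(lat * rad) * cos(lng * rad),    # "tunnel" (Cartesian) coordinates
           cos(lat * rad) * sin(lng * rad),    # on a unit sphere
           sin(lat * rad))
d   <- as.matrix(dist(p))                      # straight-line (chord) lengths, O(n^2)
ind <- which(d == max(d), arr.ind = TRUE)[1, ] # indices of the farthest pair
R.earth <- 6371000                             # assumed mean radius, metres
surface <- 2 * R.earth * asin(d[ind[1], ind[2]] / 2)  # chord -> great-circle distance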

Stegman answered 1/6, 2013 at 7:40 Comment(10)
correct, and always applicable, while my solution is not. The only drawback is that it is O(n log n) (but only approximations can be better than that)Tensimeter
although... in THEORY, your theorem #1 only holds on a perfectly spherical Earth, not on a generic ellipsoid...Tensimeter
You'll notice there's no proof of theorem #1 :) I should probably have called it an unproved hypothesis... I'm still trying to find a counter-example for an ellipsoid... Ah, polar distance vs distance between diametrically opposite points on the equator for an oblate spheroid...Stegman
Nifty. To prove your theorem, refer to the formula relating chord length and angle (equivalently, great-circle length on a sphere). The relationship (on a unit circle) is "chord length = 2*sin(angle/2)", which is monotonically increasing for angles from 0 to pi, which proves your point about the ordering of the two quantities being identical.Mcnally
It is worth noting that the C++ code cited (valis.cs.uiuc.edu/~sariel/papers/00/diameter/diam_prog.html) is worst-case quadratic but is allegedly linear "for most real inputs". A nice find.Deutschland
A competing algorithm: www-sop.inria.fr/members/Gregoire.Malandain/diameter Has source code.Deutschland
Could you perhaps add some pseudo/R code on the step to convert lat/lng to the euclidean system?Tamarau
@Jeroen - require(rgdal) ; spTransform(points, CRS=CRS("+proj=merc +ellps=WGS84")), and for the convex hull use rgeos::gConvexHullDeutschland
@DeerHunter that converts to a 2d coord system. I'm talking about converting to a 3d coord system centred at the centre of the earth.Stegman
Spacedman - for the task @Jeroen faces, 3D solution is an overkill.Deutschland

Here's a naive example that doesn't scale well (as you say), but it might help with building a solution in R.

## lonlat points
n <- 100
d <- cbind(runif(n, -180, 180), runif(n, -90, 90))


library(sp)
## distances on WGS84 ellipsoid
x <- spDists(d, longlat = TRUE)

## row, then column index of furthest points
ind <- c(row(x)[which.max(x)], col(x)[which.max(x)])

## maps
library(maptools)
data(wrld_simpl)
plot(as(wrld_simpl, "SpatialLines"), col = "grey")

points(d, pch = 16, cex = 0.5)

## draw the two farthest points and a line between them
points(d[ind, ], pch = 16)
lines(d[ind, ], lwd = 2)


## for extra credit, draw the great circle on which the furthest points lie
library(geosphere)


lines(greatCircle(d[ind[1], ], d[ind[2], ]), col = "firebrick")

[Figure: furthest distance on the WGS84 ellipsoid between the sample points]

The geosphere package provides more options for distance calculation if that's needed. See ?spDists in sp for the details used here.
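
For completeness, a small hedged variant (my addition, not part of the original answer): the same farthest-pair search, but with a geosphere haversine distance matrix in place of spDists:

library(geosphere)
## alternative distance matrix (haversine, in metres) via geosphere
x2   <- distm(d, d, fun = distHaversine)
ind2 <- c(row(x2)[which.max(x2)], col(x2)[which.max(x2)])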

Mg answered 1/6, 2013 at 0:5 Comment(2)
+1 for demonstrating the sp & geosphere machinery. I feel like for large numbers of points, the quickest search might be one that: (1) sections the globe's surface into a grid; (2) calculates minimum and maximum distances among all occupied grid cells (using their vertices); (3) only keeps points in the set of cells that are in their entirety farther from one another than any other cells; and then (4) subdivides them, repeating steps 2, 3, and 4 until the number of points has been sufficiently winnowed down. Lots of book-keeping required, but it should run pretty quickly in most cases.Mcnally
I was thinking something similar, you could rough it up with raster pretty easily probably, but it's not something for me today. It's a good problem, hopefully I'll have a chance to explore some of these ideas (and Walter's). I tried this on 20000 points and it grinds through it ok but it's very wasteful, and 50000 was too much for 16Gb RAM. :)Mg

You don't tell us whether these points are located in a sufficiently small part of the globe. For truly global sets of points, my first guess would be running a naive O(n^2) algorithm, possibly getting a performance boost from some spatial indexing (R*-trees, octrees, etc.). The idea is to pre-generate the list of the n*(n-1)/2 pairs in the triangle of the distance matrix and feed it in chunks to a fast distance library, to minimize I/O and process churn. Haversine is fine; you could also do it with Vincenty's method (the greatest contributor to running time is the quadratic complexity, not the (fixed number of) iterations in Vincenty's formula). As a side note, you don't actually need R for this stuff.
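
A minimal sketch of that chunked brute force (my illustration, assuming lng and lat are the coordinate vectors in degrees and block is an arbitrary chunk size chosen to bound memory):

library(geosphere)                        # distm() / distHaversine()
pts   <- cbind(lng, lat)                  # geosphere expects (lon, lat)
block <- 10000
best  <- 0
for (a in seq(1, nrow(pts), by = block)) {
  b    <- min(a + block - 1, nrow(pts))
  d    <- distm(pts[a:b, , drop = FALSE], pts, fun = distHaversine)
  best <- max(best, d)                    # running maximum distance, metres
}
best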

EDIT #2: The Barequet-Har-Peled algorithm (pointed out by Spacedman in his reply) has O((n + 1/ε³) log(1/ε)) complexity for ε > 0, and is worth exploring.

For the quasi-planar problem, this is known as the "diameter of the convex hull" problem and has three parts (a short R sketch follows below):

  1. Computing the convex hull with Graham's scan, which is O(n log n) - in fact, one should first transform the points into a transverse Mercator projection (centred on the centroid of the data set).
  2. Finding antipodal point pairs with the rotating-calipers algorithm - linear, O(n).
  3. Finding the largest distance among all antipodal pairs - linear search, O(n).

The link with pseudo-code and discussion: http://fredfsh.com/2013/05/03/convex-hull-and-its-diameter/
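
A hedged R sketch of the three parts (my illustration, not from the linked page): xy is assumed to already hold the projected planar coordinates from step 1; base R's chull() gives the hull, and because hulls are usually small, a plain scan over the hull vertices stands in for rotating calipers:

hull <- chull(xy)                    # convex-hull vertex indices (grDevices)
h    <- xy[hull, , drop = FALSE]
dm   <- as.matrix(dist(h))           # distances between hull vertices only
best <- which(dm == max(dm), arr.ind = TRUE)[1, ]
pair <- hull[best]                   # row indices (into xy) of the farthest pair
diam <- max(dm)                      # diameter of the convex hull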

See also the discussion on a related question here: https://gis.stackexchange.com/questions/17358/how-can-i-find-the-farthest-point-from-a-set-of-existing-points

EDIT: Spacedman's solution pointed me to the Malandain-Boissonnat algorithm (see the paper in pdf here). However, in the worst case it is no better than the naive brute-force O(n^2) algorithm.

Deutschland answered 1/6, 2013 at 18:39 Comment(0)
