How to find the Geometric Median

The question is:

Given N points (in 2D) with x and y coordinates, find a point P (among the N given points) such that the sum of distances from the other (N-1) points to P is minimized.

This point is commonly known as the geometric median. Is there any efficient algorithm to solve this problem, other than the naive O(N^2) one?

Shiest answered 17/10, 2012 at 12:20 Comment(15)
Well, if you calculate the center of all points (avg(x), avg(y)), it would be the point out of the N that is closest to that center. Sounds more like an O(N) algorithm.Banish
@JeroenVuurens: I don't think that works -- I think for [(-L,0), (L,0)]*25 + [(0,1), (0,2), (0,3)] where L is large you'll pick (0,1) instead of (0,2)Agnate
Did you check the Wikipedia article (en.wikipedia.org/wiki/Geometric_median#Computation) about the computation of the geometric median?Ceremony
@Ceremony it isn't the geometric median the OP's looking for, actually. Although the question is similar.Clara
@Qnan: I don't see the difference between the OP's problem and the geometric median, could you elaborate?Grivet
@larsmans the real geometric median doesn't have to belong to the set of points in question.Clara
I think my suggestion rather minimizes the sum of squared distances to all points, instead of the sum of distances. So if it is the geometric median you are after, brute force seems to be the solution.Banish
@JeroenVuurens I agree. Although some pruning may help the performance on large datasets.Clara
And BTW, such questions may be more suited for cs.stackexchange.comClara
@Qnan: ah, you're right, it's the medoid rather than the median.Grivet
@Qnan, interesting idea, not sure if that works. If you imagine a big blob of dots, it can even be one on the edge if multiple dots can be stacked on top of each other.Banish
@larsmans don't think OP is referring to medoids. And anyway, any K-means algorithm converges to a local optimum, so if the assignment is to find the best solution, there is no guarantee there.Banish
@JeroenVuurens: it's the medoid per definition: a point "whose average dissimilarity to all the objects in the cluster is minimal". That's equivalent to the sum of distances because there's only one "cluster" here, the set of N given points (just divide all the distances by N). And I wasn't suggesting use of K-means, that wouldn't make any sense.Grivet
pnas.org/content/97/4/1423.full.pdfJaniuszck
Recent improvements have been made arxiv.org/abs/1606.05225Reprint
C
25

I solved something similar for a local online judge once using simulated annealing. That was the official solution as well and the program got AC.

The only difference was that the point I had to find did not have to be part of the N given points.

This was my C++ code, and N could be as large as 50000. The program executes in 0.1 s on a 2 GHz Pentium 4.

// header files for IO functions and math
#include <cstdio>
#include <cmath>

// the maximum value n can take
const int maxn = 50001;

// given a point (x, y) on a grid, we can find its left/right/up/down neighbors
// by using these constants: (x + dx[0], y + dy[0]) = upper neighbor etc.
const int dx[] = {-1, 0, 1, 0};
const int dy[] = {0, 1, 0, -1};

// controls the precision - this should give you an answer accurate to 3 decimals
const double eps = 0.001;

// input and output files
FILE *in = fopen("adapost2.in","r"), *out = fopen("adapost2.out","w");

// stores a point in 2d space
struct punct
{
    double x, y;
};

// how many points are in the input file
int n;

// stores the points in the input file
punct a[maxn];

// stores the answer to the question
double x, y;

// finds the sum of (euclidean) distances from each input point to (x, y)
double dist(double x, double y)
{
    double ret = 0;

    for ( int i = 1; i <= n; ++i )
    {
        double dx = a[i].x - x;
        double dy = a[i].y - y;

        ret += sqrt(dx*dx + dy*dy); // classical distance formula
    }

    return ret;
}

// reads the input
void read()
{
    fscanf(in, "%d", &n); // read n from the first 

    // read n points next, one on each line
    for ( int i = 1; i <= n; ++i )
        fscanf(in, "%lf %lf", &a[i].x, &a[i].y), // reads a point
        x += a[i].x,
        y += a[i].y; // we add the x and y at first, because we will start by approximating the answer as the center of gravity

    // divide by the number of points (n) to get the center of gravity
    x /= n; 
    y /= n;
}

// implements the solving algorithm
void go()
{
    // start by finding the sum of distances to the center of gravity
    double d = dist(x, y);

    // our step value, chosen by experimentation
    double step = 100.0;

    // done is used to keep track of updates: if none of the neighbors of the current
    // point that are *step* away improve the solution, then *step* is too big
    // and we need to look closer to the current point, so we must halve *step*.
    int done = 0;

    // while we still need a more precise answer
    while ( step > eps )
    {
        done = 0;
        for ( int i = 0; i < 4; ++i )
        {
            // check the neighbors in all 4 directions.
            double nx = (double)x + step*dx[i];
            double ny = (double)y + step*dy[i];

            // find the sum of distances to each neighbor
            double t = dist(nx, ny);

            // if a neighbor offers a better sum of distances
            if ( t < d )
            {
                // update the current minimum
                d = t;
                x = nx;
                y = ny;

                // an improvement has been made, so
                // don't halve *step* in the next iteration, because we might need
                // to jump the same amount again
                done = 1;
                break;
            }
        }

        // halve the step size, because no update has been made, so we might have
        // jumped too much, and now we need to head back some.
        if ( !done )
            step /= 2;
    }
}

int main()
{
    read();
    go();

    // print the answer with 4 decimal points
    fprintf(out, "%.4lf %.4lf\n", x, y);

    return 0;
}

Then I think it's correct to pick the one from your list that is closest to the (x, y) returned by this algorithm.
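
That last step could be a simple linear scan, in the style of the code above (a sketch; as the comments below discuss, there is no proof this always picks the optimal input point):

// after go() has run, find the input point closest to the
// approximate geometric median (x, y)
int closest()
{
    int best = 1;
    double bestDist = 1e18;
    for ( int i = 1; i <= n; ++i )
    {
        double ddx = a[i].x - x;
        double ddy = a[i].y - y;
        double t = sqrt(ddx*ddx + ddy*ddy);
        if ( t < bestDist )
        {
            bestDist = t;
            best = i;
        }
    }
    return best; // a[best] is the candidate answer from the input set
}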

This algorithm takes advantage of what this Wikipedia paragraph on the geometric median says:

However, it is straightforward to calculate an approximation to the geometric median using an iterative procedure in which each step produces a more accurate approximation. Procedures of this type can be derived from the fact that the sum of distances to the sample points is a convex function, since the distance to each sample point is convex and the sum of convex functions remains convex. Therefore, procedures that decrease the sum of distances at each step cannot get trapped in a local optimum.

One common approach of this type, called Weiszfeld's algorithm after the work of Endre Weiszfeld,[4] is a form of iteratively re-weighted least squares. This algorithm defines a set of weights that are inversely proportional to the distances from the current estimate to the samples, and creates a new estimate that is the weighted average of the samples according to these weights. That is,

y_{k+1} = ( sum_i x_i / ||x_i - y_k|| ) / ( sum_i 1 / ||x_i - y_k|| )

The first paragraph above explains why this works: the function we are trying to optimize does not have any local minima other than the global one, so you can greedily find the minimum by iteratively improving the estimate.

Think of this as a sort of binary search. First, you approximate the result: a good first approximation is the center of gravity, which my code computes when reading the input. Then you check whether points adjacent to it give you a better solution; here, a point is considered adjacent if it is at a distance of *step* from your current point. If one of them is better, it is fine to discard your current point and move to it, because, as I said, the nature of the function you are trying to minimize guarantees this cannot trap you in a local minimum.

After this, you halve the step size, just like in binary search, and continue until you have what you consider to be a good enough approximation (controlled by the eps constant).

The complexity of the algorithm therefore depends on how accurate you want the result to be.
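
For reference, here is a minimal sketch of the Weiszfeld iteration described in the quote, reusing the globals (a, n, x, y, eps) from my code above; the iteration cap and the coincident-point guards are simplifications of mine, not part of the textbook algorithm:

// one Weiszfeld step replaces (x, y) by the weighted average of the
// samples, with weights 1 / distance(sample, current estimate)
void weiszfeld()
{
    for ( int iter = 0; iter < 1000; ++iter )
    {
        double sx = 0, sy = 0, sw = 0;
        for ( int i = 1; i <= n; ++i )
        {
            double ddx = a[i].x - x;
            double ddy = a[i].y - y;
            double d = sqrt(ddx*ddx + ddy*ddy);
            if ( d < 1e-12 ) continue; // skip a sample coinciding with the estimate
            sx += a[i].x / d;          // weight each sample by 1 / distance
            sy += a[i].y / d;
            sw += 1.0 / d;
        }
        if ( sw == 0 ) break;          // all samples coincide with the estimate
        double nx = sx / sw, ny = sy / sw;
        double moved = fabs(nx - x) + fabs(ny - y);
        x = nx;
        y = ny;
        if ( moved < eps ) break;      // converged to the desired precision
    }
}

Like my hill-climbing code, this converges to the true geometric median, so for your variant of the problem it still needs the final nearest-input-point scan.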

Casas answered 17/10, 2012 at 12:38 Comment(18)
Please explain it, it is more than Greek to me!Shiest
It's finding the actual en.wikipedia.org/wiki/Geometric_median, I believe. Well, approximately.Clara
Still, there is no formal proof that the closest point in the dataset would be your answer.Clara
@Cupidvogel - I have to go now and to be honest it was 3 years ago and I don't remember much about it :). I will try to add more explanations later, but you might want to read the simulated annealing article I linked to, and also the wikipedia geometric median page.Casas
@Clara - true, I'm not sure about that.Casas
@Casas well, it's easy to prove it'd work in the 1-dimensional case. I think it would provide a reasonable approximation in 2D as well, but I'm not sure it's better (or faster) than using the O(n^2) exact algorithm.Clara
@Clara - it's definitely faster. O(n^2) would not have gotten accepted with such high n on that judge. Changing the eps value controls the approximation quality, but affects speed too obviously.Casas
I don't think it's hard to imagine that in 2-D the fitness function could be "steeper" in one direction than in another, and hence that the closest point in the set to your estimate is very slightly worse than some other point in the set that is very slightly further away. The question then is whether "very slight" remains within any tolerance for approximation in the original question -- and if not, can we bound it so as to only need to check a small number of points properly. Worst case is many points clustered in an almost-circle around your estimate.Fubsy
lVlad, please explain your answer!Shiest
@Cupidvogel I added an explanation, let me know if you have any questionsCasas
Please comment the code, as intensively as possible! I really need to understand it completely. I know it's asking too much, but please do me this favor, the least I can do in return is award you a handsome bounty tomorrow!Shiest
@Cupidvogel I have commented the code, let me know if there is anything else I can help with. Note that I still don't know if this answers your original question. I don't know how well choosing the point from your set that is closest to the point returned by my algorithm will work.Casas
Yes, but if we find the point like this, then we can do a linear scan of all points to determine which input point is closest to it. Will that be the answer?Shiest
@Cupidvogel - that's what I meant. My algorithm finds the geometric median, as defined on Wikipedia, so it's not necessarily part of your input. If you do a linear scan and pick the closest point in the set to the point my solution found, I think that will be the answer, but I am not sure it is always the case.Casas
Exactly. I too think that even though it looks like that will be the point, it's difficult to prove it rigorously.Shiest
This answer is not simulated annealing, it's a Newton search with first order Taylor series terms; sometimes known as a first order search; and the start position is the mean.Upbeat
Could it be done in a simpler way if all the points were on a straight line and we just needed to find some point from which the sum of distances to all points is the minimum?Gentille
I'm using a 3D point, but I don't know how to guess the value of dz, like const int dy[] = {0, 1, 0, -1}; What will be the value of dz? Thanks.Relief

It appears that the problem is difficult to solve in better than O(n^2) time when using Euclidean distances. However, the point that minimizes the sum of Manhattan distances to the other points, or the point that minimizes the sum of squares of Euclidean distances to the other points, can be found in O(n log n) time (assuming that multiplying two numbers is O(1)). Let me shamelessly copy/paste my solution for Manhattan distances from a recent post:

Create a sorted array of x-coordinates and for each element in the array compute the "horizontal" cost of choosing that coordinate. The horizontal cost of an element is the sum of distances to all the points projected onto the X-axis. This can be computed in linear time by scanning the array twice (once from left to right and once in the reverse direction). Similarly create a sorted array of y-coordinates and for each element in the array compute the "vertical" cost of choosing that coordinate.

Now for each point in the original array, we can compute the total cost to all other points in O(1) time by adding the horizontal and vertical costs. So we can compute the optimal point in O(n). Thus the total running time is O(n log n).
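
To make the two scans concrete, here is a sketch in C++ (the function and variable names are mine, for illustration; the vertical costs are computed the same way on the y-coordinates, and attributing the costs back to the original point order requires carrying the original indices through the sort):

#include <algorithm>
#include <vector>
using namespace std;

// cost[i] = sum over j of |xs[i] - xs[j]| for the sorted coordinates,
// computed with one left-to-right and one right-to-left scan
vector<double> horizontal_costs(vector<double> xs)
{
    sort(xs.begin(), xs.end());
    int n = xs.size();
    vector<double> cost(n, 0.0);
    double left = 0;  // running sum of distances to all points on the left
    for (int i = 1; i < n; ++i) {
        left += i * (xs[i] - xs[i-1]);   // i points lie to the left of xs[i]
        cost[i] += left;
    }
    double right = 0; // running sum of distances to all points on the right
    for (int i = n - 2; i >= 0; --i) {
        right += (n - 1 - i) * (xs[i+1] - xs[i]); // n-1-i points to the right
        cost[i] += right;
    }
    return cost;
}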

We can follow a similar approach for computing the point that minimizes the sum of squares of Euclidean distances to other points. Let the sorted x-coordinates be: x_1, x_2, x_3, ..., x_n. We scan this list from left to right and for each point x_i we compute:

l_i = sum of distances to all the elements to the left of x_i = (x_i - x_1) + (x_i - x_2) + ... + (x_i - x_{i-1}), and

sl_i = sum of squares of distances to all the elements to the left of x_i = (x_i - x_1)^2 + (x_i - x_2)^2 + ... + (x_i - x_{i-1})^2

Note that given l_i and sl_i we can compute l_{i+1} and sl_{i+1} in O(1) time as follows:

Let d = x_{i+1} - x_i. Then:

l_{i+1} = l_i + i*d and sl_{i+1} = sl_i + i*d^2 + 2*l_i*d

Thus we can compute all the l_i and sl_i in linear time by scanning from left to right. Similarly, for every element we can compute r_i (the sum of distances to all elements to the right) and sr_i (the sum of squares of distances to all elements to the right) in linear time. Adding sr_i and sl_i for each i gives the sum of squares of horizontal distances to all the elements, in linear time. Similarly, compute the sum of squares of vertical distances to all the elements.

Then we can scan through the original points array and find the point that minimizes the sum of squares of vertical and horizontal distances as before.
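
A sketch of the left-to-right scan for the squared distances (C++, with illustrative names), using the recurrence above:

#include <vector>
using namespace std;

// for sorted coordinates x[0..n-1], fill
//   l[i]  = sum over j < i of (x[i] - x[j])
//   sl[i] = sum over j < i of (x[i] - x[j])^2
// using the 0-indexed form of the recurrence, with d = x[i+1] - x[i]:
//   l[i+1] = l[i] + (i+1)*d and sl[i+1] = sl[i] + (i+1)*d^2 + 2*l[i]*d
void left_scans(const vector<double> &x, vector<double> &l, vector<double> &sl)
{
    int n = x.size();
    l.assign(n, 0.0);
    sl.assign(n, 0.0);
    for (int i = 0; i + 1 < n; ++i) {
        double d = x[i+1] - x[i];
        l[i+1]  = l[i]  + (i + 1) * d;
        sl[i+1] = sl[i] + (i + 1) * d * d + 2 * l[i] * d;
    }
}

The right-to-left scans are symmetric, and the y-dimension is handled the same way.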

Use answered 17/10, 2012 at 14:39 Comment(7)
I'm reminded of stories by professors of students answering questions from previous exams and hoping to get marks. It's nice to know that this can be done efficiently with other metrics, but it doesn't answer the question.Agnate
Of course it doesn't solve the original problem and I mentioned it in the very beginning. I thought the OP may be interested since this solution can be a good approximation to the original problem (because the metrics are similar) and can be found very efficiently.Use
@srbh.kmr Not necessarily. Consider 5 1-D points on the real line: 0, 0, a, a+b, a+b+c. The point that minimizes the sum of distances to other points is the one at a. But the point that minimizes the sum of squares of distances is the one at a+b, provided 2c > b + 4a.Use
Oh. okay, I got that. I just deleted my comment. Thanks for explaining anyways.Britain
Hi @krjampani, for the "sum of Manhattan distances", I didn't understand why we need the sort: why can't we just use a leftRightMemo[] and a rightLeftMemo[] to sum the Manhattan distances, and then for the point P(x, y), the x-coordinates sum = leftRightMemo[i] + rightLeftMemo[i] - x ?Externalize
What are leftRightMemo and rightLeftMemo? How do you calculate them?Use
but I think your last equation is incorrect. It should be: sl_{i+1} = sl_i + i*d^2 + 2*l_i*d. Note: l_i instead of i in the last part.Unpolled

As mentioned earlier, the type of algorithm to use depends on the way you measure distance. Since your question does not specify this measure, here are C implementations for both the Manhattan distance and the Squared Euclidean distance. Use dim = 2 for 2D points. Complexity O(n log n).

Manhattan distance

double * geometric_median_with_manhattan(double **points, int N, int dim) {
    // d is the global dimension index used by compare() below
    for (d = 0; d < dim; d++) {
        // sort the points by their d-th coordinate
        qsort(points, N, sizeof(double *), compare);
        double S = 0; // running sum of the values seen so far in this dimension
        for (int i = 0; i < N; i++) {
            double v = points[i][d];
            // accumulate this dimension's cost (up to the constant T) in the
            // extra slot points[i][dim], which must start out as 0
            points[i][dim] += (2 * i - N) * v - 2 * S;
            S += v;
        }
    }
    return min(points, N, dim);
}

Short explanation: We can sum the distance per dimension, 2 in your case. Say we have N points and the values in one dimension are v_0, .., v_(N-1), and T = v_0 + .. + v_(N-1). Then for each value v_i we have S_i = v_0 + .. + v_(i-1). Now we can express the Manhattan distance for this value by summing those on the left side: i * v_i - S_i, and the right side: T - S_i - (N - i) * v_i, which results in (2 * i - N) * v_i - 2 * S_i + T. Adding T to all elements does not change the order, so we leave that out. And S_i can be computed on the fly.

Here is the rest of the code that makes it into an actual C program:

#include <stdio.h>
#include <stdlib.h>

int d = 0;
int compare(const void *a, const void *b) {
    double diff = (*(double **)a)[d] - (*(double **)b)[d];
    // return the sign of the difference; casting the double difference
    // to int would wrongly truncate differences smaller than 1 to 0
    return (diff > 0) - (diff < 0);
}

double * min(double **points, int N, int dim) {
    double *min = points[0];
    for (int i = 0; i < N; i++) {
        if (min[dim] > points[i][dim]) {
            min = points[i];
        }
    }
    return min;
}

int main(int argc, const char * argv[])
{
    // example 2D coordinates with an additional 0 value
    double a[][3] = {{1.0, 1.0, 0.0}, {3.0, 1.0, 0.0}, {3.0, 2.0, 0.0}, {0.0, 5.0, 0.0}};
    double *b[] = {a[0], a[1], a[2], a[3]};
    double *min = geometric_median_with_manhattan(b, 4, 2);
    printf("geometric median at {%.1f, %.1f}\n", min[0], min[1]);
    return 0;
}

Squared Euclidean distance

double * geometric_median_with_square(double **points, int N, int dim) {
    for (d = 0; d < dim; d++) {
        qsort(points, N, sizeof(double *), compare);
        double T = 0;
        for (int i = 0; i < N; i++) {
            T += points[i][d];
        }
        for (int i = 0; i < N; i++) {
            double v = points[i][d];
            points[i][dim] += v * (N * v - 2 * T);
        }
    }
    return min(points, N, dim);
}

Shorter explanation: Pretty much the same approach as the previous one, but with a slightly more complicated derivation. With TT = v_0^2 + .. + v_(N-1)^2, the sum of squared distances for v_i is TT + N * v_i^2 - 2 * v_i * T. Again TT is added to all elements, so it can be left out. More explanation on request.

Manatarms answered 19/11, 2013 at 9:41 Comment(0)

I implemented the Weiszfeld method (I know it's not what you are looking for, but it may help to approximate your point); the complexity is O(N*M/k), where N is the number of points, M is the dimension of the points (in your case 2), and k is the desired error:

https://github.com/j05u3/weiszfeld-implementation

Biquadrate answered 3/9, 2015 at 17:36 Comment(0)

Step 1: Sort the point collection by the x-dimension (O(n log n))
Step 2: Calculate the x-distance between each point and all points TO THE LEFT of it:

xLDist[0] := 0
for i := 1 to n - 1
       xLDist[i] := xLDist[i-1] + ( ( p[i].x - p[i-1].x ) * i)

Step 3: Calculate the x-distance between each point and all points TO THE RIGHT of it:

xRDist[n - 1] := 0
for i := n - 2 downto 0
       xRDist[i] := xRDist[i+1] + ( ( p[i+1].x - p[i].x ) * (n - 1 - i) )

Step 4: Sum both up and you'll get the total x-distance from each point to the other N-1 points

for i := 0 to n - 1
       p[i].xDist = xLDist[i] + xRDist[i]

Repeat Step 1,2,3,4 with the y-dimension to get p[i].yDist

The point with the smallest sum of xDist and yDist is the answer

Total complexity: O(n log n)

Answer in C++

Further explanation:
The idea is to reuse the already computed total distance of the previous point.
Let's say we have 4 points A, B, C, D sorted on a line; the total left distance of D to the others before it is:

AD + BD + CD = (AC + CD) + (BC + CD) + CD = AC + BC + 3CD

where (AC + BC) is the total left distance of C to the others before it. We take advantage of this and only need to compute ldist(C) + 3CD.
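
A compact C++ sketch of the x-dimension part of the steps above (the y-dimension is identical; the struct and field names are mine, for illustration):

#include <algorithm>
#include <vector>
using namespace std;

struct Pt { double x, y, xDist, yDist; };

// fills p[i].xDist with the total x-distance from p[i] to all other points
void computeXDist(vector<Pt> &p)
{
    int n = p.size();
    // Step 1: sort by the x-dimension
    sort(p.begin(), p.end(), [](const Pt &a, const Pt &b){ return a.x < b.x; });
    vector<double> xL(n, 0.0), xR(n, 0.0);
    // Step 2: distance to all points on the left
    for (int i = 1; i < n; ++i)
        xL[i] = xL[i-1] + (p[i].x - p[i-1].x) * i;
    // Step 3: distance to all points on the right
    for (int i = n - 2; i >= 0; --i)
        xR[i] = xR[i+1] + (p[i+1].x - p[i].x) * (n - 1 - i);
    // Step 4: total x-distance
    for (int i = 0; i < n; ++i)
        p[i].xDist = xL[i] + xR[i];
}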

Homovec answered 5/4, 2016 at 1:23 Comment(0)

You can solve the problem as a convex program: the objective function (the sum of Euclidean distances) is convex, since each individual distance is convex and a sum of convex functions is convex. The convex program can be solved using an iterative method such as L-BFGS. The cost of each iteration is O(N), and usually the number of required iterations is not large. One important point for reducing the number of required iterations is that we know the optimum answer is one of the points in the input, so the optimization can be stopped when its answer becomes close to one of the input points.
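
To make the idea concrete, here is a rough sketch using plain gradient descent rather than L-BFGS, to keep it dependency-free; the step schedule, iteration cap, and tolerances are illustrative, not tuned. The gradient of the objective at P is the sum of the unit vectors pointing from each input point toward P:

#include <cmath>
#include <vector>
using namespace std;

struct P2 { double x, y; };

// minimize f(P) = sum_i |P - pts[i]| by gradient descent;
// grad f(P) = sum_i (P - pts[i]) / |P - pts[i]|
P2 approx_median(const vector<P2> &pts)
{
    P2 p = {0, 0};
    for (const P2 &q : pts) { p.x += q.x; p.y += q.y; }
    p.x /= pts.size();
    p.y /= pts.size();                 // start from the centroid

    double step = 1.0;
    for (int it = 0; it < 10000; ++it) {
        double gx = 0, gy = 0;
        for (const P2 &q : pts) {
            double dx = p.x - q.x, dy = p.y - q.y;
            double d = sqrt(dx*dx + dy*dy);
            if (d > 1e-12) { gx += dx / d; gy += dy / d; }
        }
        double glen = sqrt(gx*gx + gy*gy);
        if (glen < 1e-9) break;        // gradient vanished: at the optimum
        p.x -= step * gx / glen;       // move against the gradient
        p.y -= step * gy / glen;
        step *= 0.999;                 // slowly shrink the step
    }
    return p;
}

The early-stopping check against the input points (stop once the estimate is within some tolerance of one of them) is omitted here for brevity.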

Ringed answered 19/10, 2012 at 20:51 Comment(1)
Please elaborate, I didn't understand even an iota of what you just said! A pseudo-code, along with some explanation and commenting, will be great.Shiest

The answer we need to find is the geometric median; under the Manhattan distance it reduces to the coordinate-wise median.

Code in C++ (note that it prints the minimum distance sum, not the point itself):

#include <bits/stdc++.h>
using namespace std;
int main()
{
    int n;
    cin >> n;

    int a[n],b[n];
    for(int i=0;i<n;i++) 
        cin >> a[i] >> b[i];
    int res = 0;
    sort(a,a+n);
    sort(b,b+n);

    int m1 = a[n/2];
    int m2 = b[n/2];

    for(int i=0;i<n;i++) 
        res += abs(m1 - a[i]);
    for(int i=0;i<n;i++) 
        res += abs(m2 - b[i]);

    cout << res << '\n';
}
Siphonostele answered 27/1, 2021 at 18:18 Comment(3)
what is the answer to Is there any efficient algorithm to solve this problem, other than the naive O(N^2) one? ?Menses
This is O(n log n) complexity.Siphonostele
I fixed your code so that at least it compiles. It would still benefit from some sort of explanation.Skit
