Efficient minimal spanning tree in metric space

I have a large set of points (n > 10000 in number) in some metric space (e.g. equipped with Jaccard Distance). I want to connect them with a minimal spanning tree, using the metric as the weight on the edges.

  • Is there an algorithm that runs in less than O(n^2) time?
  • If not, is there an algorithm that runs in less than O(n^2) average time (possibly using randomization)?
  • If not, is there an algorithm that runs in less than O(n^2) time and gives a good approximation of the minimum spanning tree?
  • If not, is there a reason why such an algorithm can't exist?

Thank you in advance!

Edit for the posters below: Classical algorithms for finding a minimal spanning tree don't work here. They have a factor of |E| in their running time, but in my case |E| is Θ(n^2) since I actually consider the complete graph. I also don't have enough memory to store all of the more than 49,995,000 possible edges.

Barnabas answered 17/1, 2011 at 17:51 Comment(4)
Have you read at least this en.wikipedia.org/wiki/Minimum_spanning_tree#Algorithms?Dialytic
@Nikolai: of course I did. And many papers too.Barnabas
You wouldn't need to "store" your 10^8 edges. You would need a bit vector to be able to mark visited edges, but this bit vector would only use 12 MB or so, which seems affordable as far as memory is concerned.Wellmeaning
@Sven: a) 10000 vertices is a lower bound. b) Kruskal needs them to be stored and sorted.Barnabas

Apparently, according to Estimating the weight of metric minimum spanning trees in sublinear time, there is no deterministic o(n^2) algorithm (note: little-oh, which is presumably what you mean by "less than O(n^2)"). That paper also gives a sublinear randomized algorithm for approximating the weight of the metric minimum spanning tree.

Also look at this paper: An optimal minimum spanning tree algorithm, which gives a provably optimal algorithm. Curiously, the paper also notes that the exact complexity of that optimal algorithm is not yet known!

The references in the first paper should be helpful and that paper is probably the most relevant to your question.

Hope that helps.

Lach answered 17/1, 2011 at 18:15 Comment(7)
Thanks, but these papers are not freely available. Also, what you write here sounds a bit contradictory. Is 'linear time' here measured in the number of edges or the number of vertices?Barnabas
@ybungalo: Linear time in vertices. Sorry I can't help you with getting those papers. You have the titles, you should be able to find them in some decent library.Lach
The first paper is available for free from Brown University itself. Unfortunately it's linear in the number of edges, as stated in the 4th paragraph of the introduction. Well, it can't do better without knowing that the weights come from a metric, since it must read all the edges.Barnabas
No need for a library. I don't know about other fields, but basically all CS papers can be found for free. Go to scholar.google.com and type in the title of the paper. On the correct result there will be a small link "see all xx versions". Click that. This will take you to results with that title which will almost always contain pdf or ps links. You can find all three of these papers this way.Otherness
@Ybung: My mistake (deleted from answer), glad you actually found the paper. Your logic of having to read all edges is bogus, though. It is a randomized algorithm, after all. Anyway good luck with your search.Lach
@Moron: nope, my logic is right. It's randomized to give an expected linear running time, but it does find an optimal solution, not an approximation (consider a quicksort with random pivot). To find an optimal solution you always need to consider all the edges. By the way that's also the reason the last paper doesn't help, it's for general graphs too. However, the first one says that there is no known algorithm even for euclidean spaces... and this paper is just 2 years old... :(Barnabas
@ybung: I was talking about general randomized algorithms (which need not always give the optimal solution), not in reference to that paper, which we had already established was linear in edges. Here is a randomized algorithm which gives the minimum of an array with probability 1/2: choose n/2 random elements and pick the minimum of those. It does not look at all the elements. If you always expect an optimal solution, then yes, obviously you have to look at all of them, as you always need to find the edge with minimum weight. Anyway...Lach
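
As a toy illustration of the sampling argument in the comment above, here is a hypothetical Python snippet (sampled_min is an invented name, not from the thread):

    import random

    def sampled_min(a):
        # Look at only half the elements, chosen at random.  The true
        # minimum lands in the sample with probability >= 1/2, so this
        # returns min(a) at least half the time while reading only n/2
        # elements.
        sample = random.sample(a, max(1, len(a) // 2))
        return min(sample)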

When I was looking at a very similar problem 3-4 years ago, I could not find an ideal solution in the literature I looked at.

The trick I think is to find a "small" subset of "likely good" edges, which you can then run plain old Kruskal on. In general, it's likely that many MST edges can be found among the set of edges that join each vertex to its k nearest neighbours, for some small k. These edges might not span the graph, but when they don't, each component can be collapsed to a single vertex (chosen randomly) and the process repeated. (For better accuracy, instead of picking a single representative to become the new "supervertex", pick some small number r of representatives and in the next round examine all r^2 distances between 2 supervertices, choosing the minimum.)
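
A minimal sketch of this round-based scheme, assuming a user-supplied metric dist(a, b); approx_mst and union_find are invented names. The k-nearest-neighbour search is brute force here for clarity, which is exactly the part a real implementation would replace with a metric index to get below quadratic time:

    import random

    def union_find(n):
        # Union-find with path halving, for Kruskal's algorithm.
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        def union(x, y):
            rx, ry = find(x), find(y)
            if rx == ry:
                return False
            parent[rx] = ry
            return True
        return find, union

    def approx_mst(points, dist, k=5):
        # Returns a list of (u, v, weight) edges forming a spanning
        # tree that is hopefully close to the true MST.
        n = len(points)
        find, union = union_find(n)
        tree = []
        reps = list(range(n))          # one representative per component
        while len(reps) > 1:
            # Candidate edges: each representative to its k nearest
            # other representatives.  Brute force here (O(m^2) distance
            # calls per round); substitute a metric index such as a
            # vantage-point tree to go subquadratic.
            candidates = []
            for u in reps:
                near = sorted((v for v in reps if v != u),
                              key=lambda v: dist(points[u], points[v]))[:k]
                for v in near:
                    candidates.append((dist(points[u], points[v]), u, v))
            # Kruskal on the candidate edges only.
            candidates.sort()
            for w, u, v in candidates:
                if union(u, v):
                    tree.append((u, v, w))
            # Collapse each surviving component to one random representative.
            by_root = {}
            for u in reps:
                by_root.setdefault(find(u), []).append(u)
            reps = [random.choice(group) for group in by_root.values()]
        return tree

Every representative contributes an edge to its nearest other representative, so at least one merge happens per round and the loop terminates; the result always spans all n points but may be heavier than the true MST.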

k-nearest-neighbour algorithms are quite well-studied for the case where objects can be represented as vectors in a finite-dimensional Euclidean space, so if you can find a way to map your objects down to that (e.g. with multidimensional scaling) then you may have luck there. In particular, mapping down to 2D allows you to compute a Voronoi diagram, and MST edges will always be between adjacent faces. But from what little I've read, this approach doesn't always produce good-quality results.
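
If the mapping does land you in 2D, the Voronoi/Delaunay route looks like this (a sketch assuming SciPy is available; euclidean_mst_2d is an invented name). Note that it computes the exact Euclidean MST of the embedded points, which is only an approximation of the MST under the original metric:

    import numpy as np
    from scipy.spatial import Delaunay
    from scipy.sparse import coo_matrix
    from scipy.sparse.csgraph import minimum_spanning_tree

    def euclidean_mst_2d(points):
        # The Euclidean MST is a subgraph of the Delaunay triangulation
        # (the dual of the Voronoi diagram), so only O(n) candidate
        # edges need to be weighed instead of all n(n-1)/2.
        pts = np.asarray(points, dtype=float)
        tri = Delaunay(pts)
        edges = set()
        for a, b, c in tri.simplices:      # deduplicate triangle edges
            edges.update({(min(a, b), max(a, b)),
                          (min(b, c), max(b, c)),
                          (min(a, c), max(a, c))})
        rows, cols = zip(*edges)
        weights = np.linalg.norm(pts[list(rows)] - pts[list(cols)], axis=1)
        graph = coo_matrix((weights, (rows, cols)),
                           shape=(len(pts), len(pts)))
        return minimum_spanning_tree(graph)  # sparse matrix of tree edges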

Otherwise, you may find clustering approaches useful: Clustering large datasets in arbitrary metric spaces is one of the few papers I found that explicitly deals with objects that are not necessarily finite-dimensional vectors in a Euclidean space, and which gives consideration to the possibility of computationally expensive distance functions.

Fosse answered 17/1, 2011 at 19:12 Comment(8)
This summarizes what I was thinking -- unless you can somehow use your distance metric to rule out large numbers of edges, you cannot avoid the O(n^2) running time.Postoperative
@Nathan: If you map down, you can avoid n^2 distance calculations for the k-nearest-neighbour query because you build some sort of index (in less than O(n^2) time).Fosse
@j_random: ironically I wanted to use it to build slim-trees... :PBarnabas
@j_random_hacker: Right -- I was referring to the minimum spanning tree bound, not for the nearest-neighbor query. It is only because/if neighbor queries can be done quickly that you can do better than n^2.Postoperative
@Nathan: Well I think O(n^2) may be required for an exact MST (it is a complete graph after all), but my main suggestion was that you might be able to get something pretty close to the MST by examining just O(kn) edges for some small k. (The OP is happy with an approximation.)Fosse
@j_random_hacker: In the worst-case yes. But if you simply increase k if more edges are needed from a particular vertex, for many graphs you could probably do much better than n^2. The worst case would be when all edges are of approximately equal length.Postoperative
@Nathan @j_random: for a general MST it's impossible to do better than O(n^2), but for a metric MST it may be possible. The point is that the triangle inequality lets you deduce bounds on the weights of some edges, so you don't need to process them at all. The question is HOW to do it.Barnabas
It's like the Voronoi diagram you mentioned: it uses the knowledge that it's working in Euclidean space and thus improves the performance. In the plane you can build it in linearithmic time and use it for the MST.Barnabas
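
To make the triangle-inequality pruning in the comments above concrete, here is a minimal single-pivot sketch (hypothetical Python with invented names; vantage-point trees apply the same bound recursively over many pivots). Distances to a pivot p, computed once, give a free lower bound |d(q, p) - d(p, x)| <= d(q, x) on every other distance:

    def pruned_nearest(query, points, dist, pivot):
        # Nearest neighbour of `query`, skipping every distance call
        # that the triangle inequality proves cannot win.
        d_qp = dist(query, pivot)
        d_px = [dist(pivot, x) for x in points]   # one-off preprocessing
        best, best_x = float("inf"), None
        for x, dpx in zip(points, d_px):
            if abs(d_qp - dpx) >= best:           # lower bound rules x out
                continue
            d = dist(query, x)
            if d < best:
                best, best_x = d, x
        return best_x, best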
