Why are KD-trees so damn slow for nearest neighbor search in point sets?

I am using CGAL's (latest) KD-tree implementation to search for nearest neighbors in point sets. Wikipedia and other resources also suggest that KD-trees are the way to go. But somehow they are far too slow, and Wikipedia even lists their worst-case query time as O(n), which is far from ideal.

[BEGIN-EDIT] I am now using "nanoflann", which is about 100-1000 times faster than the equivalent in CGAL for K-neighbor search. And I am using "Intel Embree" for raycasting, which is about 100-200 times faster than CGAL's AABB trees. [END-EDIT]
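
For reference, here is a minimal sketch of the kind of nanoflann setup I mean (the PointCloud adaptor follows nanoflann's documented dataset-adaptor interface; the names and the way the photon positions are stored are just placeholders, not my actual code):

#include <nanoflann.hpp>
#include <array>
#include <vector>
#include <cstddef>

struct PointCloud
{
    std::vector<std::array<float, 3>> pts;   // photon hit points

    // nanoflann dataset-adaptor interface
    size_t kdtree_get_point_count() const { return pts.size(); }
    float kdtree_get_pt(size_t idx, size_t dim) const { return pts[idx][dim]; }
    template <class BBOX>
    bool kdtree_get_bbox(BBOX&) const { return false; }  // let nanoflann compute the bounds
};

typedef nanoflann::KDTreeSingleIndexAdaptor<
    nanoflann::L2_Simple_Adaptor<float, PointCloud>,
    PointCloud,
    3> PhotonKdTree;  // 3 = dimensionality

void knnExample(const PointCloud& cloud, const float query[3])
{
    PhotonKdTree index(3, cloud, nanoflann::KDTreeSingleIndexAdaptorParams(10 /* max leaf size */));
    index.buildIndex();

    const size_t k = 50;
    std::vector<size_t> indices(k);
    std::vector<float> distsSq(k);            // nanoflann reports squared L2 distances
    index.knnSearch(query, k, indices.data(), distsSq.data());
}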

My task looks like this:

I have a HUGE point set, say up to a few 100 mio. points!! And their distribution is on the surfaces of triangulated geometry (yes, a photon tracer). So one could say that their distribution is 2D in 3D space, because it is sparse in 3D but dense when looking at the surfaces... Could this be the problem? Because to me this seems to trigger the worst-case behavior of a KD-tree, which could probably deal much better with dense 3D point sets...

CGAL is quite good at what it does, so I have some doubt that their implementation just sucks. The AABB tree I am using for raytracing burns through a straight billion raytraces in a few minutes... That is remarkable, I guess. But their KD-tree, on the other hand, can't even deal with a mio. points and 250k samples (point queries) in any reasonable time...

I came up with two solutions which kick the crap out of KD-trees:

1) Use texture maps to store the photons in a linked list on the geometry. This is always an O(1) operation, since you have to do the raycast anyway...

2) Use view-dependent sliced hash sets. That is, the farther away you get, the coarser the hash sets become. So basically you can think of a 1024x1024x1024 raster in NDC coordinates, but with hash sets to save memory in sparse areas. This has O(1) complexity and can be parallelized efficiently, both for inserts (micro-sharding) and queries (lock-free). A rough sketch of the idea follows below.

The former solution has the disadvantage that it is close to impossible to average over neighboring photon lists, which is important in darker regions to avoid noise. The latter solution doesn't have this problem and should be on par feature-wise with KD-trees, except that it has O(1) worst-case performance, lol.
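
To make the second idea concrete, here is a rough sketch of the hashed raster, simplified to a single fixed cell size instead of the view-dependent slicing (all names are made up, and the caller still filters the gathered photons by exact distance):

#include <unordered_map>
#include <vector>
#include <cmath>
#include <cstdint>

struct Photon;                         // defined elsewhere
struct Vec3 { float x, y, z; };

struct PhotonHashGrid
{
    float cellSize;                    // should be >= the query radius
    std::unordered_map<uint64_t, std::vector<const Photon*>> cells;

    explicit PhotonHashGrid(float cell) : cellSize(cell) {}

    // Pack the integer cell coordinates into a single 64-bit hash key.
    static uint64_t key(int ix, int iy, int iz)
    {
        return (uint64_t(uint32_t(ix)) << 42) ^ (uint64_t(uint32_t(iy)) << 21) ^ uint64_t(uint32_t(iz));
    }

    int cellOf(float v) const { return int(std::floor(v / cellSize)); }

    // O(1) amortized insert.
    void insert(const Vec3& p, const Photon* photon)
    {
        cells[key(cellOf(p.x), cellOf(p.y), cellOf(p.z))].push_back(photon);
    }

    // Gather all photons in the 3x3x3 block of cells around the query point;
    // with cellSize >= radius this block covers every candidate neighbor.
    void gather(const Vec3& q, std::vector<const Photon*>& out) const
    {
        const int cx = cellOf(q.x), cy = cellOf(q.y), cz = cellOf(q.z);
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz)
                {
                    auto it = cells.find(key(cx + dx, cy + dy, cz + dz));
                    if (it != cells.end())
                        out.insert(out.end(), it->second.begin(), it->second.end());
                }
    }
};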

So what do you think? Bad KD-tree implementation? If not, is there something "better" than a KD-tree for bounded nearest-neighbor queries? I mean, I have nothing against my second solution above, but a "proven" data structure that delivers similar performance would be nicer!

Thanks!

Here is the code I used (not compilable as-is, though):

#include "stdafx.h"
#include "PhotonMap.h"

#pragma warning (push)
    #pragma warning (disable: 4512 4244 4061)
    #pragma warning (disable: 4706 4702 4512 4310 4267 4244 4917 4820 4710 4514 4365 4350 4640 4571 4127 4242 4350 4668 4626)
    #pragma warning (disable: 4625 4265 4371)

    #include <CGAL/Simple_cartesian.h>
    #include <CGAL/Orthogonal_incremental_neighbor_search.h>
    #include <CGAL/basic.h>
    #include <CGAL/Search_traits.h>
    #include <CGAL/point_generators_3.h>

#pragma warning (pop)

// Lightweight 3D point wrapper for CGAL's search traits; keeps a pointer back to the original photon.
struct PhotonicPoint
{
    float vec[3];
    const Photon* photon;

    PhotonicPoint(const Photon& photon) : photon(&photon) 
    { 
        vec[0] = photon.hitPoint.getX();
        vec[1] = photon.hitPoint.getY();
        vec[2] = photon.hitPoint.getZ();
    }

    PhotonicPoint(Vector3 pos) : photon(nullptr) 
    { 
        vec[0] = pos.getX();
        vec[1] = pos.getY();
        vec[2] = pos.getZ();
    }

    PhotonicPoint() : photon(nullptr) { vec[0] = vec[1] = vec[2] = 0; }

    float x() const { return vec[0]; }
    float y() const { return vec[1]; }
    float z() const { return vec[2]; }
    float& x() { return vec[0]; }
    float& y() { return vec[1]; }
    float& z() { return vec[2]; }

    bool operator==(const PhotonicPoint& p) const
    {
        return (x() == p.x()) && (y() == p.y()) && (z() == p.z()) ;
    }

    bool operator!=(const PhotonicPoint& p) const 
    { 
        return ! (*this == p); 
    }
}; 

// Tell CGAL which number types the custom point type uses.
namespace CGAL 
{
    template <>
    struct Kernel_traits<PhotonicPoint> 
    {
        struct Kernel 
        {
            typedef float FT;
            typedef float RT;
        };
    };
}

// Functor that gives CGAL iterator access to a point's Cartesian coordinates.
struct Construct_coord_iterator
{
    typedef const float* result_type;

    // Begin iterator over the three coordinates.
    const float* operator()(const PhotonicPoint& p) const
    { 
        return static_cast<const float*>(p.vec); 
    }

    // Past-the-end iterator (the int parameter only selects this overload).
    const float* operator()(const PhotonicPoint& p, int) const
    { 
        return static_cast<const float*>(p.vec+3); 
    }
};

typedef CGAL::Search_traits<float, PhotonicPoint, const float*, Construct_coord_iterator> Traits;
typedef CGAL::Orthogonal_incremental_neighbor_search<Traits> NN_incremental_search;
typedef NN_incremental_search::iterator NN_iterator;
typedef NN_incremental_search::Tree Tree;


struct PhotonMap_Impl
{
    Tree tree;

    PhotonMap_Impl(const PhotonAllocator& allocator) : tree()
    {
        // Insert the photons list by list, capped at the number of photons
        // that were actually allocated, then build the tree once.
        int counter = 0, maxCount = allocator.GetAllocationCounter();
        for(auto& list : allocator.GetPhotonLists())
        {
            int listLength = std::min((int)list.size(), maxCount - counter);
            counter += listLength; 
            tree.insert(std::begin(list), std::begin(list) + listLength);
        }

        tree.build();
    }
};

PhotonMap::PhotonMap(const PhotonAllocator& allocator)
{
    impl = std::make_shared<PhotonMap_Impl>(allocator);
}

void PhotonMap::Sample(Vector3 where, float radius, int minCount, std::vector<const Photon*>& outPhotons)
{
    NN_incremental_search search(impl->tree, PhotonicPoint(where));
    int count = 0;

    // The incremental search enumerates photons by increasing distance to the
    // query point; p.second is the distance CGAL reports for p.first
    // (squared, for the default Euclidean distance).
    for(auto& p : search)
    {
        // Stop once we are past the search radius and have gathered at least
        // minCount photons, or once we hit the hard cap of 50 photons.
        if(((p.second > radius) && (count > minCount)) || (count > 50))
            break;

        count++;
        outPhotons.push_back(p.first.photon);
    }
}
Limoges answered 27/2, 2013 at 23:53 Comment(18)
Is this even slower? cgal.org/Manual/latest/doc_html/cgal_manual/Triangulation_3/…Malnutrition
@MarcGlisse: Could you elaborate a bit on what that means? I do a Delaunay triangulation to get fast point queries? I am not sure if a triangulation of a billion points is feasible.Limoges
Could you show how you are using the kd-tree exactly? The Spatial Sorting package describes many options...Malnutrition
I added the code. Please note that performance is lost only in CGAL code, not in my code. So maybe I am doing something wrong with the way I use this spatial sort...Limoges
why aren't you using Orthogonal_k_nearest_neighbor if you know that you don't want more than 50 neighbors?Odontoblast
@sloriot: At least I can test if that makes it faster but I doubt it. I only put this limit in to make sure that the alg doesn't spend too much time per sample. But even like that it is terribly slow, while my own data-structures handle thousand photons and more per sample with 1000x overall speedup lol.Limoges
Shouldn't that be (p.second > radius) || to quit your search loop? Otherwise, you will spend a lot of time searching dead space.Scalise
No, there shall be a minimum number of samples. And they will always be around; there are enough photons!Limoges
If you can provide a data-set with query points, we can have a look at it and see why it is slow and fix it.Odontoblast
@sloriot: I will come back to that offer later. The raytracer will be OpenSource so it shouldn't be too hard then to figure it out for you :). I will just leave the CGAL KD-tree in there as a "drop-in" replacement for my own data-structure...Limoges
Could you give some timings for building the tree and for each query? You said your code is 1000x faster and I find this really strange (the last time I tried, CGAL was even faster than ANN). I suspect there is something independent from the search that is consuming resources.Odontoblast
@sloriot: I don't have time for that now. Maybe next week. But it is not that surprising, since my data-structure is O(1) for all queries, and if you realise that I am talking about a scale of 100 mio. points and millions of queries... Well. PS: Of course I didn't test it with CGAL at that scale; it would still be running when the universe ran out of energy, I suppose.Limoges
@sloriot: Back to your request. I am now using "nanoflann", which is about 100-1000 times faster than the equivalent in CGAL for K-neighbor search. And I am using "Intel Embree" for raycasting, which is about 100-200 times faster than CGAL's AABB trees.Limoges
We did a simple benchmark on 1M points with 50 nearest neighbors (eps=0) and the performance was equivalent. So please provide a benchmark showing this discrepancy.Odontoblast
Could you please give me the code for this then? Maybe I missed something in my CGAL code... Well I posted the code above. That is the code that is 1000 times slower. I would like to see yours to see if I did something wrong. Right now I have no time for additional benchmarks.Limoges
Sure, here it is gist.github.com/sloriot/5291656Odontoblast
@sloriot: Thanks for the code. Seems you made quite some effort ;). I will try that out within a week or so. I will also try to export my sample data and queries.Limoges
I did a benchmark with 500k points generated from a range scanner comparing nanoFlann and CGAL, using random queries within the bounding box of the points, and except for trivial amounts of queries CGAL consistently won. Surprisingly CGAL also had vastly faster tree build speed. The problem with your code is probably your usage of NN_incremental_search instead of Orthogonal_k_nearest_neighbor.Ulotrichous

From my experience, implementation quality varies widely, unfortunately. I have, however, never looked at the CGAL implementation.

The worst case for a k-d-tree usually arises when, due to incremental changes, it becomes too unbalanced and should be rebuilt.

However, in general such trees are most efficient when you don't know the data distribution.

In your case it sounds as if a simple grid-based approach may be the best choice. If you want, you can consider a texture to be a dense 2D grid. So maybe you can find a 2D projection where a grid works well, and then intersect with this projection.

Scalise answered 1/3, 2013 at 23:25 Comment(6)
Well you know, I would have never considered KD-trees if they weren't recommended in a book about photon-tracing for this task. So I was assuming they would be suited and wondering if maybe I am doing something wrong!Limoges
Was it a practical book, or a theoretical book? Plus, the implementation really matters. I have seen kd-trees perform really well, and I've seen them add little to the performance, unfortunately.Scalise
The book definitely had a practical aim, because it ships with a photon tracer and uses source code to illustrate everything. That's why I bought it ;). But of course it also covers the theory in great detail. So then, do you know a good KD-tree implementation (or point-set search implementation, since I don't care about the data structure)?Limoges
I'm actually less convinced that a KD-tree is the appropriate thing for your use case. I'm not sure why you need to call that slow sample method (but I'm not into photon tracing). Shouldn't you be simulating photon rays hitting some geometry (and eventually your camera), instead of managing a huge data structure of photons and doing nearest-neighbor queries there?Scalise
"eventually hitting your camera"... No that won't work. You trace photons until the "last mile". And after that you shoot sample rays from the camera into the scene, collecting the visible photons. You need them all for various reasons. I am not gonna write a book here but in short you need to be able to track the path back to all light sources and you need all that directional and color info in each photon to sample each screen point according to a gauss like distribution of how each photons contributes to the image with its specific material shaders (dependening on what it hit)Limoges
@Limoges pbrt, is that the book that you are referring to?Leshalesher

Answers are not the place to ask questions, but then your question is not really a question either; it is a statement that CGAL's kd-tree sucks.

Reading 1.8 mio points of a geological data model and computing the 50 closest points for each of these points gives the following performance on my Dell Precision, Windows 7, 64-bit, VC10:

  • Reading the points from a file: 10 sec
  • Construction of the tree: 1.3 sec
  • 1.8 mio queries reporting the 50 closest points: 17 sec

Do you get similar performance? Would you expect a kd-tree to be faster?

Also, I am wondering where your query points are, that is, close to the surface or close to the skeleton.
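
For reference, the kind of bulk query loop described above looks roughly like this with CGAL's built-in 3D search traits (reading the points from the file and the choice of k = 50 are placeholders):

#include <CGAL/Simple_cartesian.h>
#include <CGAL/Search_traits_3.h>
#include <CGAL/Orthogonal_k_neighbor_search.h>
#include <vector>

typedef CGAL::Simple_cartesian<double>                  Kernel;
typedef Kernel::Point_3                                 Point_3;
typedef CGAL::Search_traits_3<Kernel>                   TreeTraits;
typedef CGAL::Orthogonal_k_neighbor_search<TreeTraits>  Neighbor_search;
typedef Neighbor_search::Tree                           Tree;

void query_all(const std::vector<Point_3>& points)
{
    Tree tree(points.begin(), points.end());      // built lazily on the first query

    for (const Point_3& q : points)
    {
        Neighbor_search search(tree, q, 50);       // 50 closest points to q
        for (auto it = search.begin(); it != search.end(); ++it)
        {
            // it->first is the neighbor, it->second its squared distance to q
        }
    }
}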

Araarab answered 5/3, 2013 at 15:24 Comment(6)
Yep, these performance results (although they won't match exactly, they are in the same "league") are what I get. And it is at least 1000 times too slow to be feasible for me (assuming it scales linearly up to 100 mio points, which it won't LOL)...Limoges
So what I need is 100 mio. points (that is, my already computed ray intersections), 2 mio. queries for Full HD (maybe even multiples of that for supersampling/antialiasing), with unlimited neighbors but a radius that is approx. 1/1000th of the scene width, in maybe 20 seconds. Everything else is a waste of time, literally.Limoges
Alright. Somebody explain to me just what the hell a mio is. Is this some bizarre way to represent the metric M, i.e. 1e6, 1,000,000? Please use that. People will then take you seriously. But what do I know. It could be a local convention or something.Chickabiddy
@StevenLu: Yeah, it could be an alien after all... A little fantasy won't hurt, and we don't all live in the US and know that they don't know "mio." as an abbreviation for "million".Limoges
I also don't live in the US or any English-speaking country, and I never knew "mio." is an abbreviation for million. Which country uses this?Merger
@Limoges "Mio" is most certainly not a U.S. abbreviation for million (at least not in the midwest). According to Wikipedia, "mio" is used in "German, Swiss and Dutch markets".Circumcise

I have done some research into fast KD-tree implementations a few months ago, and I agree with Anony-Mousse that quality (and "weight" of libraries) varies strongly. Here are some of my findings:

kdtree2 is a little-known and pretty straightforward KD-tree implementation I found to be quite fast for 3D problems, especially if you allow it to copy and re-sort your data. Also, it is small and very easy to incorporate and adapt. Here is a paper on the implementation by the author (don't be put off by the mention of Fortran in the title). This is the library I ended up using. My colleagues benchmarked its speed for 3D points against VLFeat's KD-trees and another library I don't remember (maybe FLANN, see below), and it won.

FLANN has a reputation of being fast, and is used and recommended quite often recently. It aims at the higher dimensional case, where it offers approximate algorithms, but is also used in the Point Cloud Library which deals with 3D problems.

I did not experiment with CGAL since I found the library to be too heavyweight. I agree that CGAL has a good reputation, but I feel it occasionally suffers from oversophistication.

Bevatron answered 5/3, 2013 at 17:14 Comment(1)
Thanks for this overview! But I think the issue seems to lie in the data structure, not its implementation. So the KD-tree itself seems to be too slow, regardless of how well you implement it (usually you only gain a constant factor anyway).Limoges

Take a look at the ApproxMVBB library under the MPL licence:

https://github.com/gabyx/ApproxMVBB:

The kdTree implementation should be comparable to PCL (FLANN) and might be even faster (tests against PCL seemed faster with my implementation!).

Disclaimer: I am the owner of this library, and by no means does this library claim to be any faster; serious performance tests have not been conducted yet, but I am using this library successfully in granular rigid body dynamics where speed is king! However, this library is very small, and the kdTree implementation is very generic (see the examples) and lets you have custom splitting heuristics and other fancy stuff :-).

Similar improvements and considerations as in nanoflann (direct data access etc., generic data, n-dimensional) are implemented... (see the KdTree.hpp header).

Some Updates on Timing:
The example kdTreeFiltering contains some small benchmarks: the Stanford bunny with 35947 points is loaded (fully working example in the repo, out of the box):

The results:

Bunny.txt

Loaded: 35947 points 
KDTree::  Exotic point traits , Vector3* +  id, start: =====
KdTree build took: 3.1685ms.
Tree Stats: 
     nodes      : 1199
     leafs      : 600
     tree level : 11
     avg. leaf data size : 29.9808
     min. leaf data size : 0
     max. leaf data size : 261
     min. leaf extent    : 0.00964587
     max. leaf extent    : 0.060337
SplitHeuristics Stats: 
     splits            : 599
     avg. split ratio  (0,0.5] : 0.5
     avg. point ratio  [0,0.5] : 0.22947
     avg. extent ratio (0,1]   : 0.616848
     tries / calls     : 599/716 = 0.836592
Neighbour Stats (if computed): 

     min. leaf neighbours    : 6
     max. leaf neighbours    : 69
     avg. leaf neighbours    : 18.7867
(Built with methods: midpoint, no split heuristic optimization loop)


Saving KdTree XML to: KdTreeResults.xml
KDTree:: Simple point traits , Vector3 only , start: =====
KdTree build took: 18.3371ms.
Tree Stats: 
     nodes      : 1199
     leafs      : 600
     tree level : 10
     avg. leaf data size : 29.9808
     min. leaf data size : 0
     max. leaf data size : 306
     min. leaf extent    : 0.01
     max. leaf extent    : 0.076794
SplitHeuristics Stats: 
     splits            : 599
     avg. split ratio  (0,0.5] : 0.448302
     avg. point ratio  [0,0.5] : 0.268614
     avg. extent ratio (0,1]   : 0.502048
     tries / calls     : 3312/816 = 4.05882
Neighbour Stats (if computed): 

     min. leaf neighbours    : 6
     max. leaf neighbours    : 43
     avg. leaf neighbours    : 21.11
(Built with methods: midpoint, median,geometric mean, full split heuristic optimization)

Lucy.txt model with 14 million points:

Loaded: 14027872 points 
KDTree::  Exotic point traits , Vector3* +  id, start: =====
KdTree build took: 3123.85ms.
Tree Stats: 
         nodes      : 999999
         leafs      : 500000
         tree level : 25
         avg. leaf data size : 14.0279
         min. leaf data size : 0
         max. leaf data size : 159
         min. leaf extent    : 2.08504
         max. leaf extent    : 399.26
SplitHeuristics Stats: 
         splits            : 499999
         avg. split ratio  (0,0.5] : 0.5
         avg. point ratio  [0,0.5] : 0.194764
         avg. extent ratio (0,1]   : 0.649163
         tries / calls     : 499999/636416 = 0.785648
(Built with methods: midpoint, no split heuristic optimization loop)

KDTree:: Simple point traits , Vector3 only , start: =====
KdTree build took: 7766.79ms.
Tree Stats: 
     nodes      : 1199
     leafs      : 600
     tree level : 10
     avg. leaf data size : 11699.6
     min. leaf data size : 0
     max. leaf data size : 35534
     min. leaf extent    : 9.87306
     max. leaf extent    : 413.195
SplitHeuristics Stats: 
     splits            : 599
     avg. split ratio  (0,0.5] : 0.297657
     avg. point ratio  [0,0.5] : 0.492414
     avg. extent ratio (0,1]   : 0.312965
     tries / calls     : 5391/600 = 8.985
Neighbour Stats (if computed): 

     min. leaf neighbours    : 4
     max. leaf neighbours    : 37
     avg. leaf neighbours    : 12.9233
(Built with methods: midpoint, median,geometric mean, full split heuristic optimization)

Take care with the interpretation, and look at the settings used in the example file. However, comparing with results from other people: ~3100 ms for 14*10⁶ points is quite slick :-)

Processor used: Intel® Core™ i7 CPU 970 @ 3.20GHz × 12 , 12GB Ram

Broadcasting answered 16/9, 2015 at 13:12 Comment(2)
Can you give a test of how long it takes in ms, with the number of points, your CPU, system, etc.?Britney
This highly depends on the use case for your application. If you want simple midpoint splitting, without crazy splitting heuristic optimization, then it is much faster than evaluating the best split technique for each potential split. You can compile the example in the project (I added some timing functions; the Lucy model is included in the submodule as well to test against 14 million points). I updated the answer a little bit.Broadcasting

If the kd-tree is fast for small sets, but "slow" for large (>100000?) sets, you may be suffering from flushing the processor caches. If the top few nodes are interleaved with rarely used leaf nodes, then you will fit fewer heavily used nodes in the processor cache(s). This can be improved by minimising the size of the nodes and careful layout of the nodes in memory... but eventually you will be flushing a fair number of nodes anyway. You can end up with access to main memory being the bottleneck.

Valgrind tells me one version of my code uses 5% fewer instructions, but I believe the stopwatch when it tells me it is about 10% slower for the same input. I suspect Valgrind doing a full cache simulation would tell me the right answer.

If you are multi-threaded, you may want to encourage the threads to be doing searches in similar areas to reuse the cache... that assumes a single multi-core processor - multiple processors might want the opposite approach.

Hint: You can pack more 32-bit indexes into memory than 64-bit pointers.
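
To illustrate the hint, here is one possible compact node layout (the field names are made up, not taken from any particular library):

#include <cstdint>
#include <vector>

struct KdNode
{
    float    split;      // split coordinate
    uint32_t left;       // index of the left child in the node array
    uint32_t right;      // index of the right child, or ~0u for a leaf
    uint16_t axis;       // 0, 1 or 2
    uint16_t count;      // number of points if this is a leaf
};                       // 16 bytes per node -> 4 nodes per 64-byte cache line

std::vector<KdNode> nodes;   // flat array; keep a node's children close to it in memory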

Standush answered 22/2, 2018 at 21:54 Comment(0)
