Efficient (and well explained) implementation of a Quadtree for 2D collision detection [closed]

I've been working on adding a Quadtree to a program that I'm writing, and I can't help but notice that there are few well-explained, well-performing tutorials for the kind of implementation I'm looking for.

Specifically, what I'm looking for is a list of the methods commonly used in a Quadtree (retrieve, insert, remove, etc.) and pseudocode for how to implement them (or just a description of their processes), along with maybe some tips to improve performance. This is for collision detection, so it'd be best if it were explained with 2D rectangles in mind, as they are the objects that will be stored.

Mighell asked 30/1, 2017 at 21:50 Comment(4)
StackOverflow is really more of a site where once you have an implementation and are struggling to complete it or have hit a wall you can ask for help. I would recommend attempting to do research. For example, googling quadtree 2d collision detection brought me to this question which will likely get yours closed as a duplicate. #4982366Daveen
As a matter of fact, I have been extensively looking at as many results as I could. That one didn't go into enough detail - it just gives a general idea without looking at any specifics.Mighell
That question has a comment that literally provides a working example in Java: github.com/ClickerMonkey/steerio/blob/master/Java/src/org/… Then there are several answers to that question that go into incredible detail with links to other resources and discussions such as this one https://mcmap.net/q/204885/-quadtree-for-2d-collision-detectionDaveen
Please, take a look at the book "Geometric Data Structures for Computer Graphics" by Elmar Langetepe and Gabriel Zachmann for more details and tips: dl.acm.org/doi/book/10.5555/1121721Potluck
  1. Efficient Quadtrees

All right, I'll take a shot at this. First a teaser to show the results of what I'll propose involving 20,000 agents (just something I whipped up real quick for this specific question):

[GIF: 20,000 agents colliding in the quadtree simulation]

The GIF has extremely reduced frame rate and significantly lower res to fit the 2 MB maximum for this site. Here's a video if you want to see the thing at close to full speed: https://streamable.com/3pgmn.

And a GIF with 100k agents, though I had to fiddle with it quite a bit: I had to turn off the quadtree lines (they didn't seem to want to compress as much with them on) and change the way the agents looked to get it to fit in 2 megabytes (I wish making a GIF was as easy as coding a quadtree):

[GIF: the 100,000-agent simulation, with quadtree lines turned off]

The simulation with 20k agents takes ~3 megabytes of RAM. I can also easily handle 100k smaller agents without sacrificing frame rates, though it leads to a bit of a mess on the screen to the point where you can barely see what's going on as in the GIF above. This is all running in just one thread on my i7 and I'm spending almost half the time according to VTune just drawing this stuff on the screen (just using some basic scalar instructions to plot things one pixel at a time in CPU).

And here's a video with 100,000 agents though it's hard to see what's going on. It's kind of a big video and I couldn't find any decent way to compress it without the whole video turning into a mush (might need to download or cache it first to see it stream at reasonable FPS). With 100k agents the simulation takes around 4.5 megabytes of RAM and the memory use is very stable after running the simulation for about 5 seconds (stops going up or down since it ceases to heap allocate). Same thing in slow motion.

Efficient Quadtree for Collision Detection

All right, so actually quadtrees are not my favorite data structure for this purpose. I tend to prefer a grid hierarchy, like a coarse grid for the world, a finer grid for a region, and an even finer grid for a sub-region (3 fixed levels of dense grids, and no trees involved), with row-based optimizations so that a row that has no entities in it will be deallocated and turned into a null pointer, and likewise completely empty regions or sub-regions turned into nulls. While this simple implementation of the quadtree running in one thread can handle 100k agents on my i7 at 60+ FPS, I've implemented grids that can handle a couple million agents bouncing off each other every frame on older hardware (an i3). Also I always liked how grids made it very easy to predict how much memory they'll require, since they don't subdivide cells. But I'll try to cover how to implement a reasonably efficient quadtree.

Note that I won't go into the full theory of the data structure. I'm assuming that you already know that and are interested in improving performance. I'm also just going into my personal way of tackling this problem which seems to outperform most of the solutions I find online for my cases, but there are many decent ways and these solutions are tailor-fitted to my use cases (very large inputs with everything moving every frame for visual FX in films and television). Other people probably optimize for different use cases. When it comes to spatial indexing structures in particular, I really think the efficiency of the solution tells you more about the implementer than the data structure. Also the same strategies I'll propose to speeding things up also apply in 3 dimensions with octrees.

Node Representation

So first of all, let's cover the node representation:

// Represents a node in the quadtree.
struct QuadNode
{
    // Points to the first child if this node is a branch or the first
    // element if this node is a leaf.
    int32_t first_child;

    // Stores the number of elements in the leaf or -1 if this node is
    // not a leaf.
    int32_t count;
};

It's a total of 8 bytes, and this is very important as it's a key part of the speed. I actually use a smaller one (6 bytes per node) but I'll leave that as an exercise to the reader.

You can probably do without the count. I include that for pathological cases to avoid linearly traversing the elements and counting them each time a leaf node might split. In most common cases a node shouldn't store that many elements. However, I work in visual FX and the pathological cases aren't necessarily rare. You can encounter artists creating content with a boatload of coincident points, massive polygons that span the entire scene, etc, so I end up storing a count.

Where are the AABBs?

So one of the first things people might be wondering is where the bounding boxes (rectangles) are for the nodes. I don't store them. I compute them on the fly. I'm kinda surprised most people don't do that in the code I've seen. For me, they're only stored with the tree structure (basically just one AABB for the root).

That might seem like it'd be more expensive to be computing these on the fly, but reducing the memory use of a node can proportionally reduce cache misses when you traverse the tree, and those cache miss reductions tend to be more significant than having to do a couple of bitshifts and some additions/subtractions during traversal. Traversal looks like this:

static QuadNodeList find_leaves(const Quadtree& tree, const QuadNodeData& root, const int rect[4])
{
    QuadNodeList leaves, to_process;
    to_process.push_back(root);
    while (to_process.size() > 0)
    {
        const QuadNodeData nd = to_process.pop_back();

        // If this node is a leaf, insert it to the list.
        if (tree.nodes[nd.index].count != -1)
            leaves.push_back(nd);
        else
        {
            // Otherwise push the children that intersect the rectangle.
            const int mx = nd.crect[0], my = nd.crect[1];
            const int hx = nd.crect[2] >> 1, hy = nd.crect[3] >> 1;
            const int fc = tree.nodes[nd.index].first_child;
            const int l = mx-hx, t = my-hy, r = mx+hx, b = my+hy;

            if (rect[1] <= my)
            {
                if (rect[0] <= mx)
                    to_process.push_back(child_data(l,t, hx, hy, fc+0, nd.depth+1));
                if (rect[2] > mx)
                    to_process.push_back(child_data(r,t, hx, hy, fc+1, nd.depth+1));
            }
            if (rect[3] > my)
            {
                if (rect[0] <= mx)
                    to_process.push_back(child_data(l,b, hx, hy, fc+2, nd.depth+1));
                if (rect[2] > mx)
                    to_process.push_back(child_data(r,b, hx, hy, fc+3, nd.depth+1));
            }
        }
    }
    return leaves;
}
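
The QuadNodeData type and the child_data helper used above aren't shown anywhere in this answer. Based on how they're used in find_leaves (nd.index, nd.depth, and a centered rect in nd.crect), a minimal sketch could look like the following; the names come from the code above, but the exact field layout is my assumption:

// Sketch (assumed layout): the per-node data pushed onto the traversal stack.
// 'crect' is a centered rect: crect[0]=center x, crect[1]=center y,
// crect[2]=half width, crect[3]=half height.
struct QuadNodeData
{
    int crect[4];
    int index;
    int depth;
};

// QuadNodeList would then just be a small list of these, e.g.:
// typedef SmallList<QuadNodeData> QuadNodeList;

// Builds the traversal data for a child node during find_leaves.
static QuadNodeData child_data(int mx, int my, int hx, int hy, int index, int depth)
{
    QuadNodeData cd;
    cd.crect[0] = mx;
    cd.crect[1] = my;
    cd.crect[2] = hx;
    cd.crect[3] = hy;
    cd.index = index;
    cd.depth = depth;
    return cd;
}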

Omitting the AABBs is one of the most unusual things I do (I keep looking for other people doing it just to find a peer and fail), but I've measured the before and after and it did reduce times considerably, at least on very large inputs, to compact the quadtree node substantially and just compute AABBs on the fly during traversal. Space and time aren't always diametrically opposed. Sometimes reducing space also means reducing time given how much performance is dominated by the memory hierarchy these days. I've even sped up some real world operations applied on massive inputs by compressing the data to quarter the memory use and decompressing on the fly.

I don't know why many people choose to cache the AABBs: whether it's programming convenience or if it's genuinely faster in their cases. Yet for data structures which split evenly down the center like regular quadtrees and octrees, I'd suggest measuring the impact of omitting the AABBs and computing them on the fly. You might be quite surprised. Of course it makes sense to store AABBs for structures that don't split evenly like Kd-trees and BVHs as well as loose quadtrees.

Floating-Point

I don't use floating-point for spatial indexes and maybe that's why I see improved performance just computing AABBs on the fly with right shifts for division by 2 and so forth. That said, at least SPFP seems really fast nowadays. I don't know since I haven't measured the difference. I just use integers by preference even though I'm generally working with floating-point inputs (mesh vertices, particles, etc). I just end up converting them to integer coordinates for the purpose of partitioning and performing spatial queries. I'm not sure if there's any major speed benefit of doing this anymore. It's just a habit and preference since I find it easier to reason about integers without having to think about denormalized FP and all that.

Centered AABBs

While I only store a bounding box for the root, it helps to use a representation that stores a center and half size for nodes while using a left/top/right/bottom representation for queries to minimize the amount of arithmetic involved.
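
The tree representation further down stores that root rect as a QuadCRect. Its layout isn't shown, but a centered representation along these lines is presumably what's meant (the field names here are my assumption):

// Sketch (assumed layout): centered rectangle used for the root; node rects
// are derived from it on the fly during traversal.
struct QuadCRect
{
    int mid_x, mid_y;   // center
    int half_w, half_h; // half-size
};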

Contiguous Children

This is likewise key, and if we refer back to the node rep:

struct QuadNode
{
    int32_t first_child;
    ...
};

We don't need to store an array of children because all 4 children are contiguous:

first_child+0 = index to 1st child (TL)
first_child+1 = index to 2nd child (TR)
first_child+2 = index to 3rd child (BL)
first_child+3 = index to 4th child (BR)

This not only significantly reduces cache misses on traversal but also allows us to significantly shrink our nodes which further reduces cache misses, storing only one 32-bit index (4 bytes) instead of an array of 4 (16 bytes).

This does mean that if you need to transfer elements to just a couple of quadrants of a parent when it splits, it must still allocate all 4 child leaves to store elements in just two quadrants while having two empty leaves as children. However, the trade-off is more than worth it performance-wise at least in my use cases, and remember that a node only takes 8 bytes given how much we've compacted it.

When deallocating children, we deallocate all four at a time. I do this in constant-time using an indexed free list, like so:

[diagram: an indexed free list]

Except we're pooling out memory chunks containing 4 contiguous elements instead of one at a time. This makes it so we usually don't need to involve any heap allocations or deallocations during the simulation. A group of 4 nodes gets marked as freed indivisibly only to then be reclaimed indivisibly in a subsequent split of another leaf node.
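
As a sketch of what that allocation/deallocation might look like (the helper names are mine; nodes and free_node are the fields shown in the tree representation further down):

// Sketch (assumed helpers): grabs 4 contiguous child nodes at once, reusing a
// previously freed group of 4 if one is available, and returns the index of
// the first of the 4.
int Quadtree::alloc_4_children()
{
    int first;
    if (free_node != -1)
    {
        // Reclaim a freed group: the first node of each freed group stores the
        // index of the next freed group in its 'first_child' field.
        first = free_node;
        free_node = nodes[first].first_child;
    }
    else
    {
        // No freed groups available: append 4 fresh nodes to the back.
        first = static_cast<int>(nodes.size());
        nodes.resize(nodes.size() + 4);
    }
    // Initialize all 4 children as empty leaves.
    for (int j = 0; j < 4; ++j)
    {
        nodes[first + j].first_child = -1;
        nodes[first + j].count = 0;
    }
    return first;
}

// Freeing is the reverse and is exactly what the cleanup function below does:
// push the group of 4 (identified by its first node) onto the free list.
void Quadtree::free_4_children(int first)
{
    nodes[first].first_child = free_node;
    free_node = first;
}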

Deferred Cleanup

I don't update the quadtree's structure right away upon removing elements. When I remove an element, I just descend down the tree to the child node(s) it occupies and then remove the element, but I don't bother to do anything more just yet even if the leaves become empty.

Instead I do a deferred cleanup like this:

void Quadtree::cleanup()
{
    // Only process the root if it's not a leaf.
    SmallList<int> to_process;
    if (nodes[0].count == -1)
        to_process.push_back(0);

    while (to_process.size() > 0)
    {
        const int node_index = to_process.pop_back();
        QuadNode& node = nodes[node_index];

        // Loop through the children.
        int num_empty_leaves = 0;
        for (int j=0; j < 4; ++j)
        {
            const int child_index = node.first_child + j;
            const QuadNode& child = nodes[child_index];

            // Increment empty leaf count if the child is an empty 
            // leaf. Otherwise if the child is a branch, add it to
            // the stack to be processed in the next iteration.
            if (child.count == 0)
                ++num_empty_leaves;
            else if (child.count == -1)
                to_process.push_back(child_index);
        }

        // If all the children were empty leaves, remove them and 
        // make this node the new empty leaf.
        if (num_empty_leaves == 4)
        {
            // Push all 4 children to the free list.
            nodes[node.first_child].first_child = free_node;
            free_node = node.first_child;

            // Make this node the new empty leaf.
            node.first_child = -1;
            node.count = 0;
        }
    }
}

This is called at the end of every single frame after moving all the agents. The reason I do this kind of deferred removal of empty leaf nodes in multiple iterations, rather than all at once in the process of removing a single element, is that element A might move out of node N1 into node N2, leaving N1 empty. However, element B might, in the same frame, move into N1, making it occupied again.

With the deferred cleanup, we can handle such cases without unnecessarily removing children only to add them right back when another element moves into that quadrant.

Moving elements in my case is straightforward: 1) remove the element, 2) move it, 3) reinsert it into the quadtree. After we move all the elements and at the end of a frame (not time step; there could be multiple time steps per frame), the cleanup function above is called to remove the children from a parent which has 4 empty leaves as children, which effectively turns that parent into the new empty leaf which might then be cleaned up in the next frame with a subsequent cleanup call (or not, if things get inserted into it or if the empty leaf's siblings are non-empty).

Let's look at the deferred cleanup visually:

[diagram: the initial quadtree state]

Starting with this, let's say we remove some elements from the tree leaving us with 4 empty leaves:

[diagram: the tree after removals, with 4 empty sibling leaves]

At this point, if we call cleanup, it will remove 4 leaves if it finds 4 empty child leaves and turn the parent into an empty leaf, like so:

[diagram: after cleanup, the 4 empty leaves are removed and their parent becomes an empty leaf]

Let's say we remove some more elements:

[diagram: the tree after removing more elements]

... and then call cleanup again:

[diagram: the tree after another cleanup pass]

If we call it yet again, we end up with this:

[diagram: the tree after a further cleanup pass]

... at which point the root itself turns into an empty leaf. However, the cleanup method never removes the root (it only removes children). Again, the main point of doing it deferred this way and in multiple steps is to reduce the amount of potential redundant work that could occur per time step (which can be a lot) if we did this all immediately every single time an element is removed from the tree. It also helps to distribute that work across frames to avoid stutters.

TBH, I originally applied this "deferred cleanup" technique in a DOS game I wrote in C out of sheer laziness! I didn't want to bother with descending down the tree, removing elements, and then removing nodes in a bottom-up fashion back then because I originally wrote the tree to favor top-down traversal (not top-down and back up again) and really thought this lazy solution was a productivity compromise (sacrificing optimal performance to get implemented faster). However, many years later, I actually got around to implementing quadtree removal in ways that immediately started removing nodes and, to my surprise, I actually significantly made it slower with more unpredictable and stuttery frame rates. The deferred cleanup, in spite of originally being inspired by my programmer laziness, was actually (and accidentally) a very effective optimization for dynamic scenes.

Singly-Linked Index Lists for Elements

For elements, I use this representation:

// Represents an element in the quadtree.
struct QuadElt
{
    // Stores the ID for the element (can be used to
    // refer to external data).
    int id;

    // Stores the rectangle for the element.
    int x1, y1, x2, y2;
};

// Represents an element node in the quadtree.
struct QuadEltNode
{
    // Points to the next element in the leaf node. A value of -1 
    // indicates the end of the list.
    int next;

    // Stores the element index.
    int element;
};

I use an "element node" which is separate from "element". An element is only inserted once to the quadtree no matter how many cells it occupies. However, for each cell it occupies, an "element node" is inserted which indexes that element.

The element node is a singly-linked index list node into an array, and also using the free list method above. This incurs some more cache misses over storing all the elements contiguously for a leaf. However, given that this quadtree is for very dynamic data which is moving and colliding every single time step, it generally takes more processing time than it saves to seek out a perfectly contiguous representation for the leaf elements (you would effectively have to implement a variable-sized memory allocator which is really fast, and that's far from an easy thing to do). So I use the singly-linked index list which allows a free list constant-time approach to allocation/deallocation. When you use that representation, you can transfer elements from split parents to new leaves by just changing a few integers.
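
For example, pushing an element node onto a leaf with this representation is just a couple of integer writes. A sketch (insert_to_leaf is my name for it, and it assumes leaf_index refers to a leaf, not a branch):

// Sketch: pushes an element node referring to 'element_index' onto the front
// of the leaf's singly-linked list of element nodes.
void Quadtree::insert_to_leaf(int leaf_index, int element_index)
{
    QuadEltNode elt_node;
    elt_node.element = element_index;

    // The new node points to the leaf's current head (-1 if the leaf is empty).
    elt_node.next = nodes[leaf_index].first_child;

    // The leaf's head now points at the new element node.
    nodes[leaf_index].first_child = elt_nodes.insert(elt_node);
    ++nodes[leaf_index].count;
}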

SmallList<T>

Oh, I should mention this: naturally it helps if you don't heap allocate just to store a temporary stack of nodes for non-recursive traversal. SmallList<T> is similar to vector<T> except it won't involve a heap allocation until you insert more than 128 elements into it. It's similar to the small buffer optimization (SBO) used for strings in the C++ standard library. It's something I implemented and have been using for ages, and it helps a lot to make sure you use the stack whenever possible.
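
For reference, a heavily simplified sketch of what such a container might look like for trivially copyable types (this is my own minimal version, not the author's; the real one presumably handles copying and much more):

#include <cstdlib>
#include <cstring>

// Sketch: keeps the first 128 elements in a fixed buffer on the stack and only
// touches the heap if that buffer overflows.
template <class T>
class SmallList
{
public:
    SmallList(): ld(fixed), num(0), cap(128) {}
    ~SmallList()
    {
        if (ld != fixed)
            std::free(ld);
    }

    void push_back(const T& element)
    {
        if (num >= cap)
        {
            // Out of room: double the capacity and move to (or grow) heap memory.
            cap *= 2;
            T* new_mem = static_cast<T*>(std::malloc(cap * sizeof(T)));
            std::memcpy(new_mem, ld, num * sizeof(T));
            if (ld != fixed)
                std::free(ld);
            ld = new_mem;
        }
        ld[num++] = element;
    }

    T pop_back() { return ld[--num]; }
    int size() const { return num; }
    T& operator[](int n) { return ld[n]; }
    const T& operator[](int n) const { return ld[n]; }

private:
    T fixed[128]; // stack storage for the common case
    T* ld;        // points to 'fixed' or to heap memory
    int num, cap;

    // Copying omitted to keep the sketch short.
    SmallList(const SmallList&);
    SmallList& operator=(const SmallList&);
};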

Tree Representation

Here's the representation of the quadtree itself:

struct Quadtree
{
    // Stores all the elements in the quadtree.
    FreeList<QuadElt> elts;

    // Stores all the element nodes in the quadtree.
    FreeList<QuadEltNode> elt_nodes;

    // Stores all the nodes in the quadtree. The first node in this
    // sequence is always the root.
    std::vector<QuadNode> nodes;

    // Stores the quadtree extents.
    QuadCRect root_rect;

    // Stores the first free node in the quadtree to be reclaimed as 4
    // contiguous nodes at once. A value of -1 indicates that the free
    // list is empty, at which point we simply insert 4 nodes to the
    // back of the nodes array.
    int free_node;

    // Stores the maximum depth allowed for the quadtree.
    int max_depth;
};

As pointed out above, we store a single rectangle for the root (root_rect). All sub-rects are computed on the fly. All nodes are stored contiguously in an array (std::vector<QuadNode>) along with the elements and element nodes (in FreeList<T>).

FreeList<T>

I use a FreeList data structure which is basically an array (and random-access sequence) that lets you remove elements from anywhere in constant-time (leaving holes behind which get reclaimed upon subsequent insertions in constant-time). Here's a simplified version which doesn't bother with handling non-trivial data types (doesn't use placement new or manual destruction calls):

/// Provides an indexed free list with constant-time removals from anywhere
/// in the list without invalidating indices. T must be trivially constructible 
/// and destructible.
template <class T>
class FreeList
{
public:
    /// Creates a new free list.
    FreeList();

    /// Inserts an element to the free list and returns an index to it.
    int insert(const T& element);

    // Removes the nth element from the free list.
    void erase(int n);

    // Removes all elements from the free list.
    void clear();

    // Returns the range of valid indices.
    int range() const;

    // Returns the nth element.
    T& operator[](int n);

    // Returns the nth element.
    const T& operator[](int n) const;

private:
    union FreeElement
    {
        T element;
        int next;
    };
    std::vector<FreeElement> data;
    int first_free;
};

template <class T>
FreeList<T>::FreeList(): first_free(-1)
{
}

template <class T>
int FreeList<T>::insert(const T& element)
{
    if (first_free != -1)
    {
        const int index = first_free;
        first_free = data[first_free].next;
        data[index].element = element;
        return index;
    }
    else
    {
        FreeElement fe;
        fe.element = element;
        data.push_back(fe);
        return static_cast<int>(data.size() - 1);
    }
}

template <class T>
void FreeList<T>::erase(int n)
{
    data[n].next = first_free;
    first_free = n;
}

template <class T>
void FreeList<T>::clear()
{
    data.clear();
    first_free = -1;
}

template <class T>
int FreeList<T>::range() const
{
    return static_cast<int>(data.size());
}

template <class T>
T& FreeList<T>::operator[](int n)
{
    return data[n].element;
}

template <class T>
const T& FreeList<T>::operator[](int n) const
{
    return data[n].element;
}
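
Usage looks like this; the key property is that indices stay stable across removals, which is what lets the quadtree refer to elements and element nodes by plain 32-bit ints:

// Example: holes left by erase() get reclaimed by later insertions.
void free_list_example()
{
    FreeList<QuadElt> elts;

    QuadElt e;
    e.id = 0; e.x1 = 0; e.y1 = 0; e.x2 = 10; e.y2 = 10;
    const int first = elts.insert(e);  // returns 0

    e.id = 1;
    const int second = elts.insert(e); // returns 1

    elts.erase(first);                 // index 0 becomes a hole; 'second' stays valid

    e.id = 2;
    const int third = elts.insert(e);  // reclaims the hole and returns 0 again
}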

I have one which does work with non-trivial types and provides iterators and so forth but is much more involved. These days I tend to work more with trivially constructible/destructible C-style structs anyway (only using non-trivial user-defined types for high-level stuff).

Maximum Tree Depth

I prevent the tree from subdividing too much by specifying a max depth allowed. For the quick simulation I whipped up, I used 8. For me this is crucial since again, in VFX I encounter pathological cases a lot, including content created by artists with lots of coincident or overlapping elements which, without a maximum tree depth limit, could cause it to subdivide indefinitely.

There is a bit of fine-tuning if you want optimal performance with respect to max depth allowed and how many elements you allow to be stored in a leaf before it splits into 4 children. I tend to find the optimal results are gained with something around 8 elements max per node before it splits, and a max depth set so that the smallest cell size is a little over the size of your average agent (otherwise more single agents could be inserted into multiple leaves).

Collision and Queries

There are a couple of ways to do the collision detection and queries. I often see people do it like this:

for each element in scene:
     use quad tree to check for collision against other elements

This is very straightforward but the problem with this approach is that the first element in the scene might be in a totally different location in the world from the second. As a result, the paths we take down the quadtree could be totally sporadic. We might traverse one path to a leaf and then want to go down that same path again for the first element as, say, the 50,000th element. By this time the nodes involved in that path may have already been evicted from the CPU cache. So I recommend doing it this way:

traversed = {}
gather quadtree leaves
for each leaf in leaves:
{
     for each element in leaf:
     {
          if not traversed[element]:
          {
              use quad tree to check for collision against other elements
              traversed[element] = true                  
          }
     }
}

While that's quite a bit more code and requires we keep a traversed bitset or parallel array of some sort to avoid processing elements twice (since they may be inserted in more than one leaf), it helps make sure that we descend the same paths down the quadtree throughout the loop. That helps keep things much more cache-friendly. Also if after attempting to move the element in the time step, it's still encompassed entirely in that leaf node, we don't even need to work our way back up again from the root (we can just check that one leaf only).
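
Sketched in C++ against the structures in this answer (and leaning on my assumed QuadNodeData/QuadCRect layouts from earlier, so treat the details as illustrative rather than the exact code behind the GIFs), the outer loop might look roughly like this, with the actual pair testing elided:

#include <vector>

// Sketch: processes collisions leaf by leaf so that queries issued for
// elements sharing a leaf descend the same (cache-warm) paths down the tree.
void process_collisions(Quadtree& tree)
{
    // One "already processed" flag per element slot in the element FreeList.
    std::vector<bool> traversed(tree.elts.range(), false);

    // Build the root's traversal data and a [left, top, right, bottom] rect
    // covering the whole tree (using the centered root_rect).
    QuadNodeData root;
    root.crect[0] = tree.root_rect.mid_x;
    root.crect[1] = tree.root_rect.mid_y;
    root.crect[2] = tree.root_rect.half_w;
    root.crect[3] = tree.root_rect.half_h;
    root.index = 0;
    root.depth = 0;
    const int rect[4] = {root.crect[0] - root.crect[2], root.crect[1] - root.crect[3],
                         root.crect[0] + root.crect[2], root.crect[1] + root.crect[3]};

    // Gather every leaf in a spatially coherent order.
    QuadNodeList leaves = find_leaves(tree, root, rect);

    for (int j = 0; j < leaves.size(); ++j)
    {
        // Walk the leaf's singly-linked list of element nodes.
        for (int en = tree.nodes[leaves[j].index].first_child; en != -1;
             en = tree.elt_nodes[en].next)
        {
            const int element = tree.elt_nodes[en].element;
            if (traversed[element])
                continue;
            traversed[element] = true;

            // Query the tree with this element's rect (find_leaves again) and
            // test it against the other elements in the leaves it returns.
            // ...
        }
    }
}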

Common Inefficiencies: Things to Avoid

While there are many ways to skin the cat and achieve an efficient solution, there is a common way to achieve a very inefficient solution. And I've encountered my share of very inefficient quadtrees, kd trees, and octrees in my career working in VFX. We're talking over a gigabyte of memory use just to partition a mesh with 100k triangles while taking 30 secs to build, when a decent implementation should be able to do the same hundreds of times a second and just take a few megs. There are many people whipping these up to solve problems who are theoretical wizards but didn't pay much attention to memory efficiency.

So the absolute most common no-no I see is to store one or more full-blown containers with each tree node. By full-blown container, I mean something that owns and allocates and frees its own memory, like this:

struct Node
{
     ...

     // Stores the elements in the node.
     List<Element> elements;
};

And List<Element> might be a list in Python, an ArrayList in Java or C#, std::vector in C++, etc: some data structure that manages its own memory/resources.

The problem here is that while such containers are very efficiently implemented for storing a large number of elements, all of them in all languages are extremely inefficient if you instantiate a boatload of them only to store a few elements in each one. One of the reasons is that the container metadata tends to be quite explosive in memory usage at such a granular level of a single tree node. A container might need to store a size, capacity, a pointer/reference to data it allocates, etc, and all for a generalized purpose so it might use 64-bit integers for size and capacity. As a result the metadata just for an empty container might be 24 bytes which is already 3 times larger than the entirety of the node representation I proposed, and that's just for an empty container designed to store elements in leaves.

Furthermore each container often wants to either heap/GC-allocate on insertion or require even more preallocated memory in advance (at which point it might take 64 bytes just for the container itself). So that either becomes slow because of all the allocations (note that GC allocations are really fast initially in some implementations like JVM, but that's only for the initial burst Eden cycle) or because we're incurring a boatload of cache misses because the nodes are so huge that barely any fit into the lower levels of the CPU cache on traversal, or both.

Yet this is a very natural inclination and makes intuitive sense since we talk about these structures theoretically using language like, "Elements are stored in the leaves" which suggests storing a container of elements in leaf nodes. Unfortunately it has an explosive cost in terms of memory use and processing. So avoid this if the desire is to create something reasonably efficient. Make the Node share and point to (refer to) or index memory allocated and stored for the entire tree, not for every single individual node. In actuality the elements shouldn't be stored in the leaves.

Elements should be stored in the tree and leaf nodes should index or point to those elements.

Conclusion

Phew, so those are the main things I do to achieve what is hopefully considered a decent-performing solution. I hope that helps. Note that I am aiming this at a somewhat advanced level for people who have already implemented quadtrees at least once or twice. If you have any questions, feel free to shoot.

Since this question is a bit broad, I might come and edit it and keep tweaking and expanding it over time if it doesn't get closed (I love these types of questions since they give us an excuse to write about our experiences working in the field, but the site doesn't always like them). I'm also hoping some experts might jump in with alternative solutions I can learn from and perhaps use to improve mine further.

Again quadtrees aren't actually my favorite data structure for extremely dynamic collision scenarios like this. So I probably have a thing or two to learn from people who do favor quadtrees for this purpose and have been tweaking and tuning them for years. Mostly I use quadtrees for static data that doesn't move around every frame, and for those I use a very different representation from the one proposed above.

Lucaslucca answered 18/1, 2018 at 21:7 Comment(33)
That was a very well done explanation. Thank you (even if it was nearly a year old question :)Mighell
You write about removal/moving elements in the quadtree, but I'm struggling to understand how to do that in reasonable time. Do you employ some map type to get from an element id to its respective location in the quadtree? Or am I overlooking something painfully obvious?Scalise
@Scalise In 2. - 'descend down the tree to where the leaf/leaves in which the element is stored (which you can determine using its rectangle, e.g.)'Hardeman
@Dragon Energy I'm interested in an explanation on how you're using/working with your prefered grid hierarchy. That would be great. :)Sift
I'm genuinely interested in your coarse+fine grid suggestion. How would you implement it? Could you maybe write a little bit more about it, here or somewhere else? Have you already written about it ?Philhellene
@Dragon Energy I asked here, feel free to answer #59796069Philhellene
For those wondering, the link above does contain an answer from DragonEnergy.Sublime
@Philhellene Sorry, I haven't used this site in ages! I am kind of rusty and forgetful but this COVID-19 stuff has given me more spare time. Anyway, it probably helps to just think of stuff in terms of fixed resolutions first. Like visually you can think of a quadtree as a grid where the cells can divide into sub-grids, and sub-sub-grids, and so forth. Maybe you put a limit where it can't divide further than a pixel in size or something like that. My coarse/fine grid solutions are similar but just don't bother with trees. They just store big arrays of stuff.Lucaslucca
@Philhellene So like a quadtree, when you imagine it visually, has a sort of variable resolution for each grid cell. Some cells will divide more, some less, and the depth of the tree is variable (might be like 16 levels deep in some cases unless you put a recursion limit). My favorite solutions are not so dissimilar but I just hardcode like a big grid, medium grid, small grid. It's like hard-coding the levels of the quadtree/octree. I always got the best performance outta those.Lucaslucca
@Philhellene It's also built bottom-up. Say I start with a 1000x1000 grid. It's very fast to insert stuff and involves no traversal of links in a tree. However, it's slow for broad searches (like searching a big area of a game map) because it has to check a boatload of cells. But it's a pretty good and super easy solution to code. Now, after you have that and you want to accelerate large queries, you can build another grid that's lower in resolution, like 500x500. Then maybe 250x250. It is very similar to mipmapping for texture images.Lucaslucca
@Philhellene But I hardcoded it in the past in terms of depth (I rarely used more than three). It's not a smart data structure. It's all hard-coded. Whereas the quad-tree is all smart and adapts/re-adapts the grid resolution/tree depth based on what's inserted/removed. However, that comes at the cost of involving a linked structure and having to chase links (indices or pointers) for tree traversal and I've always found that harder to optimize. It is easier for me to handle hundreds of thousands to millions of agents with a dumber data structure that's a bit more hardcoded and not very adaptive.Lucaslucca
@Philhellene What the lower resolution grids store can be up to you to accelerate searches, but a very simple thing to do is just make them booleans or bitsets storing which cells are occupied/empty in your full-resolution grid. Then you start from the lowest resolution grid ("root grid") and avoid descending down into higher resolution grids if the lower resolution grid ("parent grid") indicates that all the cells of the higher-resolution grid ("child grid") are unoccupied.Lucaslucca
@Philhellene I edited one of my answers (the theory one) to explain the coarser/finer grid with code to illustrate. Hope that helps!Lucaslucca
Interesting ideas here, thanks for sharing. I tried out your example C code and it's not as performant as it could be. However, the general guidelines of reducing struct sizes is a great recommendation. If I could add some lessons from my own days of testing: x) union of 2 pointers is better than using element index, size can be used to determine branch or leaf. x) Linear slab allocators that allocate sequential Element arrays are better than index based linked lists x) Branch node sub-nodes can be allocated in a group of 4 elements, which means only 1 pointer stored and easy lookup.Breezeway
Additional comments: if you have a multi-threaded system, then having a cleanup routine that iterates the tree is a big no-no. I found that having a FrontAllocator and BackAllocator allows for recreating the whole tree while another thread is still searching the tree. The Root pointers can be swapped atomically.Breezeway
Additional comment: I also suggest leaving the Quadtree Leaf split threshold as some sort of a parameter. From my tests I can see that it depends heavily on your personal use case and tweaking this parameter can improve performance a lot. For a space game with vast distances, I'm using 64 as the split threshold.Breezeway
Furthermore, storing the x, y, radius inside the Branch nodes yields 10% better performance, probably due to data locality.Breezeway
@JormaRebane Cheers and thanks for the informative comments! I wish the question was still open so you could contribute. In terms of memory allocators, personally I use this brutish solution which is like a series of freelists per thread with power of two sizes which avoid the linked list (even in indexed form). A buddy allocator sounds like a superior and more generalized solution. I've always been a bit heavy-handed with these things. I was afraid to go too deep into it though since I wasn't quite sure of the target level of the audience asking such Qs. IMO the world of "average" code...Lucaslucca
@JormaRebane [...] could use plenty of improvement just storing things more contiguously, even for linked structures, as a start over pointing to sporadic places in memory. For unions, normally I rep my nodes this way but I might have overthought the question trying to be language-agnostic. For splitting, 64 is quite intense to me but your use case can cluster a lot of elements in a tiny area? One of the goofy things I find in VFX is that I just can't predict the way that artists will create content whatsoever, so I've attempted to find constants [...]Lucaslucca
@JormaRebane [...] that fit a sort of "don't screw up too much in pathological cases I can't predict" case. My tendency is to think of the tree as a balanced tree even if it won't be just due to that blind spot. Anyway, cheers!Lucaslucca
@Dragon Energy Thank you for your reply, as always lots of good ideas. The split threshold of 64 is just really specific to the particular space game. There are a lot of missiles and projectiles flying around, which is a situation where the performance matters, so we optimize to match that intense situation. Furthermore, I am trying more spatial segmentation algorithms, and surprisingly a Grid is much faster, even though our game universe is gigantic.Breezeway
@Dragon Energy [...] Also, I am further looking into trying out the 2-level hierarchical grid. I would expect it to have similar performance with less memory usage due to empty cells. But I usually defer to empirical time measurements, because often the assumptions about performance happen to be incorrect or unexpectedly insignificant.Breezeway
Have you considered storing the nodes in a hashmap? By storing the Morton Code in each node (4 bytes) you can calculate the children and parent hashes, the depth, size and position of the node. And by using the last 3 bits of the hash is possible to build a quite fast hashmap which has a very balanced load (afaik). By this way you don't need to allocate nodes in packs of 4 in an array.Arbuckle
@EnderLook Well-distributed hash maps with linear probing are beautiful to me, but I've never had much success personally in getting faster results over even tree-based solutions when it comes to spatial indices. That might be tied though to my use cases or just what I suck at as a programmer. :-D The spatial hash has been, in my experience, and possibly given my personal faults, the most difficult structure to optimize in my use cases.Lucaslucca
@EnderLook One unorthodox solution that I'd had more success with though is quantization. If we have, say, floating values for x, y (and possibly z), I have had some cases where I was able to successfully implement faster solutions than my competitors by quantizing them to a small fixed range and partitioning, relative to their AABBs. It boiled down to something like (int16_t)((some_float - aabb.min) * aabb.inv_max * some_factor). In cases like mesh welding of vertices, I've gotten the best results like this.Lucaslucca
@EnderLook One of the difficulties I've always found just in my experience, and it might just be due to personal faults and failures, is that I've found a tendency when I can't reason about the general work involved for the machine that I tend to create solutions which aren't so efficient for it in the widest range of use cases. If I can poke some worst-case scenario in a hypothetical structure like a hash map using morton/z-order, it has been in my experience that I can likewise find similar bottlenecks at runtime before even reaching for the profiler...Lucaslucca
Thanks for your detailed explanation. I actually don't have any experience with quadtrees, last month I just implemented an octree for a pathfinding system, and during my research about those structures, I read that Morton codes were useful for both quadtrees and octrees, but I don't have any real proof about those words.Arbuckle
@EnderLook Immediately the linear quadtree built from z-ordering has so many desirable properties with improved data locality. And I've not actually tried comparing it directly to any of my quadtree implementations so I might be able to do a better job with it than a quadtree (but most of the time I don't use quadtrees that much compared to grids). Hard-coding multiple levels of grids has always given me the best results in spite of being the most brutish/dumb. What I find difficult about linear/morton quadtrees [...]Lucaslucca
@EnderLook [...] and I've only tried them in cases involving points (specifically for efficient welding of coincident points in 3 dimensions); the time it took me to encode and sort the data into a nice one-dimensional array ended up being greater, at least with my implementation (which could have been poor), than just using more brutish multi-dimensional array solutions like a 3D grid where we just quantize/clamp the input data and stick it into its appropriate cell. Where I could see the linear quadtree being great even with my attempts [...]Lucaslucca
@EnderLook [...] is in more static cases where we're spending more time searching the structure than building it. Most of the cases I deal with though are quite different and frequently involve building and updating/rebuilding the structure just about as often as we're searching it.Lucaslucca
@EnderLook One thought that's been tickling me lately is the idea of N-ary trees as spatial indices where N is a fairly large number, like 32 or 64, reducing the number of nodes traversed down from log2/4/8(N) to log32/64(N). If, at every branch node, we had a way to encode/hash the search area in constant time in ways similar to Morton encoding without comparing every one of the 32/64 child nodes/links linearly to determine which child link(s) to follow, that seems much better suited for our hardware and can load in less memory overall traversing from root to leaf.Lucaslucca
How is QuadNodeData implemented? I'm struggling to understand some of the code that involves itArmistead
Just stumbled across this answer. This is a gem. Can't upvote enough!Beatup

2. Fundamentals

For this answer (sorry, I ran out of the character limit again), I'll focus more on fundamentals aimed at someone new to these structures.

All right, so let's say we have a bunch of elements like this in space:

[image: a bunch of elements scattered in space]

And we want to find out what element is under the mouse cursor, or what elements intersect/collide with each other, or what the nearest element is to another element, or something of this sort.

In that case, if the only data we had was a bunch of element positions and sizes/radii in space, we'd have to loop through everything to find out what element is within a given search area. For collision detection, we'd have to loop through every single element and then, for each element, loop through all other elements, making it an explosive quadratic complexity algorithm. That isn't going to hold up on non-trivial input sizes.

Subdivide

So what can we do about this problem? One of the straightforward approaches is to subdivide the search space (screen, e.g.) into a fixed number of cells, like so:

[image: the space subdivided into a fixed grid of cells]

Now let's say you want to find the element under your mouse cursor at position (cx, cy). In that case, all we have to do is check the elements in the cell under the mouse cursor:

grid_x = floor(cx / cell_size);
grid_y = floor(cy / cell_size);
for each element in cell(grid_x, grid_y):
{
     if element is under cx,cy:
         do something with element (hover highlight it, e.g)
}

Similar thing for collision detection. If we want to see what elements intersect (collide) with a given element:

grid_x1 = floor(element.x1 / cell_size);
grid_y1 = floor(element.y1 / cell_size);
grid_x2 = floor(element.x2 / cell_size);
grid_y2 = floor(element.y2 / cell_size);

for grid_y = grid_y1, grid_y2:
{
     for grid_x = grid_x1, grid_x2:
     {
         for each other_element in cell(grid_x, grid_y):
         {
             if element != other_element and collide(element, other_element):
             {
                 // The two elements intersect. Do something in response
                 // to the collision.
             }
         }
     }
}

And I recommend people start this way by dividing up the space/screen into a fixed number of grid cells like 10x10, or 100x100, or even 1000x1000. Some people might think 1000x1000 would be explosive in memory use but you can make each cell require only 4 bytes with 32-bit integers, like so:

[image: grid cells storing 32-bit indices into a separate node array]

... at which point even a million cells takes less than 4 megabytes.
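
To make that concrete, here's a sketch of a grid whose cells are nothing but 32-bit head indices into one shared node array (all names here are mine, and clamping, removal, and a node free list are omitted for brevity):

#include <vector>

// Sketch: a fixed-resolution grid where each cell costs only 4 bytes.
struct GridNode
{
    int next;    // index of the next node in this cell's list, or -1
    int element; // index of the element this node refers to
};

struct Grid
{
    // One 32-bit head index per cell; -1 means the cell is empty.
    std::vector<int> cell_heads;

    // All nodes for the entire grid, stored in one array.
    std::vector<GridNode> nodes;

    int num_cols, num_rows;
    float cell_w, cell_h;
};

void grid_insert(Grid& grid, int element, float x, float y)
{
    const int cell_x = static_cast<int>(x / grid.cell_w);
    const int cell_y = static_cast<int>(y / grid.cell_h);
    const int cell = cell_y * grid.num_cols + cell_x;

    // Push a new node to the front of the cell's singly-linked list.
    GridNode node;
    node.element = element;
    node.next = grid.cell_heads[cell];
    grid.nodes.push_back(node);
    grid.cell_heads[cell] = static_cast<int>(grid.nodes.size() - 1);
}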

Downside of a Fixed-Resolution Grid

The fixed resolution grid is a fantastic data structure for this problem if you ask me (my personal favorite for collision detection), but it does have some weaknesses.

Imagine you have a Lord of the Rings video game. Let's say many of your units are small units on the map like humans, orcs, and elves. However, you also have some gigantic units like dragons and ents.

Here a problem with the fixed resolution of the grid is that while your cell sizes might be optimal for storing those small units like humans and elves and orcs which occupy just 1 cell most of the time, the huge dudes like dragons and ents might want to occupy many cells, say, 400 cells (20x20). As a result we have to insert those big guys into many cells and store a lot of redundant data.

Also let's say you want to search a large rectangular region of the map for units of interest. In that case, you might have to check way more cells than theoretically optimal.

This is the main disadvantage of a fixed resolution grid*. We end up potentially having to insert large things and store them into far more cells than we should ideally have to store, and for large search areas, we may have to check far more cells than we should ideally have to search.

  • That said, putting aside the theory, you can often work with grids in very cache-friendly ways similar to image processing. As a result, while it has these theoretical disadvantages, in practice the simplicity and ease of implementing cache-friendly traversal patterns can make the grid a lot better than it sounds.

Quadtrees

So quadtrees are one solution to this problem. Instead of using a fixed resolution grid, so to speak, they adapt the resolution based on some criteria, while subdividing/splitting into 4 child cells to increase resolution. For example, we might say a cell should split if there are more than 2 elements in it. In that case, this:

[image: elements stored in a single large cell before subdividing]

Ends up becoming this:

[image: the same elements after quadtree subdivision, with no cell storing more than 2 elements]

And now we have a pretty nice representation where no cell stores more than 2 elements. Meanwhile let's consider what happens if we insert a huge dragon:

[image: a huge dragon element stored in a single large quadtree cell]

Here, unlike the fixed resolution grid, the dragon can just be inserted into one cell since the cell it occupies only has one element in it. Likewise if we search a big area of the map, we won't have to check so many cells unless there are many elements occupying the cells.

Implementation

So how do we implement one of these thingies? Well, it's a tree at the end of the day, and a 4-ary tree at that. So we start off with the notion of a root node with 4 children, but they're currently null/nil and the root is also a leaf at the moment:

[diagram: a root node with 4 null children; the root is currently a leaf]

Insertion

Let's start inserting some elements, and again for simplicity, let's say a node should split when it has more than 2 elements. So we'll insert an element, and when we insert an element we should insert it to the leaves (cells) in which it belongs. In this case we only have one, the root node/cell, and so we'll insert it there:

[diagram: the first element inserted into the root leaf]

... and let's insert another:

[diagram: a second element inserted]

... and yet another:

[diagram: a third element inserted]

And now we have more than 2 elements in a leaf node. It should now split. At this point we create 4 children to the leaf node (our root in this case) and then transfer the elements from the leaf being split (the root) into the appropriate quadrants based on the area/cell each element occupies in space:

[diagram: the root split into 4 quadrants with its elements transferred to them]

Let's insert another element, and again to the appropriate leaf in which it belongs:

[diagram: another element inserted into the appropriate leaf]

... and another:

[diagram: yet another element inserted]

And now we have more than 2 elements in a leaf again, so we should split it into 4 quadrants (children):

[diagram: the overflowing leaf split into 4 child quadrants]

And that's the basic idea. One of the things you might notice is that when we're inserting elements that aren't infinitesimally small points, they can easily overlap multiple cells/nodes.

As a result if we have many elements that overlap many boundaries between cells, they could end up wanting to subdivide a whole lot, possibly infinitely. To mitigate this issue, some people choose to split the element. If all you associate with an element is a rectangle, it's fairly straightforward to dice up rectangles. Other people might just put a depth/recursion limit on how much the tree can split. I tend to prefer the latter solution for collision detection scenarios between these two since I find it at least easier to implement more efficiently. However, there's another alternative: loose representations, and that'll be covered in a different section.

Also if you have elements right on top of each other, then your tree could want to split indefinitely even if you're storing infinitesimally small points. For example, if you have 25 points right on top of each other in space (a scenario I encounter rather often in VFX), then your tree will want to split indefinitely without a recursion/depth limit no matter what. As a result, to handle pathological cases, you might need a depth limit even if you do dice up elements.
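
To tie the walkthrough together, here's the insertion logic in the same rough pseudocode style as the grid examples above, using the split threshold of 2 from the diagrams plus a depth limit (this is a summary of the process, not the exact implementation from the first answer):

insert(element):
     find the leaves overlapped by element's rectangle (walking down from the root)
     for each such leaf:
     {
          add element to leaf
          if leaf now stores more than 2 elements and leaf.depth < max_depth:
          {
              create 4 children for the leaf (the leaf becomes a branch)
              transfer the leaf's elements into whichever children they overlap
          }
     }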

Removing Elements

Removing elements is covered in the first answer along with removing nodes to clean up the tree and remove empty leaves. But basically all we do to remove an element using my proposed approach is just descend down the tree to where the leaf/leaves in which the element is stored (which you can determine using its rectangle, e.g.), and remove it from those leaves.

Then to start removing empty leaf nodes, we use a deferred cleanup approach covered in the original answer.

Conclusion

I'm running short on time but will try to come back to this one and keep improving the answer. If people want an exercise, I would suggest implementing a plain old fixed-resolution grid, and see if you can get it down to where each cell is just a 32-bit integer. First understand the grid and its inherent problems before considering the quadtree, and you may be just fine with the grid. It might even provide you the most optimal solution depending on how efficiently you can implement a grid vs. a quadtree.

Edit: Fine/Coarse Grids and Row Optimizations

I got lots of questions on this so I'll go briefly into it. It's actually incredibly dumb-simple and might disappoint people who thought it was some fancy thing. So let's start with a simple fixed-resolution grid representation. I'll use pointers here to make it as simple as possible (although I recommend using an array and indices later on to have more control over memory use and access patterns).

// 'Element' stands in for whatever per-object data your simulation stores.
struct Element;

// Stores a node in a grid cell.
struct Node
{
    // Stores a pointer to the next node or null if we're at the end of
    // the list.
    Node* next = nullptr;

    // Stores a pointer to the element in the cell.
    Element* element = nullptr;
};

// Stores a 1000x1000 grid. Hard-coded for simple illustration.
Node* grid[1000][1000] = {};

As covered in the other answers, fixed-resolution grids are actually much more decent than they look even if they seem so dumb compared to tree-based solutions with variable resolution. However, they do come with a drawback: if we want to search a large area (like a huge circular or rectangular region), they have to loop through many grid cells. So we can mitigate that cost by storing a coarser grid:

// Stores a lower-resolution 500x500 parent grid (can use bitsets instead
// of bools). Stores true if one or more elements occupies the cell or
// false if the cell is empty.
bool parent_grid[500][500] = {};

// Stores an even lower-resolution 100x100 grid. It functions similarly
// to the parent grid, storing true if an element occupies the cell or
// false otherwise.
bool grandparent_grid[100][100] = {};

We can keep going and you can adjust the resolutions and the number of grids you use as you see fit. By doing this, when we want to search a large area, we start off checking the grandparent grid before we search the parent grid, and the parent before we check the full-resolution grid. We only proceed if the cells are not completely empty. This can help us exclude a whole bunch of cells to check from the highest-resolution grid in lots of cases involving large search areas.

That's really all there is to it. It does require storing all the highest-resolution cells in memory unlike a quadtree, but I've always found it much easier to optimize since we don't have to chase pointers/indices around to traverse the children of each tree node. Instead, we just do array lookups with access patterns that are very cache-friendly.
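
As a sketch of what a coarse-to-fine query might then look like on top of those arrays (the 10x ratio comes from 1000/100 here; descent into the parent and full-resolution grids is left as comments):

// Sketch: a large rectangular query, with l/t/r/b given in full-resolution
// cell coordinates (0..999). Each grandparent cell covers a 10x10 block of
// full-resolution cells, so empty blocks can be skipped wholesale.
void query_rect(int l, int t, int r, int b)
{
    for (int gy = t / 10; gy <= b / 10; ++gy)
    {
        for (int gx = l / 10; gx <= r / 10; ++gx)
        {
            // The entire 10x10 block of fine cells is unoccupied: skip it.
            if (!grandparent_grid[gy][gx])
                continue;

            // Otherwise descend: check the corresponding 500x500 parent cells
            // the same way, and finally walk the node lists of the occupied
            // full-resolution cells that intersect [l, t, r, b].
            // ...
        }
    }
}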

Row-Based Optimizations

So row-based optimization is also dumb simple (although it only applies when we use arrays and indices instead of pointers to nodes*).

  • Or custom memory allocators, but I really don't recommend using them for most purposes as it's quite unwieldy to have to deal with allocator and data structure separately. It is so much simpler for linked structures to emulate the effect of bulk allocation/deallocation and contiguous access patterns by just storing/reallocating arrays (ex: std::vector) and indices into them. For linked structures in particular and given that we now have 64-bit addressing, it's especially helpful to cut down the size of links to 32-bits or less by turning them into indices into a particular array unless you actually need to store more than 2^32-1 elements in your data structure.

If we imagine a grid like this:

[image: a grid whose cell lists point into one shared node array, with a cell's elements scattered far apart in memory]

... because we are storing all the nodes in one list structure for the entire grid, the memory stride to get from one element in a cell to the next could be quite large, causing us to skip around a lot in memory and incurring more cache misses as a result (and also loading more irrelevant data into cache lines).

We could eliminate that completely by storing a whole separate node list for every single cell (in which case all the elements in a cell could be stored perfectly contiguously), but that could be quite explosive in memory use and very inefficient in other ways. So the balance is to just store a separate node list per row. We improve spatial locality that way without having an explosive number of separate lists since there aren't that many rows compared to the total number of cells (rows * columns). Another thing you can do when you do that is that when a row is completely empty, you can deallocate even the memory for the grid cells for that entire row and just turn the entire row into a null pointer.

Finally, this enables more opportunities to parallelize insertion/removal/access, since you can guarantee that it's safe provided no two threads are modifying/accessing the same row at one time (something that is often quite easy to ensure).
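
To make the row idea concrete, here's a sketch (my own naming) where each row owns its cell heads and its own node pool, and a completely empty row is just a null pointer; a thread can then safely own insertions and removals for a disjoint set of rows:

#include <memory>
#include <vector>

// Sketch: row-based grid storage. Elements in the same row stay close together
// in memory, and an empty row costs nothing but a null pointer.
struct GridRow
{
    // One head index per cell in this row; -1 means the cell is empty.
    std::vector<int> cell_heads;

    // Nodes for all elements whose positions fall in this row (same
    // {next, element} node as in the earlier fixed-grid sketch).
    std::vector<GridNode> nodes;

    int num_elements = 0;
};

struct RowGrid
{
    // rows[y] is null when row y contains no elements at all.
    std::vector<std::unique_ptr<GridRow>> rows;
    int num_cols = 0;
};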

Lucaslucca answered 20/1, 2018 at 11:23 Comment(7)
For the grid option, how do you store each cell in only 4 bytes? 4 bytes would be used for the pointer to the next element in the cell, and what would be left for info on the element itself? Or are you saying that the cell points to an element in the linked list, but then the element stores the extra info, and the cell is just "dumb?"Apophyge
@jdk1.0 The cell itself can just store a 32-bit integer to the index of the first element and doesn't have to store the element data (it's just a NxM table of 32-bit integers). The element then stores an integer along with whatever data it needs to point to the next (it will need more than 4 bytes even using 32-bit integers for the links to store the element data). So like 100x100 empty cells would take 10,000 integers or 40k bytes. This is just talking empty cells instead of occupied cells. You can visualize it as a 100x100 table of 32-bit links to the "head" of the list.Lucaslucca
@jdk1.0 The grid cells don't need to store element data. It can just be a pointer/link to the first element making empty cells store only 4 bytes (assuming 32-bit integers). They don't have to store the element data. That could be stored in an occupied "Node" as opposed to a "Cell". Especially with huge grids, a good fraction of your grid cells will likely be unoccupied in common cases. So it is often helpful to avoid making them store the data of the first element they occupy...Lucaslucca
@jdk1.0 Instead just store a number (or address -- 8 bytes if 64-bit pointers, but I recommend 32-bit integers for maxing efficiency without typically running into the upper limits) in cells. Distinguish the notion of "cell" from "node". Nodes cannot be empty. Cells can be -- they are just pointers. And "pointers" can be 32-bit integers indexing an array.Lucaslucca
@jdk1.0 You can actually use less than 32-bits as well, like 24-bits if you don't need more than 2^24-1 nodes (minus one since we need a constant value to indicate that a cell is empty or that we've reached the end of its list -- the integer equivalent of a null pointer), or even 16-bits if you don't need more than 2^16-1.Lucaslucca
Oof what a big answer.Conversationalist
How should I modify the cell and node to handle cases where a node occupies more than one cell? This is not rare in the real world, where each node has some non-zero size. And how do I quickly remove a node if it resides in several cells simultaneously? It is not clear from your explanation.Bresnahan
L
56

5. Loose/Tight Double-Grid With 500k Agents

enter image description here

Above is a little GIF showing 500,000 agents bouncing off of each other every time step using a new "loose/tight grid" data structure I was inspired to create after writing the answer about loose quadtrees. I'll try to go over how it works.

It is the best-performing data structure so far among all the ones I've shown and implemented (though that could just be me), handling half a million agents better than the initial quadtree handled 100k, and better than the loose quadtree handled 250k. It also requires the least amount of memory and has the most stable memory use among the three. This is all still running in just one thread, with no SIMD code and none of the fancy micro-optimizations I typically apply for production code -- just a straightforward implementation from a couple hours of work.

I also reduced the drawing bottlenecks without improving my rasterization code whatsoever. That's because the grid lets me traverse it in a way that's cache-friendly for image processing (drawing the elements in the cells of the grid one by one coincidentally leads to very cache-friendly access patterns when rasterizing).

Funnily enough it also took the shortest time for me to implement (only 2 hours while I spent 5 or 6 hours on the loose quadtree), and it also requires the least amount of code (and arguably has the simplest code). Though that might just be because I've accumulated so much experience implementing grids.

Loose/Tight Double Grid

So I covered how grids worked in the fundamentals section (see part 2), but this is a "loose grid". Each grid cell stores its own bounding box that is allowed to shrink as elements are removed and grow as elements are added. As a result each element only needs to be inserted once into the grid based on which cell its center position lands inside, like so:

// Ideally use multiplication here with inv_cell_w or inv_cell_h.
int cell_x = clamp(floor(elt_x / cell_w), 0, num_cols-1);
int cell_y = clamp(floor(elt_y / cell_h), 0, num_rows-1);
int cell_idx = cell_y*num_cols + cell_x;
// Insert element to cell at 'cell_idx' and expand the loose cell's AABB.

Cells store elements and AABBs like this:

struct LGridLooseCell
{
    // Stores the index to the first element using an indexed SLL.
    int head;

    // Stores the extents of the grid cell relative to the upper-left corner
    // of the grid which expands and shrinks with the elements inserted and 
    // removed.
    float l, t, r, b;
};

However, loose cells pose a conceptual problem. Given that they have these variable-sized bounding boxes which can grow huge if we insert a huge element, how do we avoid checking every single freaking cell of the grid when we want to find out which loose cells and corresponding elements intersect a search rectangle? There could be a case where we're searching the upper-right corner of the loose grid but there's a cell in the lower-left on the opposite side which has grown large enough to intersect that area as well. Without a solution to this problem, we're reduced to having to check all the loose cells in linear time for matches.

... and the solution is that this is actually a "double grid". The loose grid cells themselves are partitioned into a tight grid. When we do the analogous search described above, we first look through the tight grid like so:

tx1 = clamp(floor(search_x1 / cell_w), 0, num_cols-1);
tx2 = clamp(floor(search_x2 / cell_w), 0, num_cols-1);
ty1 = clamp(floor(search_y1 / cell_h), 0, num_rows-1);
ty2 = clamp(floor(search_y2 / cell_h), 0, num_rows-1);

for ty = ty1, ty2:
{
    trow = ty * num_cols
    for tx = tx1, tx2:
    {
        tight_cell = tight_cells[trow + tx];
        for each loose_cell in tight_cell:
        {
            if loose_cell intersects search area:
            {
                for each element in loose_cell:
                {
                    if element intersects search area:
                        add element to query results
                }
            }
        }
    }
}

Tight cells store a singly-linked index list of loose cells, like so:

struct LGridLooseCellNode
{
    // Points to the next loose cell node in the tight cell.
    int next;

    // Stores an index to the loose cell.
    int cell_idx;
};

struct LGridTightCell
{
    // Stores the index to the first loose cell node in the tight cell using 
    // an indexed SLL.
    int head;
};

And voila, that's the basic idea of the "loose double-grid". When we insert an element, we expand the loose cell's AABB just as we do for a loose quadtree, only in constant-time instead of logarithmic time. However, we also insert the loose cell to the tight grid in constant time based on its rectangle (and it could be inserted into multiple cells).
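
In code, an insertion could look roughly like this (just a sketch: LGrid, clamp, min/max, push_to_loose_cell, and insert_to_tight_cells are hypothetical helpers for illustration, and the actual pastebin code may organize the tight-grid update differently):

// Insert an element given its center (ex, ey) and half-size (hw, hh).
void lgrid_insert(LGrid* grid, int element, float ex, float ey, float hw, float hh)
{
    // 1. Pick exactly one loose cell from the element's center point.
    const int cell_x = clamp((int)(ex * grid->inv_loose_cell_w), 0, grid->num_loose_cols-1);
    const int cell_y = clamp((int)(ey * grid->inv_loose_cell_h), 0, grid->num_loose_rows-1);
    LGridLooseCell* lcell = &grid->loose_cells[cell_y*grid->num_loose_cols + cell_x];

    // 2. Push the element to the loose cell's indexed SLL and expand the loose
    //    cell's AABB by the element's AABB (an empty loose cell starts with an
    //    "inverted" AABB so the first min/max snaps it to the element).
    push_to_loose_cell(grid, lcell, element);
    const float old_l = lcell->l, old_t = lcell->t, old_r = lcell->r, old_b = lcell->b;
    lcell->l = min(lcell->l, ex - hw);
    lcell->t = min(lcell->t, ey - hh);
    lcell->r = max(lcell->r, ex + hw);
    lcell->b = max(lcell->b, ey + hh);

    // 3. If the loose cell's rectangle grew, make sure it's linked into every
    //    tight cell it now overlaps so queries through the tight grid find it
    //    (a real implementation has to avoid double-linking -- e.g. only add
    //    the newly covered tight cells or rebuild those links once per frame).
    if (lcell->l < old_l || lcell->t < old_t || lcell->r > old_r || lcell->b > old_b)
        insert_to_tight_cells(grid, lcell);
}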

The combo of these two (tight grid to rapidly find loose cells, and loose cells to rapidly find elements) gives a very lovely data structure where each element is inserted into a single cell with constant time searches, insertion, and removals.

The only big downside I see is that we do have to store all these cells and potentially still have to search more cells than we need, but they're still reasonably cheap (20 bytes per cell in my case) and it's easy to traverse the cells on searches in a very cache-friendly access pattern.

I recommend giving this idea of "loose grids" a try. It is arguably a lot easier to implement than quadtrees and loose quadtrees and, more importantly, easier to optimize, since it immediately lends itself to a cache-friendly memory layout. As a super cool bonus, if you can anticipate the number of agents in your world in advance, its memory use is almost perfectly stable right from the start, since an element always occupies exactly one cell and the total number of cells is fixed (they don't subdivide/split). The only minor instability in memory use is that those loose cells might expand and get inserted into additional tight cells in the coarser grid every now and then, yet this should be quite infrequent. As a result, the memory use is very stable/consistent/predictable, and so, often, are the corresponding frame rates. That can be a huge bonus for certain hardware and software where you want to pre-allocate all memory in advance.

It's also really easy to make it work with SIMD to do multiple coherent queries simultaneously with vectorized code (in addition to multithreading), since the traversal, if we can even call it that, is flat (it's just a constant-time lookup into a cell index which involves some arithmetic). As a result it's fairly easy to apply optimization strategies similar to ray packets that Intel applies to their raytracing kernel/BVH (Embree) to test for multiple coherent rays simultaneously (in our case they would be "agent packets" for collision), except without such fancy/complex code since grid "traversal" is so much simpler.

On Memory Use and Efficiency

I covered this a bit in part 1 on efficient quadtrees, but reducing memory use is often the key to speed these days since our processors are so fast once you get the data into, say, L1 or a register, but DRAM access is relatively so, so slow. We have so precious little fast memory still even if we have an insane amount of slow memory.

I think I'm kind of lucky to have started off at a time when we had to be very frugal with memory use (though not as much as the people before me), when even a megabyte of DRAM was considered amazing. Some of the things I learned back then and habits I picked up (even though I'm far from an expert) coincidentally align with performance. Some of it I've had to discard as bad habits which are counter-productive these days, and I've learned to embrace waste in areas where it doesn't matter. A combination of the profiler and tight deadlines helps keep me productive without ending up with priorities that are too out of whack.

So one general piece of advice I'd offer on efficiency in general, not just spatial indexes used for collision detection, is watch that memory use. If it's explosive, chances are that the solution won't be very efficient. Of course there's a grey zone where using a bit more memory for a data structure can substantially reduce processing to the point where it's beneficial only considering speed, but a lot of times reducing the amount of memory required for data structures, especially the "hot memory" that is accessed repeatedly, can translate quite proportionally to a speed improvement. All the least efficient spatial indexes I encountered in my career were the most explosive in memory use.

It's helpful to look at the amount of data you need to store and calculate, at least roughly, how much memory it should ideally require. Then compare it to how much you actually require. If the two are worlds apart, then you'll likely get a decent boost trimming down the memory use because that'll often translate to less time loading chunks of memory from the slower forms of memory in the memory hierarchy.
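
As a made-up example of that kind of back-of-the-envelope math: if each element needs a 4-float AABB (16 bytes) plus a 4-byte ID and a 4-byte link, that's 24 bytes per element, so 100,000 elements should sit somewhere around 2.4 megabytes plus whatever the cells/nodes add on top. If the spatial index for the same data is reporting, say, 60 megabytes, that ~25x gap is usually where the easiest wins are hiding.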

Lucaslucca answered 22/1, 2018 at 14:54 Comment(22)
This is really interesting. Would you be willing to share your implementation for reference? I'd like to try building something like this, but I'm confused on some of the details.Syllogism
@Syllogism Sorry for taking so long to reply. Got busy and haven't used SO in quite a while. If you're still interested, I pasted a C++ implementation (though I wrote it mostly C-style) of this loose/double grid structure: pastebin.com/t2MyZtKX I set the paste expiration to a year so hopefully you can still access it whenever you get this message.Lucaslucca
@Syllogism The fundamental idea though is first you take the simple old grid structure and make its cells store a bounding box that fits to encompass the element inserted to the cell (and we always insert an element to only one cell based on its center point). That makes our "tight grid" into a "loose grid". The problem here is that if you want to find what cells are contained in a search query area, we'd have to check every cell in the entire grid now since they're all variable-sized...Lucaslucca
@Syllogism ... to see which cells overlap a search area. So that's where we solve the problem by then partitioning the loose grid cells themselves into another data structure (a tight grid in our case, though it could also be a quadtree or anything you want). That second data structure then makes it so we can quickly discover what loose cells are in a search area, then check inside the loose cells for elements within that area.Lucaslucca
@DragonEnergy Any chance of getting a peek at your SmallList.hpp file as well? I've tried to implement something like it but had various issues (notably the performance of operator[])Gaelan
structs SmallList and FreeList are missing in the pastebin, I guess they are similar to the IntList introduced by @DragonEnergy some posts below. It would be great to have access to SmallList.hpp though!Debidebilitate
@DragonEnergy Love the write ups in these answers. Can you confirm if the SmallList and FreeList are similar to IntList or provide these .hpp sources? I would love to play around with some of these implementations.Interrogative
@DragonEnergy is like the god of efficient grids and trees. If they had a github account, it would be the bible of particle engines. I'm glad I got the pastebin code a week before it expired! I'm thinking of using a vector<> instead of the SmallList for now to see if it works for my use case without that optimization.Apophyge
@DragonEnergy do you think you could post the code on GitHub? The Paste Bin link has expired and I am really interested to see how you actually implemented this in code.Syllabify
Me too, I'm also interested to port something iike this to javascript.Acetylate
@jdk1.0 @Acetylate @Debidebilitate Apologies, it's been quite a while and a new comp since I was actively working on this code. I still have an old copy if you guys like, or I can just send it to you. I'm really lazy about this hosting stuff. As for SmallList vs. std::vector in C++, I still find a huge benefit to using the SBO optimization when the spatial index queries return small results (which tends to be the common case for me). I don't think the C++11+ guys would have added SBO optimizations to std::string otherwise. A small vector is glaringly missing from the standard C++ lib IMHO.Lucaslucca
@jdk1.0 My suspicion for why small/tiny vectors are missing in the C++ standard lib is that they're very difficult to generalize with r-values, swaps, and move ctors given that the generic type T can be a non-trivially copyable/destructible user-defined type. I tend to implement them in a de-generalized C-like way using memcpy and such to implement such ctors... but they aren't generalized enough to be standard lib material. But they're very effective optimizations. With std::string SBO optimizations are reasonable to apply since string characters are always PODs.Lucaslucca
Also if not my implementation of SmallList, there are various open source implementations of tiny_vector (including one used by Facebook) which uses the same SBO optimization and is compliant to standard sequence interfaces. I do find it a big win in these contexts still, and that's through repeated measurements. When we can use stack over heap, it can be a huge win since stack memory is often "hot" and loaded already in a cache line due to its simple LIFO access patterns and temporal locality..Lucaslucca
@jdk1.0 ... that said, if you use std::vector, what I'd try after if you notice the query functions showing up as hotspots is to pass in a reference to a vector as a parameter and fill it out instead of returning it by value. Then you can reuse the same vector (clear method doesn't deallocate its memory). Then you should get the temporal locality and performance comparable to the stack.Lucaslucca
@Syllabify I dug up the code on my old machine and pasted it again in PasteBin: pastebin.com/BKNWFpVW. I hope you'll forgive my lazy paste over a GitHub account. I don't want to bother with thinking about more accounts and stuff on top of what I already use for work. :-D But feel free to grab the code and put it on GitHub if you like. And adapt it all you like and use it in commercial applications, take full credit, whatever. I don't mind. It's just a couple hours of work and I like seeing people do cool things and I prefer to be in the background.Lucaslucca
If using the solution generates any cool insight/ideas, I'm always down to exchange ideas. I get a kick out of it and I'm always looking to improve upon my own solutions. Also I'm just really happy if I can just encounter fewer spatial index implementations in the future which take explosive amounts of DRAM and make frame rates plummet in some cases. That would be a cool enough reward for me and a massive savings in terms of the time I spent having to debug and replace such solutions. I did write that code in a weird procedural style which I never normally do in C++...Lucaslucca
... it was a goofy thing I was trying to do to make it as easy as possible to port to other languages. It was maybe a dumb idea in retrospect and I should have focused more on communicating the broad concept instead of making the code as portable as I can.Lucaslucca
@Syllabify OMG. I dug up the wrong code. That's a uniform-sized grid from the dirty tricks section. I'm getting kinda senile these days turning 40 but it's also why I like to just give away this stuff. I want to see what young and bright minds can do with it. Here's the correct link to the loose grid thingie: pastebin.com/swvwTgnd. That's a newer version than what I used originally with some SIMD instructions and a parallel OMP loop I was trying to make it handle even more agents. Hopefully, it's still easy enough to adapt/understand/rewrite however you like.Lucaslucca
If I find a bit of spare time, I'll try to dig up the original scalar version I used for the answer which is easier to understand and document it more, maybe ditch the goofy procedural style and turn it into a proper object with methods. Hopefully this can get the determined ones jumpstarted to having a solution to use, and I recommend using this answer and the pseudocode as documentation to help comprehend the source code itself. It does perform pretty well in my experience -- very often professional particle self-collision solutions are limited to tens of thousands of particles in realtime.Lucaslucca
@DragonEnergy I'd definitely be interested in seeing the original scalar version. I'm trying to learn C++ by porting a game server I wrote in Go, and I'm getting a bit lost in the SIMD code.Jaquesdalcroze
For anyone interested, I was the checking the code in the pastebin to get some inspiration on a slightly different implementation, and I noticed there's a typo/bug in line 574, where the operation should be 'my - hy', but instead is 'my - hx'.Spurn
@JoãoHaas Awesome catch and thanks! Sorry about that goof.Lucaslucca
L
45

4. Loose Quadtree

All right, I wanted to take some time to implement and explain loose quadtrees, since I find them very interesting and possibly even the most well-balanced for the widest variety of use cases involving very dynamic scenes.

So I ended up implementing one last night and spent some time tweaking and tuning and profiling it. Here's a teaser with a quarter million dynamic agents, all moving and bouncing off of each other every time step:

enter image description here

The frame rates start to suffer when I zoom out to look at all quarter million agents along with all the bounding rectangles of the loose quadtree, but that's mainly due to bottlenecks in my drawing functions. They start to become hotspots if I zoom out to draw everything on the screen all at once and I didn't bother to optimize them at all. Here's how it works on a basic level with very few agents:

enter image description here

Loose Quadtree

All right, so what are loose quadtrees? They're basically quadtrees whose nodes are not perfectly split down the center into four even quadrants. Instead their AABBs (bounding rectangles) could overlap and be larger or often even be smaller than what you'd get if you split a node perfectly down the center into 4 quadrants.

So in this case we absolutely have to store the bounding boxes with each node, and so I represented it like this:

struct LooseQuadNode
{
    // Stores the AABB of the node.
    float rect[4];

    // Stores the negative index to the first child for branches or the 
    // positive index to the element list for leaves.
    int children;
};

This time I tried using single-precision floating point to see how well it performs, and it did a very decent job.

What's the Point?

All right, so what's the point? The main thing you can exploit with a loose quadtree is that you can treat each element you insert to the quadtree like a single point for the sake of insertion and removal. Therefore an element is never inserted into more than one leaf node in the entire tree since it's treated like an infinitesimally small point.

However, as we insert these "element points" into the tree, we expand the bounding boxes of each node we insert to in order to encompass the element's boundaries (the element's rectangle, e.g.). That allows us to find these elements reliably when we do a search query (ex: searching for all elements that intersect a rectangle or circle area).
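
As a sketch of that insertion logic (is_leaf, element_rect, push_element, maybe_split, and child_for_point are hypothetical helpers for illustration, not the exact code I used):

// Insert one element, expanding the AABB of every node along the path.
void lqt_insert(LooseQuadTree* tree, int node_index, int depth, int element)
{
    LooseQuadNode* node = &tree->nodes[node_index];
    const float* er = element_rect(tree, element);   // [left, top, right, bottom]

    // Expand this node's AABB to encompass the element's AABB.
    if (er[0] < node->rect[0]) node->rect[0] = er[0];
    if (er[1] < node->rect[1]) node->rect[1] = er[1];
    if (er[2] > node->rect[2]) node->rect[2] = er[2];
    if (er[3] > node->rect[3]) node->rect[3] = er[3];

    if (is_leaf(tree, node_index))
    {
        // Leaves store the element exactly once; split on overflow as usual.
        push_element(tree, node_index, element);
        maybe_split(tree, node_index, depth);
    }
    else
    {
        // Descend into exactly one quadrant chosen from the element's center,
        // just like a point quadtree -- the element is never duplicated.
        const float cx = (er[0] + er[2]) * 0.5f;
        const float cy = (er[1] + er[3]) * 0.5f;
        lqt_insert(tree, child_for_point(tree, node_index, cx, cy), depth+1, element);
    }
}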

Pros:

  • Even the most gigantic agent only has to be inserted into one leaf node and will take no more memory than the smallest. As a result it is well-suited for scenes with elements whose sizes vary wildly from one to the next, and that's what I was stress testing in the 250k agent demo above.
  • Uses less memory per element, especially huge elements.

Cons:

  • While this speeds up insertion and removal, it inevitably slows down searches into the tree. What was previously a few basic comparisons with a center point of a node to determine which quadrants to descend into turns into a loop having to check each rectangle of each child to determine which ones intersect a search area.
  • Uses more memory per node (5x more in my case).

Expensive Queries

This first con would be pretty horrible for static elements since all we do is build the tree and search it in those cases. And I found with this loose quadtree that, in spite of spending a few hours tweaking and tuning it, there is a huge hotspot involved in querying it:

enter image description here

That said, this is actually my "personal best" implementation of a quadtree so far for dynamic scenes (though keep in mind that I favor hierarchical grids for this purpose and don't have that much experience using quadtrees for dynamic scenes) in spite of this glaring con. And it's because for dynamic scenes at least, we have to constantly move elements every single time step, and so there's a whole lot more to do with the tree than just querying it. It has to be updated all the time, and this actually does a pretty decent job at that.

What I like about the loose quadtree is that you can feel safe even if you have a boatload of massive elements in addition to a boatload of the teeniest elements. The massive elements won't take any more memory than the small ones. As a result if I was writing a video game with a massive world and wanted to just throw everything into one central spatial index to accelerate everything without bothering with a hierarchy of data structures as I usually do, then loose quadtrees and loose octrees might be perfectly balanced as the "one central universal data structure if we're going to be using just one for an entire dynamic world".

Memory Use

In terms of memory use, while elements take less memory (especially massive ones), nodes take quite a bit more compared to my implementations where nodes don't even need to store an AABB. I found overall in a variety of test cases, including ones with many gigantic elements, that the loose quadtree tends to take slightly more memory with its beefy nodes (~33% more as a rough estimate). That said, it is performing better than the quadtree implementation in my original answer.

On the plus side, the memory use is more stable (which tends to translate to more stable and smooth frame rates). My original answer's quadtree took around 5+ seconds before the memory usage became perfectly stable. This one tends to become stable just a second or two after starting up, most likely because elements never have to be inserted more than once (even small elements could be inserted twice in my original quadtree if they were overlapping two or more nodes at the boundaries). As a result the data structure quickly discovers the required amount of memory to allocate for all cases, so to speak.

Theory

So let's cover the basic theory. I recommend implementing and understanding a regular quadtree first before moving to loose versions, as they are a bit harder to implement. When we start out with an empty tree, you can imagine it as also having an empty rectangle.

Let's insert one element:

enter image description here

Since we only have a root node at the moment which is also a leaf, we just insert it to that. Upon doing this, the previously empty rectangle of the root node now encompasses the element we inserted (shown in red dotted lines). Let's insert another:

enter image description here

We expand the AABBs of the nodes we traverse as we're inserting (this time just the root) by the AABBs of the elements we're inserting. Let's insert another, and let's say nodes should split when they contain more than 2 elements:

enter image description here

In this case we have more than 2 elements in a leaf node (our root), so we should split it into 4 quadrants. This is pretty much the same as splitting a regular point-based quadtree except we, again, expand bounding boxes as we transfer children. We begin by considering the center position of the node being split:

enter image description here

Now we have 4 children under our root node and each one also stores its own tight-fitting bounding box (shown in green). Let's insert another element:

enter image description here

Here you can see that inserting this element not only expanded the rectangle of the lower-left child, but also the root (we expand all AABBs along the path we're inserting). Let's insert another:

enter image description here

In this case we have 3 elements again in a leaf node, so we should split:

enter image description here

Just like so. Now what about that circle on the lower left? It appears to be intersecting 2 quadrants. However, we only consider one point of the element (ex: its center) to determine the quadrant to which it belongs. So that circle actually only gets inserted to the lower-left quadrant.

However, the lower-left quadrant's bounding box is expanded to encompass its extents (shown in cyan, and I hope you guys don't mind, but I changed the BG color since it was getting hard to see the colors), and so the AABBs of the nodes at level 2 (shown in cyan) actually spill into each other's quadrants.

The fact that each quadrant stores its own rectangle, which is always guaranteed to encompass its elements, is what allows us to insert an element into just one leaf node even if its area intersects multiple nodes. We expand the leaf node's bounding box instead of inserting the element into multiple nodes.

Updating AABBs

So this might lead to the question, when are AABBs updated? If we only expand AABBs upon inserting elements, they'll just tend to grow bigger and bigger. How do we shrink them when the elements are removed? There are many ways to tackle this, but I do it by updating the bounding boxes of the entire hierarchy in that "cleanup" method described in my original answer. That seems to be fast enough (doesn't even show up as a hotspot).
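
The refitting part of that cleanup can be as simple as a postorder walk, something like this sketch (again with hypothetical helpers; the real cleanup also collapses empty leaves as described in the first answer):

// Recompute the AABB of every node bottom-up so the boxes shrink back down
// after removals. FLT_MAX comes from <float.h>.
void lqt_refit(LooseQuadTree* tree, int node_index)
{
    LooseQuadNode* node = &tree->nodes[node_index];
    if (is_leaf(tree, node_index))
    {
        // Refit the leaf to the elements it still contains
        // (or to an "inverted" empty box if it holds nothing).
        compute_element_bounds(tree, node_index, node->rect);
    }
    else
    {
        // Refit the children first, then take the union of their boxes.
        float l = FLT_MAX, t = FLT_MAX, r = -FLT_MAX, b = -FLT_MAX;
        for (int j = 0; j < 4; ++j)
        {
            const int child = first_child(tree, node_index) + j;
            lqt_refit(tree, child);
            const float* cr = tree->nodes[child].rect;
            if (cr[0] < l) l = cr[0];
            if (cr[1] < t) t = cr[1];
            if (cr[2] > r) r = cr[2];
            if (cr[3] > b) b = cr[3];
        }
        node->rect[0] = l; node->rect[1] = t; node->rect[2] = r; node->rect[3] = b;
    }
}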

Comparing to Grids

I still can't seem to implement this nearly as efficiently for collision detection as my hierarchical grid implementations, but again that might just be more about me than the data structure. The main difficulty I find with tree structures is easily controlling where everything is in memory and how it is accessed. With the grid you can make sure that all columns of a row are contiguous and laid out sequentially, for example, and make sure you access them in a sequential fashion along with the elements contiguously stored in that row. With a tree, the memory access tends to be a bit sporadic just by nature and also tends to degrade rapidly as trees want to transfer elements a lot more often as nodes get split into multiple children. That said, if I wanted to use a spatial index which was a tree, I'm really digging these loose variants so far, and ideas are popping in my head for implementing a "loose grid".

Conclusion

So that's loose quadtrees in a nutshell: they basically use the insertion/removal logic of a regular quadtree that just stores points, except that they expand/update AABBs along the way. For searching, we end up traversing all child nodes whose rectangles intersect our search area.
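
A range query then looks something like this (another sketch with hypothetical helpers for the element list and the output):

// Collect every element whose AABB intersects the query rect q = [l, t, r, b].
void lqt_query(const LooseQuadTree* tree, int node_index, const float q[4], QueryResults* out)
{
    const LooseQuadNode* node = &tree->nodes[node_index];

    // Prune this whole subtree if its stored AABB misses the query rect.
    if (!rects_intersect(node->rect, q))
        return;

    if (is_leaf(tree, node_index))
    {
        // Test the elements stored in this leaf.
        for (int e = first_element(tree, node_index); e != -1; e = next_element(tree, e))
            if (rects_intersect(element_rect(tree, e), q))
                push_result(out, e);
    }
    else
    {
        // Unlike a regular quadtree we can't pick quadrants from a split center;
        // each child's own rectangle decides whether we descend (the recursive
        // call prunes children whose boxes miss the query).
        for (int j = 0; j < 4; ++j)
            lqt_query(tree, first_child(tree, node_index) + j, q, out);
    }
}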

I hope people don't mind me posting so many lengthy answers. I'm really getting a kick out of writing them and it has been a useful exercise for me in revisiting quadtrees to attempt to write all these answers. I'm also contemplating a book on these topics at some point (though it'll be in Japanese) and writing some answers here, while hasty and in English, helps me kind of put everything together in my brain. Now I just need someone to ask for an explanation of how to write efficient octrees or grids for the purpose of collision detection to give me an excuse to do the same on those subjects.

Lucaslucca answered 22/1, 2018 at 6:26 Comment(23)
"I hope people don't mind me posting so many lengthy answers. I'm really getting a kick out of writing them and it has been a useful exercise for me in revisiting quadtrees to attempt to write all these answers. I'm also contemplating a book on these topics..." might I suggest starting a blog, instead? Seems like the perfect medium for the types of content you like to write.Spanjian
@Snowman I like the format of this site so much better! But also I'm a weird one -- I'm seeking to learn just as much through writing answers as asking questions because the former kind of seeks to validate what I've learned -- only thing that bugs me is when I pour a lot of time in an answer and get DVed without any comments, or if there is a very terse one but I try to start a discussion and things go silent... at which point I'm taking stabs in the dark and not learning anything. :-(Lucaslucca
@Snowman The main turn off to me of a blog is that it kind of puts me above the readers and commenters. I want to be on equal, flat footing, and see rival answers and see if I can learn a thing or two from other answers. It's the way the question starts buzzing activity that's lovely to me, and I prefer to just be one among many trying their best to provide a helpful solution. Like for this one, my dream was after explaining everything, someone else pitches in and shows how to implement quadtrees that handles collision detection among millions of agents [...]Lucaslucca
[...] I have never been able to get that much performance out of quadtrees for dynamic content, so I keep hoping for someone to pitch in with like an Ingo Wald-level solution that blows my mind. Though answering this question even without such answers lead to me learning something really new with this "loose/tight double grid" idea that popped into my head after trying to implement a loose quadtree and more deeply thinking about its pros and cons than I ever did before, all thanks to this question.Lucaslucca
not every site is ideal for every type of content presentation. SO is good for Q&A, reddit for discussions, blogs for whimsical "I want to discuss X in depth" essays. I'm just trying to help. You seem to know what you're talking about and have some good things to say, but have run into some friction, at least over on SoftwareEngineering.Spanjian
@Snowman Yeah, appreciate the effort! Actually over there something changed starting from around December. I just started accumulating downvotes left and right since then. I don't know if the quality of my answers has declined or maybe some people didn't like that I was bumping old posts (one person who wrote a short comment emphasized that the thread was 5 years old, so maybe he didn't like me resurrecting such an old question). I also started getting kind of unhinged since then, since I got mostly positive reception before -- like maybe a down-vote every couple of months, not every [...]Lucaslucca
@Snowman [...] single day. I think that also ended up creating this vicious cycle of downgrading my answers, because I was starting to get frustrated by the silence of the downvotes and started getting argumentative in my answers. So like one bad thing led to another, and I started getting so frustrated. But one person actually told me that my long-winded answers were lacking clarity and worthless as a result of their verbosity... then I looked at my most popular answers before, and they were all like this -- pages and pages long...Lucaslucca
@Snowman So I started getting all confused and upset and wondering whether any of my answers were any good. :-D Then I got drunk and started deleting them left and right and got suspended for that. And that irritated me even more, since I thought I was allowed to at least delete my non-accepted answers. Anyway, I just got really confused and lost over there, and mostly I couldn't understand the voting patterns and no longer felt like I could even evaluate the quality of my own answers. I used to be able to sort of tell what answers I thought would be well-received or not.Lucaslucca
@Snowman SE seems more negative lately though, or is it just me? I thought it was just me for a good while but I noticed some new questions where you got fine answers from people like Doc Brown and others, and then someone just decided to go on a drive-by DV spree and every single answer to that question has a down-vote without comment, and all those down-votes accumulated within like a 2-minute timespan for all the answerers. So it feels like some people are just down-voting things left and right on a whim. I have a bad habit of taking it personally since I pour hours into my answers.Lucaslucca
@Snowman As for Stackoverflow, my gripe with this site is that endless questions like this: #48601503... stay open, while ones like this get closed. Even if my answers here are worthless, the question is at least way more interesting IMO than the endless questions from people who still don't quite understand how pointers and basic memory allocation work. So I feel like this site has its priorities all screwed up about what questions to keep open.Lucaslucca
@Snowman Mostly I'm bored! I used to spend lots of time answering those quick technical questions: "You forgot to update this pointer, you should use std::string and make your life easier, etc," but I got so tired and bored of such basic questions. It actually irritated me when I got upvoted for my answers to those questions since they're so basic and the information can be found with any google search or book...Lucaslucca
@Snowman Meanwhile I love these kinds of super broad questions because they at least give me an excuse to share a lot of things not found so easily elsewhere. It's difficult to find really advanced questions related to spatial indexes, e.g., so the second best thing is a super basic question like, "How do you write an efficient quadtree, BVH, octree, k-d tree", etc? "How do raytracers work?" Things like this (though I might write a book in response). But they often get closed, and mostly all I'm left looking at are those quick technical questions I no longer have interest in answering.Lucaslucca
These are really insightful answers. I really think the hierarchical grid would work for me but I'm a bit lost implementation-wise. What is the difference between the loose grid struct at the beginning and the loose grid node at the end? I wish you would show more complete code to fill in the gaps.Rickrickard
@TeamUpvote, unfortunately you're not going to get the kind of reception you want from this site. I agree with Snowman, you should start a blog! I can relate to having a lot of brain energy and wanting to share info.Oletaoletha
@TeamUpvote Blogs can spark collaboration just like SE, but an even better setup would be to get the bulk of your ideas out in articles, and where you'd like to share them, link and summarize. That would increase traffic to your blog, and also let you cater your summary to the specific post you're linking your article to. I would certainly love to hear more about the broad topics you're interested in!Oletaoletha
@DragonEnergy Maybe you can start a QA-like site instead of dealing with the politics of SO. Your answers and insights are invaluable to those of us who are over the I-don't-know-how-to-pointer-good hurdle and are wondering how to implement more interesting data structures that would work in production code.Apophyge
@DragonEnergy The problem with SO is that people can't cite you properly. SO is not considered a sufficient source for most paperwork and some people might steal these ideas from you. These answers could have been a series of articles on their own, but instead you chose to post them on the wrong medium to help someone. I'm not trying to take anything from you, but it would have been wiser for science to post it elsewhere. How would you feel exactly if someone posted an official paper based on your work without citing you because they couldn't?Biodynamics
@JérémiPanneton I'm completely okay with people using my work uncredited. I want to see the industry improve overall. I like cool software! :-D But maybe it is true that this site isn't really ideal for my purposes. I am not into blogs -- I benefit more from questions to spark my rambling than just sharing out of the blue.Lucaslucca
@JérémiPanneton I'm mainly a programming geek, you know. I am one of those who bore their colleagues wanting to talk about our product's design and how to improve it at lunch breaks and dinner parties. I don't understand why they want to talk about other stuff when we just got through a day programming cool things. I'm kind of lonely that way (I do have a sort of social life but I really get into this stuff and it's my passion when I'm in the zone -- I put everything aside, sometimes to a fault -- I gotta make up for the time I lost on the computer with my wife on vacations).Lucaslucca
I understand your great intentions. The fact that you don't care about being credited is fine and really personal. However, if your goal was really to see the industry improve, it should have been possible to cite your work. That's how science and research work. Personally, I had to do research on ways to optimize space partitioning. On that subject, your SO posts were really interesting. However, like any research work, you have to cite your sources and I simply couldn't since it's not a reliable medium. Neither could I say that I had come up with this all by myself. So I just found myself stuck.Biodynamics
@JérémiPanneton I'm an oddball in that I started programming as a child and through bizarre circumstances (I had this weird obsession with filling hundreds and hundreds of pages of books with designs and drawings resembling video game manuals of games I wished could be made). My grandfather noticed that and got me an 80286, a compiler, and a bunch of programming books. But I didn't have internet or anyone to talk to about it when I was like 7 years old, so I started rolling lots of my own unorthodox solutions like this "loose-tight grid" thingie... Unfortunately, the industry favors orthodoxy...Lucaslucca
@JérémiPanneton It's hard to find people patient enough to listen to my ideas and solutions unless I impress them with results first. Then they become more receptive. But I'm so happy just to find anyone who is receptive. I'll chat their ears off with total excitement and so eager to share what I learned in this largely lonely journey of learning these things on my own. And maybe they can just copy my solutions and take credit for them and be cited instead. I'm cool with that. I like being anonymous and there's much more complex work I do than simple quadtrees and the like.Lucaslucca
Fair enough. Have a great day sir.Biodynamics
L
39

3. Portable C Implementation

I hope people don't mind another answer, but I ran up against the 30k character limit. I was thinking today about how my first answer wasn't very language-agnostic. I was talking about memory allocation strategies, class templates, etc., and not all languages allow such things.

So I spent some time thinking about an efficient implementation that's almost universally applicable (an exception would be functional languages). So I ended up porting my quadtree to C in a way such that all it needs are arrays of int to store everything.

The result isn't pretty but should work very efficiently on any language that allows you to store contiguous arrays of int. For Python there are libs like ndarray in numpy. For JS there are typed arrays. For Java and C#, we can use int arrays (not Integer, those are not guaranteed to be stored contiguously and they use a lot more mem than plain old int).

C IntList

So I use one auxiliary structure built on int arrays for the entire quadtree to make it as easy as possible to port to other languages. It combines a stack and a free list. This is all we need to implement everything talked about in the other answer in an efficient way.

#ifndef INT_LIST_H
#define INT_LIST_H

#ifdef __cplusplus
    #define IL_FUNC extern "C"
#else
    #define IL_FUNC
#endif

typedef struct IntList IntList;
enum {il_fixed_cap = 128};

struct IntList
{
    // Stores a fixed-size buffer in advance to avoid requiring
    // a heap allocation until we run out of space.
    int fixed[il_fixed_cap];

    // Points to the buffer used by the list. Initially this will
    // point to 'fixed'.
    int* data;

    // Stores how many integer fields each element has.
    int num_fields;

    // Stores the number of elements in the list.
    int num;

    // Stores the capacity of the array.
    int cap;

    // Stores an index to the free element or -1 if the free list
    // is empty.
    int free_element;
};

// ---------------------------------------------------------------------------------
// List Interface
// ---------------------------------------------------------------------------------
// Creates a new list of elements which each consist of integer fields.
// 'num_fields' specifies the number of integer fields each element has.
IL_FUNC void il_create(IntList* il, int num_fields);

// Destroys the specified list.
IL_FUNC void il_destroy(IntList* il);

// Returns the number of elements in the list.
IL_FUNC int il_size(const IntList* il);

// Returns the value of the specified field for the nth element.
IL_FUNC int il_get(const IntList* il, int n, int field);

// Sets the value of the specified field for the nth element.
IL_FUNC void il_set(IntList* il, int n, int field, int val);

// Clears the specified list, making it empty.
IL_FUNC void il_clear(IntList* il);

// ---------------------------------------------------------------------------------
// Stack Interface (do not mix with free list usage; use one or the other)
// ---------------------------------------------------------------------------------
// Inserts an element to the back of the list and returns an index to it.
IL_FUNC int il_push_back(IntList* il);

// Removes the element at the back of the list.
IL_FUNC void il_pop_back(IntList* il);

// ---------------------------------------------------------------------------------
// Free List Interface (do not mix with stack usage; use one or the other)
// ---------------------------------------------------------------------------------
// Inserts an element to a vacant position in the list and returns an index to it.
IL_FUNC int il_insert(IntList* il);

// Removes the nth element in the list.
IL_FUNC void il_erase(IntList* il, int n);

#endif

#include "IntList.h"
#include <stdlib.h>
#include <string.h>
#include <assert.h>

void il_create(IntList* il, int num_fields)
{
    il->data = il->fixed;
    il->num = 0;
    il->cap = il_fixed_cap;
    il->num_fields = num_fields;
    il->free_element = -1;
}

void il_destroy(IntList* il)
{
    // Free the buffer only if it was heap allocated.
    if (il->data != il->fixed)
        free(il->data);
}

void il_clear(IntList* il)
{
    il->num = 0;
    il->free_element = -1;
}

int il_size(const IntList* il)
{
    return il->num;
}

int il_get(const IntList* il, int n, int field)
{
    assert(n >= 0 && n < il->num);
    return il->data[n*il->num_fields + field];
}

void il_set(IntList* il, int n, int field, int val)
{
    assert(n >= 0 && n < il->num);
    il->data[n*il->num_fields + field] = val;
}

int il_push_back(IntList* il)
{
    const int new_pos = (il->num+1) * il->num_fields;

    // If the list is full, we need to reallocate the buffer to make room
    // for the new element.
    if (new_pos > il->cap)
    {
        // Use double the size for the new capacity.
        const int new_cap = new_pos * 2;

        // If we're pointing to the fixed buffer, allocate a new array on the
        // heap and copy the fixed buffer contents to it.
        if (il->cap == il_fixed_cap)
        {
            il->data = malloc(new_cap * sizeof(*il->data));
            memcpy(il->data, il->fixed, sizeof(il->fixed));
        }
        else
        {
            // Otherwise reallocate the heap buffer to the new size.
            il->data = realloc(il->data, new_cap * sizeof(*il->data));
        }
        // Set the old capacity to the new capacity.
        il->cap = new_cap;
    }
    return il->num++;
}

void il_pop_back(IntList* il)
{
    // Just decrement the list size.
    assert(il->num > 0);
    --il->num;
}

int il_insert(IntList* il)
{
    // If there's a free index in the free list, pop that and use it.
    if (il->free_element != -1)
    {
        const int index = il->free_element;
        const int pos = index * il->num_fields;

        // Set the free index to the next free index.
        il->free_element = il->data[pos];

        // Return the free index.
        return index;
    }
    // Otherwise insert to the back of the array.
    return il_push_back(il);
}

void il_erase(IntList* il, int n)
{
    // Push the element to the free list.
    const int pos = n * il->num_fields;
    il->data[pos] = il->free_element;
    il->free_element = n;
}

Using IntList

Using this data structure to implement everything doesn't yield the prettiest code. Instead of accessing elements and fields like this:

elements[n].field = elements[n].field + 1;

... we end up doing it like this:

il_set(&elements, n, idx_field, il_get(&elements, n, idx_field) + 1);

... which is disgusting, I know, but the point of this code is to be as efficient and portable as possible, not to be as easy to maintain as possible. The hope is that people can just use this quadtree for their projects without changing it or maintaining it.

Oh and feel free to use this code I post however you want, even for commercial projects. I would really love it if people let me know if they find it useful, but do as you wish.

C Quadtree

All right, so using the above data structure, here is the quadtree in C:

#ifndef QUADTREE_H
#define QUADTREE_H

#include "IntList.h"

#ifdef __cplusplus
    #define QTREE_FUNC extern "C"
#else
    #define QTREE_FUNC
#endif

typedef struct Quadtree Quadtree;

struct Quadtree
{
    // Stores all the nodes in the quadtree. The first node in this
    // sequence is always the root.
    IntList nodes;

    // Stores all the elements in the quadtree.
    IntList elts;

    // Stores all the element nodes in the quadtree.
    IntList enodes;

    // Stores the quadtree extents.
    int root_mx, root_my, root_sx, root_sy;

    // Maximum allowed elements in a leaf before the leaf is subdivided/split unless
    // the leaf is at the maximum allowed tree depth.
    int max_elements;

    // Stores the maximum depth allowed for the quadtree.
    int max_depth;

    // Temporary buffer used for queries.
    char* temp;

    // Stores the size of the temporary buffer.
    int temp_size;
};

// Function signature used for traversing a tree node.
typedef void QtNodeFunc(Quadtree* qt, void* user_data, int node, int depth, int mx, int my, int sx, int sy);

// Creates a quadtree with the requested extents, maximum elements per leaf, and maximum tree depth.
QTREE_FUNC void qt_create(Quadtree* qt, int width, int height, int max_elements, int max_depth);

// Destroys the quadtree.
QTREE_FUNC void qt_destroy(Quadtree* qt);

// Inserts a new element to the tree.
// Returns an index to the new element.
QTREE_FUNC int qt_insert(Quadtree* qt, int id, float x1, float y1, float x2, float y2);

// Removes the specified element from the tree.
QTREE_FUNC void qt_remove(Quadtree* qt, int element);

// Cleans up the tree, removing empty leaves.
QTREE_FUNC void qt_cleanup(Quadtree* qt);

// Outputs a list of elements found in the specified rectangle.
QTREE_FUNC void qt_query(Quadtree* qt, IntList* out, float x1, float y1, float x2, float y2, int omit_element);

// Traverses all the nodes in the tree, calling 'branch' for branch nodes and 'leaf' 
// for leaf nodes.
QTREE_FUNC void qt_traverse(Quadtree* qt, void* user_data, QtNodeFunc* branch, QtNodeFunc* leaf);

#endif

#include "Quadtree.h"
#include <stdlib.h>

enum
{
    // ----------------------------------------------------------------------------------------
    // Element node fields:
    // ----------------------------------------------------------------------------------------
    enode_num = 2,

    // Points to the next element in the leaf node. A value of -1 
    // indicates the end of the list.
    enode_idx_next = 0,

    // Stores the element index.
    enode_idx_elt = 1,

    // ----------------------------------------------------------------------------------------
    // Element fields:
    // ----------------------------------------------------------------------------------------
    elt_num = 5,

    // Stores the rectangle encompassing the element.
    elt_idx_lft = 0, elt_idx_top = 1, elt_idx_rgt = 2, elt_idx_btm = 3,

    // Stores the ID of the element.
    elt_idx_id = 4,

    // ----------------------------------------------------------------------------------------
    // Node fields:
    // ----------------------------------------------------------------------------------------
    node_num = 2,

    // Points to the first child if this node is a branch or the first element
    // if this node is a leaf.
    node_idx_fc = 0,

    // Stores the number of elements in the node or -1 if it is not a leaf.
    node_idx_num = 1,

    // ----------------------------------------------------------------------------------------
    // Node data fields:
    // ----------------------------------------------------------------------------------------
    nd_num = 6,

    // Stores the extents of the node using a centered rectangle and half-size.
    nd_idx_mx = 0, nd_idx_my = 1, nd_idx_sx = 2, nd_idx_sy = 3,

    // Stores the index of the node.
    nd_idx_index = 4,

    // Stores the depth of the node.
    nd_idx_depth = 5,
};

static void node_insert(Quadtree* qt, int index, int depth, int mx, int my, int sx, int sy, int element);

static int floor_int(float val)
{
    // Note: the cast truncates toward zero rather than flooring negative values,
    // which matches floor only for non-negative coordinates.
    return (int)val;
}

static int intersect(int l1, int t1, int r1, int b1,
                     int l2, int t2, int r2, int b2)
{
    return l2 <= r1 && r2 >= l1 && t2 <= b1 && b2 >= t1;
}

void leaf_insert(Quadtree* qt, int node, int depth, int mx, int my, int sx, int sy, int element)
{
    // Insert the element node to the leaf.
    const int nd_fc = il_get(&qt->nodes, node, node_idx_fc);
    il_set(&qt->nodes, node, node_idx_fc, il_insert(&qt->enodes));
    il_set(&qt->enodes, il_get(&qt->nodes, node, node_idx_fc), enode_idx_next, nd_fc);
    il_set(&qt->enodes, il_get(&qt->nodes, node, node_idx_fc), enode_idx_elt, element);

    // If the leaf is full, split it.
    if (il_get(&qt->nodes, node, node_idx_num) == qt->max_elements && depth < qt->max_depth)
    {
        int fc = 0, j = 0;
        IntList elts = {0};
        il_create(&elts, 1);

        // Transfer elements from the leaf node to a list of elements.
        while (il_get(&qt->nodes, node, node_idx_fc) != -1)
        {
            const int index = il_get(&qt->nodes, node, node_idx_fc);
            const int next_index = il_get(&qt->enodes, index, enode_idx_next);
            const int elt = il_get(&qt->enodes, index, enode_idx_elt);

            // Pop off the element node from the leaf and remove it from the qt.
            il_set(&qt->nodes, node, node_idx_fc, next_index);
            il_erase(&qt->enodes, index);

            // Insert element to the list.
            il_set(&elts, il_push_back(&elts), 0, elt);
        }

        // Start by allocating 4 child nodes.
        fc = il_insert(&qt->nodes);
        il_insert(&qt->nodes);
        il_insert(&qt->nodes);
        il_insert(&qt->nodes);
        il_set(&qt->nodes, node, node_idx_fc, fc);

        // Initialize the new child nodes.
        for (j=0; j < 4; ++j)
        {
            il_set(&qt->nodes, fc+j, node_idx_fc, -1);
            il_set(&qt->nodes, fc+j, node_idx_num, 0);
        }

        // Transfer the elements in the former leaf node to its new children.
        il_set(&qt->nodes, node, node_idx_num, -1);
        for (j=0; j < il_size(&elts); ++j)
            node_insert(qt, node, depth, mx, my, sx, sy, il_get(&elts, j, 0));
        il_destroy(&elts);
    }
    else
    {
        // Increment the leaf element count.
        il_set(&qt->nodes, node, node_idx_num, il_get(&qt->nodes, node, node_idx_num) + 1);
    }
}

static void push_node(IntList* nodes, int nd_index, int nd_depth, int nd_mx, int nd_my, int nd_sx, int nd_sy)
{
    const int back_idx = il_push_back(nodes);
    il_set(nodes, back_idx, nd_idx_mx, nd_mx);
    il_set(nodes, back_idx, nd_idx_my, nd_my);
    il_set(nodes, back_idx, nd_idx_sx, nd_sx);
    il_set(nodes, back_idx, nd_idx_sy, nd_sy);
    il_set(nodes, back_idx, nd_idx_index, nd_index);
    il_set(nodes, back_idx, nd_idx_depth, nd_depth);
}

static void find_leaves(IntList* out, const Quadtree* qt, int node, int depth, 
                        int mx, int my, int sx, int sy, 
                        int lft, int top, int rgt, int btm)
{
    IntList to_process = {0};
    il_create(&to_process, nd_num);
    push_node(&to_process, node, depth, mx, my, sx, sy);

    while (il_size(&to_process) > 0)
    {
        const int back_idx = il_size(&to_process) - 1;
        const int nd_mx = il_get(&to_process, back_idx, nd_idx_mx);
        const int nd_my = il_get(&to_process, back_idx, nd_idx_my);
        const int nd_sx = il_get(&to_process, back_idx, nd_idx_sx);
        const int nd_sy = il_get(&to_process, back_idx, nd_idx_sy);
        const int nd_index = il_get(&to_process, back_idx, nd_idx_index);
        const int nd_depth = il_get(&to_process, back_idx, nd_idx_depth);
        il_pop_back(&to_process);

        // If this node is a leaf, insert it to the list.
        if (il_get(&qt->nodes, nd_index, node_idx_num) != -1)
            push_node(out, nd_index, nd_depth, nd_mx, nd_my, nd_sx, nd_sy);
        else
        {
            // Otherwise push the children that intersect the rectangle.
            const int fc = il_get(&qt->nodes, nd_index, node_idx_fc);
            const int hx = nd_sx >> 1, hy = nd_sy >> 1;
            const int l = nd_mx-hx, t = nd_my-hy, r = nd_mx+hx, b = nd_my+hy;

            if (top <= nd_my)
            {
                if (lft <= nd_mx)
                    push_node(&to_process, fc+0, nd_depth+1, l,t,hx,hy);
                if (rgt > nd_mx)
                    push_node(&to_process, fc+1, nd_depth+1, r,t,hx,hy);
            }
            if (btm > nd_my)
            {
                if (lft <= nd_mx)
                    push_node(&to_process, fc+2, nd_depth+1, l,b,hx,hy);
                if (rgt > nd_mx)
                    push_node(&to_process, fc+3, nd_depth+1, r,b,hx,hy);
            }
        }
    }
    il_destroy(&to_process);
}

static void node_insert(Quadtree* qt, int index, int depth, int mx, int my, int sx, int sy, int element)
{
    // Find the leaves and insert the element to all the leaves found.
    int j = 0;
    IntList leaves = {0};
    
    const int lft = il_get(&qt->elts, element, elt_idx_lft);
    const int top = il_get(&qt->elts, element, elt_idx_top);
    const int rgt = il_get(&qt->elts, element, elt_idx_rgt);
    const int btm = il_get(&qt->elts, element, elt_idx_btm);

    il_create(&leaves, nd_num);
    find_leaves(&leaves, qt, index, depth, mx, my, sx, sy, lft, top, rgt, btm);
    for (j=0; j < il_size(&leaves); ++j)
    {
        const int nd_mx = il_get(&leaves, j, nd_idx_mx);
        const int nd_my = il_get(&leaves, j, nd_idx_my);
        const int nd_sx = il_get(&leaves, j, nd_idx_sx);
        const int nd_sy = il_get(&leaves, j, nd_idx_sy);
        const int nd_index = il_get(&leaves, j, nd_idx_index);
        const int nd_depth = il_get(&leaves, j, nd_idx_depth);
        leaf_insert(qt, nd_index, nd_depth, nd_mx, nd_my, nd_sx, nd_sy, element);
    }
    il_destroy(&leaves);
}

void qt_create(Quadtree* qt, int width, int height, int max_elements, int max_depth)
{
    qt->max_elements = max_elements;
    qt->max_depth = max_depth;
    qt->temp = 0;
    qt->temp_size = 0;
    il_create(&qt->nodes, node_num);
    il_create(&qt->elts, elt_num);
    il_create(&qt->enodes, enode_num);

    // Insert the root node to the qt.
    il_insert(&qt->nodes);
    il_set(&qt->nodes, 0, node_idx_fc, -1);
    il_set(&qt->nodes, 0, node_idx_num, 0);

    // Set the extents of the root node.
    qt->root_mx = width >> 1;
    qt->root_my = height >> 1;
    qt->root_sx = qt->root_mx;
    qt->root_sy = qt->root_my;
}

void qt_destroy(Quadtree* qt)
{
    il_destroy(&qt->nodes);
    il_destroy(&qt->elts);
    il_destroy(&qt->enodes);
    free(qt->temp);
}

int qt_insert(Quadtree* qt, int id, float x1, float y1, float x2, float y2)
{
    // Insert a new element.
    const int new_element = il_insert(&qt->elts);

    // Set the fields of the new element.
    il_set(&qt->elts, new_element, elt_idx_lft, floor_int(x1));
    il_set(&qt->elts, new_element, elt_idx_top, floor_int(y1));
    il_set(&qt->elts, new_element, elt_idx_rgt, floor_int(x2));
    il_set(&qt->elts, new_element, elt_idx_btm, floor_int(y2));
    il_set(&qt->elts, new_element, elt_idx_id, id);

    // Insert the element to the appropriate leaf node(s).
    node_insert(qt, 0, 0, qt->root_mx, qt->root_my, qt->root_sx, qt->root_sy, new_element);
    return new_element;
}

void qt_remove(Quadtree* qt, int element)
{
    // Find the leaves.
    int j = 0;
    IntList leaves = {0};

    const int lft = il_get(&qt->elts, element, elt_idx_lft);
    const int top = il_get(&qt->elts, element, elt_idx_top);
    const int rgt = il_get(&qt->elts, element, elt_idx_rgt);
    const int btm = il_get(&qt->elts, element, elt_idx_btm);

    il_create(&leaves, nd_num);
    find_leaves(&leaves, qt, 0, 0, qt->root_mx, qt->root_my, qt->root_sx, qt->root_sy, lft, top, rgt, btm);

    // For each leaf node, remove the element node.
    for (j=0; j < il_size(&leaves); ++j)
    {
        const int nd_index = il_get(&leaves, j, nd_idx_index);

        // Walk the list until we find the element node.
        int node_index = il_get(&qt->nodes, nd_index, node_idx_fc);
        int prev_index = -1;
        while (node_index != -1 && il_get(&qt->enodes, node_index, enode_idx_elt) != element)
        {
            prev_index = node_index;
            node_index = il_get(&qt->enodes, node_index, enode_idx_next);
        }

        if (node_index != -1)
        {
            // Remove the element node.
            const int next_index = il_get(&qt->enodes, node_index, enode_idx_next);
            if (prev_index == -1)
                il_set(&qt->nodes, nd_index, node_idx_fc, next_index);
            else
                il_set(&qt->enodes, prev_index, enode_idx_next, next_index);
            il_erase(&qt->enodes, node_index);

            // Decrement the leaf element count.
            il_set(&qt->nodes, nd_index, node_idx_num, il_get(&qt->nodes, nd_index, node_idx_num)-1);
        }
    }
    il_destroy(&leaves);

    // Remove the element.
    il_erase(&qt->elts, element);
}

void qt_query(Quadtree* qt, IntList* out, float x1, float y1, float x2, float y2, int omit_element)
{
    // Find the leaves that intersect the specified query rectangle.
    int j = 0;
    IntList leaves = {0};
    const int elt_cap = il_size(&qt->elts);

    const int qlft = floor_int(x1);
    const int qtop = floor_int(y1);
    const int qrgt = floor_int(x2);
    const int qbtm = floor_int(y2);

    if (qt->temp_size < elt_cap)
    {
        qt->temp_size = elt_cap;
        qt->temp = realloc(qt->temp, qt->temp_size * sizeof(*qt->temp));
        memset(qt->temp, 0, qt->temp_size * sizeof(*qt->temp));
    }

    // For each leaf node, look for elements that intersect.
    il_create(&leaves, nd_num);
    find_leaves(&leaves, qt, 0, 0, qt->root_mx, qt->root_my, qt->root_sx, qt->root_sy, qlft, qtop, qrgt, qbtm);

    il_clear(out);
    for (j=0; j < il_size(&leaves); ++j)
    {
        const int nd_index = il_get(&leaves, j, nd_idx_index);

        // Walk the list and add elements that intersect.
        int elt_node_index = il_get(&qt->nodes, nd_index, node_idx_fc);
        while (elt_node_index != -1)
        {
            const int element = il_get(&qt->enodes, elt_node_index, enode_idx_elt);
            const int lft = il_get(&qt->elts, element, elt_idx_lft);
            const int top = il_get(&qt->elts, element, elt_idx_top);
            const int rgt = il_get(&qt->elts, element, elt_idx_rgt);
            const int btm = il_get(&qt->elts, element, elt_idx_btm);
            if (!qt->temp[element] && element != omit_element && intersect(qlft,qtop,qrgt,qbtm, lft,top,rgt,btm))
            {
                il_set(out, il_push_back(out), 0, element);
                qt->temp[element] = 1;
            }
            elt_node_index = il_get(&qt->enodes, elt_node_index, enode_idx_next);
        }
    }
    il_destroy(&leaves);

    // Unmark the elements that were inserted.
    for (j=0; j < il_size(out); ++j)
        qt->temp[il_get(out, j, 0)] = 0;
}

void qt_cleanup(Quadtree* qt)
{
    IntList to_process = {0};
    il_create(&to_process, 1);

    // Only process the root if it's not a leaf.
    if (il_get(&qt->nodes, 0, node_idx_num) == -1)
    {
        // Push the root index to the stack.
        il_set(&to_process, il_push_back(&to_process), 0, 0);
    }

    while (il_size(&to_process) > 0)
    {
        // Pop a node from the stack.
        const int node = il_get(&to_process, il_size(&to_process)-1, 0);
        const int fc = il_get(&qt->nodes, node, node_idx_fc);
        int num_empty_leaves = 0;
        int j = 0;
        il_pop_back(&to_process);

        // Loop through the children.
        for (j=0; j < 4; ++j)
        {
            const int child = fc + j;

            // Increment empty leaf count if the child is an empty 
            // leaf. Otherwise if the child is a branch, add it to
            // the stack to be processed in the next iteration.
            if (il_get(&qt->nodes, child, node_idx_num) == 0)
                ++num_empty_leaves;
            else if (il_get(&qt->nodes, child, node_idx_num) == -1)
            {
                // Push the child index to the stack.
                il_set(&to_process, il_push_back(&to_process), 0, child);
            }
        }

        // If all the children were empty leaves, remove them and 
        // make this node the new empty leaf.
        if (num_empty_leaves == 4)
        {
            // Remove all 4 children in reverse order so that they 
            // can be reclaimed on subsequent insertions in proper
            // order.
            il_erase(&qt->nodes, fc + 3);
            il_erase(&qt->nodes, fc + 2);
            il_erase(&qt->nodes, fc + 1);
            il_erase(&qt->nodes, fc + 0);

            // Make this node the new empty leaf.
            il_set(&qt->nodes, node, node_idx_fc, -1);
            il_set(&qt->nodes, node, node_idx_num, 0);
        }
    }
    il_destroy(&to_process);
}

void qt_traverse(Quadtree* qt, void* user_data, QtNodeFunc* branch, QtNodeFunc* leaf)
{
    IntList to_process = {0};
    il_create(&to_process, nd_num);
    push_node(&to_process, 0, 0, qt->root_mx, qt->root_my, qt->root_sx, qt->root_sy);

    while (il_size(&to_process) > 0)
    {
        const int back_idx = il_size(&to_process) - 1;
        const int nd_mx = il_get(&to_process, back_idx, nd_idx_mx);
        const int nd_my = il_get(&to_process, back_idx, nd_idx_my);
        const int nd_sx = il_get(&to_process, back_idx, nd_idx_sx);
        const int nd_sy = il_get(&to_process, back_idx, nd_idx_sy);
        const int nd_index = il_get(&to_process, back_idx, nd_idx_index);
        const int nd_depth = il_get(&to_process, back_idx, nd_idx_depth);
        const int fc = il_get(&qt->nodes, nd_index, node_idx_fc);
        il_pop_back(&to_process);

        if (il_get(&qt->nodes, nd_index, node_idx_num) == -1)
        {
            // Push the children of the branch to the stack.
            const int hx = nd_sx >> 1, hy = nd_sy >> 1;
            const int l = nd_mx-hx, t = nd_my-hy, r = nd_mx+hx, b = nd_my+hy;
            push_node(&to_process, fc+0, nd_depth+1, l,t, hx,hy);
            push_node(&to_process, fc+1, nd_depth+1, r,t, hx,hy);
            push_node(&to_process, fc+2, nd_depth+1, l,b, hx,hy);
            push_node(&to_process, fc+3, nd_depth+1, r,b, hx,hy);
            if (branch)
                branch(qt, user_data, nd_index, nd_depth, nd_mx, nd_my, nd_sx, nd_sy);
        }
        else if (leaf)
            leaf(qt, user_data, nd_index, nd_depth, nd_mx, nd_my, nd_sx, nd_sy);
    }
    il_destroy(&to_process);
}

Temporary Conclusion

This isn't such a great answer yet, but I'll try to come back and keep editing it. The above code should be very efficient in pretty much any language that allows contiguous arrays of plain old integers. As long as we have that contiguity guarantee, we can come up with a very cache-friendly quadtree that uses a very small memory footprint.
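
To make the API a bit more concrete, here's a rough usage sketch for a typical simulation step. The Agent struct and its fields are made up purely for illustration; the quadtree and IntList calls are the ones defined above, and the sketch assumes it lives in the same translation unit as the quadtree code so that elt_idx_id and the elts member are visible:

typedef struct Agent Agent;
struct Agent
{
    // Hypothetical fields: center position and half-size of the agent's box.
    float x, y, hw, hh;
    // Index returned by qt_insert for this agent.
    int qt_element;
};

static void step_agents(Quadtree* qt, Agent* agents, int num_agents)
{
    int i = 0, j = 0;
    IntList results = {0};
    il_create(&results, 1);

    // Remove and reinsert every agent since it may have moved this step.
    for (i = 0; i < num_agents; ++i)
    {
        Agent* a = &agents[i];
        qt_remove(qt, a->qt_element);
        a->qt_element = qt_insert(qt, i, a->x - a->hw, a->y - a->hh,
                                         a->x + a->hw, a->y + a->hh);
    }

    // Coarse collision queries.
    for (i = 0; i < num_agents; ++i)
    {
        Agent* a = &agents[i];
        qt_query(qt, &results, a->x - a->hw, a->y - a->hh,
                               a->x + a->hw, a->y + a->hh, a->qt_element);

        for (j = 0; j < il_size(&results); ++j)
        {
            // Map the element index back to the id that was passed to qt_insert.
            const int elt = il_get(&results, j, 0);
            const int other = il_get(&qt->elts, elt, elt_idx_id);
            // ... narrow-phase test between agents[i] and agents[other] ...
        }
    }
    il_destroy(&results);

    // Collapse empty leaves once per step.
    qt_cleanup(qt);
}

The tree itself would be created once up front (e.g. qt_create(&qt, world_width, world_height, ...) with whatever limits suit your world) and the agents inserted before the first step. qt_traverse is handy for things like drawing the node boundaries as in the GIFs, by passing callbacks matching QtNodeFunc.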

Please refer to the original answer for details on the overall approach.

Lucaslucca answered 20/1, 2018 at 9:7 Comment(11)
If doing this in c++, do you think that using std::vector will hinder performance at all instead of using IntList?Horsy
@DragonEnergy thank you for your excellent explanations. I have learned a considerable bit. At the same time, I believe I spotted a bug across your C and Java implementations. Namely, line const int l = nd_mx-hx, t = nd_my-hx, r = nd_mx+hx, b = nd_my+hy; which is reused multiple times, mixes up axes when assigning to t.Apiece
Based on the author's comments I allowed myself to repost this code with an appropriate readme. github.com/Antymon/quadtreeApiece
@DragonEnergy Something I noticed while looking at the sources: Various parts of the quad tree's code iterate over all elements in an int list via il_size, which includes elements that were previously freed using il_erase hence returning unreliable data. Am I missing something here?Subset
@MaxTruxa You're right! I'm so sorry about that. Normally I have this little lib I made for myself with C++ range-based fors and whatnot where I can skip over removed elements. In the dumb process of hasty porting to try to make the solution more accessible to people from all over sorts of languages and compilers, I forgot to include that functionality.Lucaslucca
@MaxTruxa Actually the idea of trying to make this uber portable was really dumb in hindsight. I just made my code very unreadable and rely on the lowest common denominator and I made this error in the process. Probably it would have been better for me to just focus on making the code as readable as possible and people can port it to whatever language they like. It's a bit awkward for me though since I lean heavily on libraries (third party or ones I built) to do stuff for me and I take a lot of things for granted.Lucaslucca
@MaxTruxa Well, I combed through the code this morning and it should actually be all right unless I missed something. Most of those iterations are using a stack pushing the root node to the top of the stack and then using the links of the tree (indices) to avoid traversing removed elements. That is what query does, for example, in find_leaves as well as qt_cleanup. No part of the code processes all the nodes in the the list -- that's strictly done through tree traversal although the code is extremely hard to read in this uber portable form. Lemme grab the C++03 version.Lucaslucca
@MaxTruxa Also all those list instances called to_process are basically just using a stack to avoid recursive function calls. We start off pushing the root to it and then loop over it, popping from the top of the stack and descending down and pushing children until we reach the leaves and the stack becomes empty. Here's the more readable C++03 code: pastebin.com/0pw54NFnLucaslucca
@SzymonBrych Also you might be interested in the C++03 version linked immediately above. It's a lot easier to read. My attempt in this answer was a very goofy attempt to try to let anyone in any programming language easily port the code and just use it but at tremendous cost to structure and readability. Again feel free to use it however you like, change it, sell stuff you make with it, whatever you want.Lucaslucca
@SzymonBrych Actually the code is still kinda goofy. For my answer I was looking at sharing two solutions for the stuff I was reusing to make my life easy: the first one I whipped up using a sort of large C++ library of code I built. Sharing it would probably overwhelm people. Then I have this smaller C library I use with a C SDK for wide compatibility with other languages and compilers. So I kind of bastardized C and C++ here and it's like not quite C (ex: not using realloc) but not quite proper C++ (ex: using memcpy for SmallList). Well, I hope it at least communicates the general idea.Lucaslucca
@SzymonBrych Again please change it, make it comfortable and readable to you, do whatever. Sell it. Take credit. I don't mind. I work on much harder stuff most of the time. I'm happy just seeing people not stumbling on simple things like collision queries involving just hundreds of elements or thousands when we have the hardware these days to do millions at 60+ FPS. It's reward enough for me to just stop seeing the industry struggle with this and understand the general mindset of playing nicely with the memory hierarchy.Lucaslucca
L
37

Dirty Trick: Uniform Sizes

For this answer I'll cover an underhanded trick which can allow your simulation to run an order of magnitude faster if the data is appropriate (which it often is in video games, e.g.). It can take you from tens of thousands of agents to hundreds of thousands, or from hundreds of thousands to millions. I haven't applied it in any of the demonstrations shown in my answers so far since it's a bit of a cheat, but I have used it in production and it can make a world of difference. Oddly, I've never actually seen it discussed, which is strange.

So let's come back to the Lord of the Rings example. We have a lot of "humanish-sized" units like humans, elves, dwarves, orcs, and hobbits, and we also have some ginormous units like dragons and ents.

The "humanish-sized" units don't vary that much in size. A hobbit might be four feet tall and a bit stocky, while an orc might be 6'4". There's some difference, but it's not an epic, order-of-magnitude difference.

So what happens if we put a bounding sphere/box around a hobbit which is the size of an orc's bounding sphere/box just for the sake of coarse intersection queries (before we drill down to checking for more true collision on a granular/fine level)? There's a little bit of wasted negative space but something really interesting happens.

If we can foresee such an upper bound on common-case units, we can store them in a data structure which assumes that all things have a uniform upper bound size. A couple of really interesting things happen in this case:

  1. We don't have to store a size with each element. The data structure can assume that all elements inserted to it have the same uniform size (just for the purpose of coarse intersection queries). This can almost halve memory use for elements in many scenarios and it naturally speeds up traversal when we have less memory/data to access per element (see the small sketch after this list).
  2. We can store each element in just one cell/node, even with a tight representation, since the cells/nodes no longer need to store variable-sized AABBs.
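
As a tiny illustration of the first point (the struct names and the exact field layout here are hypothetical), compare the per-element storage:

// Variable-sized element: a full rectangle plus an id -> 5 ints per element.
struct VarSizedElt { int lft, top, rgt, btm, id; };

// Uniform-size element: the shared upper-bound half-size lives in the
// structure itself, so each element only stores its center and id -> 3 ints.
struct UniformElt  { int mx, my, id; };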

Storing Just One Point

This second part is tricky but imagine we have a case like this:

enter image description here

Well, if we look at the green circle and search its radius, we'd end up missing the center point of the blue circle if it's only stored as a single point in our spatial index. But what if we search an area twice the radius of our circles?

enter image description here

In that case we'd find the intersection even if the blue circle is only stored as a single point in our spatial index (the center point in orange). Just to kind of visually show that this works:

enter image description here

In this case the circles aren't intersecting, and we can see that the center point lies outside even the expanded, doubled search radius. So in a spatial index that assumes all elements have a uniform upper-bound size, we're guaranteed to find them in a coarse query as long as we search an area twice the upper-bound radius (or twice the rectangular half-size for AABBs).

Now that might seem wasteful like it'd check more cells/nodes than necessary in our search queries, but that's only because I drew the diagram for illustrative purposes. If you use this strategy, you'd use it for elements whose sizes are generally a fraction of the size of a single leaf node/cell.
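
In code, the trick boils down to fattening the query rectangle by the shared upper-bound half-size before searching the point-based index. PointIndex and point_index_query below are hypothetical placeholders for whatever tight, point-only structure you use:

// Coarse query against an index that stores each element as a single point
// (its center) under the assumption that no element exceeds a box of
// half-size (max_hw, max_hh). PointIndex/point_index_query are hypothetical.
void query_uniform(PointIndex* index, IntList* out,
                   float qlft, float qtop, float qrgt, float qbtm,
                   float max_hw, float max_hh)
{
    // Expand the query rectangle by the upper-bound half-size. Any element
    // whose (uniform upper-bound) box overlaps the original rectangle must
    // have its center inside this expanded one, so nothing is missed.
    point_index_query(index, out, qlft - max_hw, qtop - max_hh,
                                  qrgt + max_hw, qbtm + max_hh);
}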

Huge Optimization

So a huge optimization you can apply is to separate your content into 3 distinct types:

  1. A dynamic set (constantly moving around and animating) with a common-case upper bound like humans, orcs, elves, and hobbits. We basically put the same-size bounding box/sphere around all of these agents. Here you might use a tight representation like a tight quadtree or a tight grid and it'd only store a single point for each element. You might also use another instance of this same structure for super teeny elements like fairies and wisps with a different common-case upper-bound size.
  2. A dynamic set bigger than any foreseeable common-case upper bound like dragons and ents with very unusual sizes. Here you might use a loose representation like a loose quadtree or my "loose/tight double grid".
  3. A static set where you can afford structures that take longer to build or are very inefficient to update, like a quadtree for static data that stores everything in a perfectly contiguous way. In this case it doesn't matter how inefficient the data structure is to update provided it provides the fastest search queries, since you're never going to be updating it. You might use this for elements in your world like castles, barricades, and boulders.

So this idea of separating common-case elements with uniform upper-bound extents (bounding spheres or boxes) can be an extremely useful optimization strategy if you can apply it, and it's one I rarely see discussed. I often see developers talking about separating dynamic and static content, but you can get just as much of an improvement, if not more, by further grouping similarly-sized common-case dynamic elements and treating them as though they have uniform upper-bound sizes for your coarse collision tests. That lets each of them be stored like an infinitesimally small point which is inserted into only one leaf node of your tight data structure.
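
A top-level sketch of what that separation might look like (all three structure types and their query functions below are hypothetical placeholders; the point is just that one coarse query fans out to three indices, each tuned to its content):

// UniformPointGrid, LooseGrid, StaticQuadtree and their query functions are
// hypothetical placeholders for whatever concrete structures you pick.
typedef struct
{
    UniformPointGrid small_dynamic;  // humans, orcs, elves, hobbits: one point each, shared upper bound
    LooseGrid        large_dynamic;  // dragons, ents: loose cells that expand to fit their contents
    StaticQuadtree   static_elems;   // castles, barricades, boulders: rebuilt rarely, fastest to search
} World;

void world_query(World* world, IntList* out, float x1, float y1, float x2, float y2)
{
    // Each structure appends the ids of its matching elements to 'out'.
    uniform_point_grid_query(&world->small_dynamic, out, x1, y1, x2, y2);
    loose_grid_query(&world->large_dynamic, out, x1, y1, x2, y2);
    static_quadtree_query(&world->static_elems, out, x1, y1, x2, y2);
}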

On the Benefits of "Cheating"

So this solution isn't particularly clever or interesting, but the mindset behind it is something I feel is worth mentioning, at least for those who are like me. I wasted a good chunk of my career searching for "uber" solutions: the one-size-fits-all data structures and algorithms that can beautifully handle any use case, with the hope that taking a little extra time upfront to get them right would let me reuse them like crazy far into the future and across disparate use cases. And I worked with many colleagues who sought the same.

And in scenarios where performance cannot be compromised too much in favor of productivity, zealously seeking out such solutions can lead to neither performance nor productivity. So sometimes it's good to just stop and look at the nature of the particular data requirements for a software and see if we can "cheat" and create some "tailored", more narrowly-applicable solutions against those peculiar requirements, as in this example. Sometimes that's the most useful way to get a good combination of both performance and productivity in cases where one cannot be compromised too much in favor of the other.

Lucaslucca answered 23/1, 2018 at 11:22 Comment(0)
L
24
  1. Java IntList

I hope people don't mind me posting a third answer, but I ran out of the character limit again. I ended up porting the C code in the second answer to Java. The Java port might be easier to refer to for people porting to object-oriented languages.

class IntList
{
    private int data[];
    private int num_fields = 0;
    private int num = 0;
    private int cap = 128;
    private int free_element = -1;

    // Creates a new list of elements which each consist of integer fields.
    // 'start_num_fields' specifies the number of integer fields each element has.
    public IntList(int start_num_fields)
    {
        num_fields = start_num_fields;
        data = new int[cap * start_num_fields];
    }

    // Returns the number of elements in the list.
    int size()
    {
        return num;
    }

    // Returns the value of the specified field for the nth element.
    int get(int n, int field)
    {
        assert n >= 0 && n < num && field >= 0 && field < num_fields;
        return data[n*num_fields + field];
    }

    // Sets the value of the specified field for the nth element.
    void set(int n, int field, int val)
    {
        assert n >= 0 && n < num && field >= 0 && field < num_fields;
        data[n*num_fields + field] = val;
    }

    // Clears the list, making it empty.
    void clear()
    {
        num = 0;
        free_element = -1;
    }

    // Inserts an element to the back of the list and returns an index to it.
    int pushBack()
    {
        final int new_pos = (num+1) * num_fields;
        
        // If the list is full, we need to reallocate the buffer to make room
        // for the new element.
        if (new_pos > cap)
        {
            // Use double the size for the new capacity.
            final int new_cap = new_pos * 2;

            // Allocate new array and copy former contents.
            int new_array[] = new int[new_cap];
            System.arraycopy(data, 0, new_array, 0, cap);
            data = new_array;
    
            // Set the old capacity to the new capacity.
            cap = new_cap;
        }
        return num++;
    }
    
    // Removes the element at the back of the list.
    void popBack()
    {
        // Just decrement the list size.
        assert num > 0;
        --num;
    }

    // Inserts an element to a vacant position in the list and returns an index to it.
    int insert()
    {
        // If there's a free index in the free list, pop that and use it.
        if (free_element != -1)
        {
            final int index = free_element;
            final int pos = index * num_fields;
    
            // Set the free index to the next free index.
            free_element = data[pos];
    
            // Return the free index.
            return index;
        }
        // Otherwise insert to the back of the array.
        return pushBack();
    }

    // Removes the nth element in the list.
    void erase(int n)
    {
        // Push the element to the free list.
        final int pos = n * num_fields;
        data[pos] = free_element;
        free_element = n;
    }
}

Java Quadtree

And here is the quadtree in Java (sorry if it's not very idiomatic; I haven't written Java in about a decade or so and have forgotten many things):

interface IQtVisitor
{
    // Called when traversing a branch node.
    // (mx, my) indicate the center of the node's AABB.
    // (sx, sy) indicate the half-size of the node's AABB.
    void branch(Quadtree qt, int node, int depth, int mx, int my, int sx, int sy);

    // Called when traversing a leaf node.
    // (mx, my) indicate the center of the node's AABB.
    // (sx, sy) indicate the half-size of the node's AABB.
    void leaf(Quadtree qt, int node, int depth, int mx, int my, int sx, int sy);
}

class Quadtree
{
    // Creates a quadtree with the requested extents, maximum elements per leaf, and maximum tree depth.
    Quadtree(int width, int height, int start_max_elements, int start_max_depth)
    {
        max_elements = start_max_elements;
        max_depth = start_max_depth;

        // Insert the root node to the qt.
        nodes.insert();
        nodes.set(0, node_idx_fc, -1);
        nodes.set(0, node_idx_num, 0);

        // Set the extents of the root node.
        root_mx = width / 2;
        root_my = height / 2;
        root_sx = root_mx;
        root_sy = root_my;
    }

    // Inserts a new element into the quadtree and returns its index.
    public int insert(int id, float x1, float y1, float x2, float y2)
    {
        // Insert a new element.
        final int new_element = elts.insert();
    
        // Set the fields of the new element.
        elts.set(new_element, elt_idx_lft, floor_int(x1));
        elts.set(new_element, elt_idx_top, floor_int(y1));
        elts.set(new_element, elt_idx_rgt, floor_int(x2));
        elts.set(new_element, elt_idx_btm, floor_int(y2));
        elts.set(new_element, elt_idx_id, id);
    
        // Insert the element to the appropriate leaf node(s).
        node_insert(0, 0, root_mx, root_my, root_sx, root_sy, new_element);
        return new_element;
    }

    // Removes the specified element from the tree.
    public void remove(int element)
    {
        // Find the leaves.
        final int lft = elts.get(element, elt_idx_lft);
        final int top = elts.get(element, elt_idx_top);
        final int rgt = elts.get(element, elt_idx_rgt);
        final int btm = elts.get(element, elt_idx_btm);
        IntList leaves = find_leaves(0, 0, root_mx, root_my, root_sx, root_sy, lft, top, rgt, btm);
    
        // For each leaf node, remove the element node.
        for (int j=0; j < leaves.size(); ++j)
        {
            final int nd_index = leaves.get(j, nd_idx_index);
    
            // Walk the list until we find the element node.
            int node_index = nodes.get(nd_index, node_idx_fc);
            int prev_index = -1;
            while (node_index != -1 && enodes.get(node_index, enode_idx_elt) != element)
            {
                prev_index = node_index;
                node_index = enodes.get(node_index, enode_idx_next);
            }
    
            if (node_index != -1)
            {
                // Remove the element node.
                final int next_index = enodes.get(node_index, enode_idx_next);
                if (prev_index == -1)
                    nodes.set(nd_index, node_idx_fc, next_index);
                else
                    enodes.set(prev_index, enode_idx_next, next_index);
                enodes.erase(node_index);

                // Decrement the leaf element count.
                nodes.set(nd_index, node_idx_num, nodes.get(nd_index, node_idx_num)-1);
            }
        }

        // Remove the element.
        elts.erase(element);
    }

    // Cleans up the tree, removing empty leaves.
    public void cleanup()
    {
        IntList to_process = new IntList(1);

        // Only process the root if it's not a leaf.
        if (nodes.get(0, node_idx_num) == -1)
        {
            // Push the root index to the stack.
            to_process.set(to_process.pushBack(), 0, 0);
        }
    
        while (to_process.size() > 0)
        {
            // Pop a node from the stack.
            final int node = to_process.get(to_process.size()-1, 0);
            final int fc = nodes.get(node, node_idx_fc);
            int num_empty_leaves = 0;
            to_process.popBack();
    
            // Loop through the children.
            for (int j=0; j < 4; ++j)
            {
                final int child = fc + j;
    
                // Increment empty leaf count if the child is an empty 
                // leaf. Otherwise if the child is a branch, add it to
                // the stack to be processed in the next iteration.
                if (nodes.get(child, node_idx_num) == 0)
                    ++num_empty_leaves;
                else if (nodes.get(child, node_idx_num) == -1)
                {
                    // Push the child index to the stack.
                    to_process.set(to_process.pushBack(), 0, child);
                }
            }
    
            // If all the children were empty leaves, remove them and 
            // make this node the new empty leaf.
            if (num_empty_leaves == 4)
            {
                // Remove all 4 children in reverse order so that they 
                // can be reclaimed on subsequent insertions in proper
                // order.
                nodes.erase(fc + 3);
                nodes.erase(fc + 2);
                nodes.erase(fc + 1);
                nodes.erase(fc + 0);
    
                // Make this node the new empty leaf.
                nodes.set(node, node_idx_fc, -1);
                nodes.set(node, node_idx_num, 0);
            }
        }
    }

    // Returns a list of elements found in the specified rectangle.
    public IntList query(float x1, float y1, float x2, float y2)
    {
        return query(x1, y1, x2, y2, -1);
    }
    
    // Returns a list of elements found in the specified rectangle excluding the
    // specified element to omit.
    public IntList query(float x1, float y1, float x2, float y2, int omit_element)
    {
        IntList out = new IntList(1);

        // Find the leaves that intersect the specified query rectangle.
        final int qlft = floor_int(x1);
        final int qtop = floor_int(y1);
        final int qrgt = floor_int(x2);
        final int qbtm = floor_int(y2);
        IntList leaves = find_leaves(0, 0, root_mx, root_my, root_sx, root_sy, qlft, qtop, qrgt, qbtm);

        if (temp_size < elts.size())
        {
            temp_size = elts.size();
            temp = new boolean[temp_size];
        }
    
        // For each leaf node, look for elements that intersect.
        for (int j=0; j < leaves.size(); ++j)
        {
            final int nd_index = leaves.get(j, nd_idx_index);
    
            // Walk the list and add elements that intersect.
            int elt_node_index = nodes.get(nd_index, node_idx_fc);
            while (elt_node_index != -1)
            {
                final int element = enodes.get(elt_node_index, enode_idx_elt);
                final int lft = elts.get(element, elt_idx_lft);
                final int top = elts.get(element, elt_idx_top);
                final int rgt = elts.get(element, elt_idx_rgt);
                final int btm = elts.get(element, elt_idx_btm);
                if (!temp[element] && element != omit_element && intersect(qlft,qtop,qrgt,qbtm, lft,top,rgt,btm))
                {
                    out.set(out.pushBack(), 0, element);
                    temp[element] = true;
                }
                elt_node_index = enodes.get(elt_node_index, enode_idx_next);
            }
        }

        // Unmark the elements that were inserted.
        for (int j=0; j < out.size(); ++j)
            temp[out.get(j, 0)] = false;
        return out;
    }

    // Traverses all the nodes in the tree, calling 'branch' for branch nodes and 'leaf' 
    // for leaf nodes.
    public void traverse(IQtVisitor visitor)
    {
        IntList to_process = new IntList(nd_num);
        pushNode(to_process, 0, 0, root_mx, root_my, root_sx, root_sy);
    
        while (to_process.size() > 0)
        {
            final int back_idx = to_process.size() - 1;
            final int nd_mx = to_process.get(back_idx, nd_idx_mx);
            final int nd_my = to_process.get(back_idx, nd_idx_my);
            final int nd_sx = to_process.get(back_idx, nd_idx_sx);
            final int nd_sy = to_process.get(back_idx, nd_idx_sy);
            final int nd_index = to_process.get(back_idx, nd_idx_index);
            final int nd_depth = to_process.get(back_idx, nd_idx_depth);
            final int fc = nodes.get(nd_index, node_idx_fc);
            to_process.popBack();
    
            if (nodes.get(nd_index, node_idx_num) == -1)
            {
                // Push the children of the branch to the stack.
                final int hx = nd_sx >> 1, hy = nd_sy >> 1;
                final int l = nd_mx-hx, t = nd_my-hy, r = nd_mx+hx, b = nd_my+hy;
                pushNode(to_process, fc+0, nd_depth+1, l,t, hx,hy);
                pushNode(to_process, fc+1, nd_depth+1, r,t, hx,hy);
                pushNode(to_process, fc+2, nd_depth+1, l,b, hx,hy);
                pushNode(to_process, fc+3, nd_depth+1, r,b, hx,hy);
                visitor.branch(this, nd_index, nd_depth, nd_mx, nd_my, nd_sx, nd_sy);
            }
            else
                visitor.leaf(this, nd_index, nd_depth, nd_mx, nd_my, nd_sx, nd_sy);
        }
    }

    // Note: the cast truncates toward zero, which matches a true floor only
    // for non-negative values.
    private static int floor_int(float val)
    {
        return (int)val;
    }
    
    private static boolean intersect(int l1, int t1, int r1, int b1,
                                     int l2, int t2, int r2, int b2)
    {
        return l2 <= r1 && r2 >= l1 && t2 <= b1 && b2 >= t1;
    }

    private static void pushNode(IntList nodes, int nd_index, int nd_depth, int nd_mx, int nd_my, int nd_sx, int nd_sy)
    {
        final int back_idx = nodes.pushBack();
        nodes.set(back_idx, nd_idx_mx, nd_mx);
        nodes.set(back_idx, nd_idx_my, nd_my);
        nodes.set(back_idx, nd_idx_sx, nd_sx);
        nodes.set(back_idx, nd_idx_sy, nd_sy);
        nodes.set(back_idx, nd_idx_index, nd_index);
        nodes.set(back_idx, nd_idx_depth, nd_depth);
    }

    private IntList find_leaves(int node, int depth, 
                                int mx, int my, int sx, int sy, 
                                int lft, int top, int rgt, int btm)
    {
        IntList leaves = new IntList(nd_num);
        IntList to_process = new IntList(nd_num);
        pushNode(to_process, node, depth, mx, my, sx, sy);
    
        while (to_process.size() > 0)
        {
            final int back_idx = to_process.size() - 1;
            final int nd_mx = to_process.get(back_idx, nd_idx_mx);
            final int nd_my = to_process.get(back_idx, nd_idx_my);
            final int nd_sx = to_process.get(back_idx, nd_idx_sx);
            final int nd_sy = to_process.get(back_idx, nd_idx_sy);
            final int nd_index = to_process.get(back_idx, nd_idx_index);
            final int nd_depth = to_process.get(back_idx, nd_idx_depth);
            to_process.popBack();

            // If this node is a leaf, insert it to the list.
            if (nodes.get(nd_index, node_idx_num) != -1)
                pushNode(leaves, nd_index, nd_depth, nd_mx, nd_my, nd_sx, nd_sy);
            else
            {
                // Otherwise push the children that intersect the rectangle.
                final int fc = nodes.get(nd_index, node_idx_fc);
                final int hx = nd_sx / 2, hy = nd_sy / 2;
                final int l = nd_mx-hx, t = nd_my-hy, r = nd_mx+hx, b = nd_my+hy;
    
                if (top <= nd_my)
                {
                    if (lft <= nd_mx)
                        pushNode(to_process, fc+0, nd_depth+1, l,t,hx,hy);
                    if (rgt > nd_mx)
                        pushNode(to_process, fc+1, nd_depth+1, r,t,hx,hy);
                }
                if (btm > nd_my)
                {
                    if (lft <= nd_mx)
                        pushNode(to_process, fc+2, nd_depth+1, l,b,hx,hy);
                    if (rgt > nd_mx)
                        pushNode(to_process, fc+3, nd_depth+1, r,b,hx,hy);
                }
            }
        }
        return leaves;
    }

    private void node_insert(int index, int depth, int mx, int my, int sx, int sy, int element)
    {
        // Find the leaves and insert the element to all the leaves found.
        final int lft = elts.get(element, elt_idx_lft);
        final int top = elts.get(element, elt_idx_top);
        final int rgt = elts.get(element, elt_idx_rgt);
        final int btm = elts.get(element, elt_idx_btm);
        IntList leaves = find_leaves(index, depth, mx, my, sx, sy, lft, top, rgt, btm);

        for (int j=0; j < leaves.size(); ++j)
        {
            final int nd_mx = leaves.get(j, nd_idx_mx);
            final int nd_my = leaves.get(j, nd_idx_my);
            final int nd_sx = leaves.get(j, nd_idx_sx);
            final int nd_sy = leaves.get(j, nd_idx_sy);
            final int nd_index = leaves.get(j, nd_idx_index);
            final int nd_depth = leaves.get(j, nd_idx_depth);
            leaf_insert(nd_index, nd_depth, nd_mx, nd_my, nd_sx, nd_sy, element);
        }
    }

    private void leaf_insert(int node, int depth, int mx, int my, int sx, int sy, int element)
    {
        // Insert the element node to the leaf.
        final int nd_fc = nodes.get(node, node_idx_fc);
        nodes.set(node, node_idx_fc, enodes.insert());
        enodes.set(nodes.get(node, node_idx_fc), enode_idx_next, nd_fc);
        enodes.set(nodes.get(node, node_idx_fc), enode_idx_elt, element);
    
        // If the leaf is full, split it.
        if (nodes.get(node, node_idx_num) == max_elements && depth < max_depth)
        {
            // Transfer elements from the leaf node to a list of elements.
            IntList leaf_elts = new IntList(1);
            while (nodes.get(node, node_idx_fc) != -1)
            {
                final int index = nodes.get(node, node_idx_fc);
                final int next_index = enodes.get(index, enode_idx_next);
                final int elt = enodes.get(index, enode_idx_elt);
    
                // Pop off the element node from the leaf and remove it from the qt.
                nodes.set(node, node_idx_fc, next_index);
                enodes.erase(index);
    
                // Insert element to the list.
                leaf_elts.set(leaf_elts.pushBack(), 0, elt);
            }
    
            // Start by allocating 4 child nodes.
            final int fc = nodes.insert();
            nodes.insert();
            nodes.insert();
            nodes.insert();
            nodes.set(node, node_idx_fc, fc);
    
            // Initialize the new child nodes.
            for (int j=0; j < 4; ++j)
            {
                nodes.set(fc+j, node_idx_fc, -1);
                nodes.set(fc+j, node_idx_num, 0);
            }
    
            // Transfer the elements in the former leaf node to its new children.
            nodes.set(node, node_idx_num, -1);
            for (int j=0; j < leaf_elts.size(); ++j)
                node_insert(node, depth, mx, my, sx, sy, leaf_elts.get(j, 0));
        }
        else
        {
            // Increment the leaf element count.
            nodes.set(node, node_idx_num, nodes.get(node, node_idx_num) + 1);
        }
    }


    // ----------------------------------------------------------------------------------------
    // Element node fields:
    // ----------------------------------------------------------------------------------------
    // Points to the next element in the leaf node. A value of -1 
    // indicates the end of the list.
    static final int enode_idx_next = 0;

    // Stores the element index.
    static final int enode_idx_elt = 1;

    // Stores all the element nodes in the quadtree.
    private IntList enodes = new IntList(2);

    // ----------------------------------------------------------------------------------------
    // Element fields:
    // ----------------------------------------------------------------------------------------
    // Stores the rectangle encompassing the element.
    static final int elt_idx_lft = 0, elt_idx_top = 1, elt_idx_rgt = 2, elt_idx_btm = 3;

    // Stores the ID of the element.
    static final int elt_idx_id = 4;

    // Stores all the elements in the quadtree.
    private IntList elts = new IntList(5);

    // ----------------------------------------------------------------------------------------
    // Node fields:
    // ----------------------------------------------------------------------------------------
    // Points to the first child if this node is a branch or the first element
    // if this node is a leaf.
    static final int node_idx_fc = 0;

    // Stores the number of elements in the node or -1 if it is not a leaf.
    static final int node_idx_num = 1;

    // Stores all the nodes in the quadtree. The first node in this
    // sequence is always the root.
    private IntList nodes = new IntList(2);

    // ----------------------------------------------------------------------------------------
    // Node data fields:
    // ----------------------------------------------------------------------------------------
    static final int nd_num = 6;

    // Stores the extents of the node using a centered rectangle and half-size.
    static final int nd_idx_mx = 0, nd_idx_my = 1, nd_idx_sx = 2, nd_idx_sy = 3;

    // Stores the index of the node.
    static final int nd_idx_index = 4;

    // Stores the depth of the node.
    static final int nd_idx_depth = 5;

    // ----------------------------------------------------------------------------------------
    // Data Members
    // ----------------------------------------------------------------------------------------
    // Temporary buffer used for queries.
    private boolean temp[];

    // Stores the size of the temporary buffer.
    private int temp_size = 0;

    // Stores the quadtree extents.
    private int root_mx, root_my, root_sx, root_sy;

    // Maximum allowed elements in a leaf before the leaf is subdivided/split unless
    // the leaf is at the maximum allowed tree depth.
    private int max_elements;

    // Stores the maximum depth allowed for the quadtree.
    private int max_depth;
}

Temporary Conclusion

Again sorry, it's a bit of a code dump answer. I will come back and edit it and try to explain more and more things.

Please refer to the original answer for details on the overall approach.

Lucaslucca answered 20/1, 2018 at 9:9 Comment(20)
Great set of posts! Do you have an example on how to go about using a quadtree like this? I assume at the start of a frame, for each agent, you would call Insert(), to check for a collsion an agent could call Query(), the Cleanup() method is called at the end of the frame. I'm struggling to see where Traverse() comes in.Anyplace
@MattClegg For me it's just start off inserting agents to tree, then on every timestep remove and reinsert them. I don't specialize the case of them moving although it might offer some very small optimization opportunities. You can think of it like the "agents live directly in the hierarchy".Lucaslucca
This entire series of answers has been a massive help, and this Java code tied it all together. I'm learning in C# and it was indeed much easier for me to understand. If anyone is using it, make sure to fix the bug mentioned in the C version, and note that IntList creates a lot of heap allocations in the local methods like find_leaves. Changing the 128 to 16 in the data and num fields provides an considerable and also extremely easy speedup.Accountable
I took at shot at implementing my own version of SmallList in C# and replacing the temporary uses of IntList, described here: #62524465 I'm sure there's tons of room for improvement, but so far it seems to work!Accountable
@Appleguysnake Awesome! It is actually a shame that more standard libraries don't include it. Funnily they updated std::string using an SBO (small buffer optimization) just like SmallList but they didn't provide the equivalent of std::vector (similar to ArrayList) with one. It can be incredibly more efficient for creating large numbers of small temporary lists in rapid succession since it can just utilize the stack in those cases. Actually a lot of C++ code I've profiled is slower than C equivalents using fixed buffers on the stack for that precise reason, and small lists...Lucaslucca
@Appleguysnake ...using SBOs in the standard lib, used appropriately by C++ programmers, would wipe out those hotspots. The main thing you have to be careful about is not to store these persistently in large containers that are kept around, since they do occupy a lot of memory for the SBO. They're meant to be used for temporary lists you just create, access, and quickly discard -- local variables in a function.Lucaslucca
@Appleguysnake Also in terms of the SBO size, you can actually make it quite large. It's common practice in C to use sizes as large as 512 or sometimes even like 2048 bytes for a stack buffer. There is a practical limit since we don't want a stack overflow, but you can be quite generous with the space you use. It won't affect its performance when used on the stack. A clever and generalized solution might look at the size of the element you are storing and adjust the SBO size based on that, like sbo_capacity = divide_ceil(1024, sizeof(Type)).Lucaslucca
@Appleguysnake My dream programming language would actually have a built-in list type that uses SBOs when created as local temporary variables in a function, and avoid using them when they are stored in a larger container. Then we could just have one generalized random-access sequence we can use without thinking much.Lucaslucca
@Appleguysnake One small note but if the way you are using the structure causes your queries (assuming they are used for collision) to often return more than 16 elements, it might get a bit bottlenecky if that's a fairly common case. It is perfectly great though to use a larger SBO for the small list, but I would have a look at the average number of elements returned from a query. A decent-performing range should often be something like 4-16 elements returned so I would try to adjust the resolution (size of the grid cell) to try to make that the common case if possible.Lucaslucca
@DragonEnergy Woah, thanks! The change to 16 was just an optimization in the straight C# port where the arrays aren't stack allocated. As I worked through it more, I used array pools and things like that to get around the limitation of not being able to use the stack. It was a good way to learn the limitations of working in C# and where trying to go low-level in a higher level language stops paying off.Accountable
@DragonEnergy Also, this is what I ended up with after finally implementing all your examples - github.com/Appleguysnake/DragonSpace-Demo You said you just wanted to know what people do with your code, so here's one example! Also, you mentioned looking for other people using the same techniques. I learned Unity defaults to sweep-and-prune, but also has an option to use grids for collision detection. I also found this example from 1999 talking about hierarchical grids - tulrich.com/geekstuff/partitioning.htmlAccountable
@DragonEnergy Also the game of life program Golly has tons of interesting micro-optimizations that break the grid into multiple levels of hierarchical cells, each optimized for cache efficiency. It's absurdly fast without needing to rely on parallelism. I hope some of those help in your search for terminology and peers! And I'll tempt the wrath of SO by saying thanks, this is genuinely one of the best explanations of a programming topic I've ever read, it was a massive help to me.Accountable
@Appleguysnake Awesome stuff! Cheers! TBH I was expecting downvotes left and right on this one since I got so rambly with it and I anticipated the question would be closed for being so broad. I like those though -- I don't really have much programming experience I can share with questions where someone is just asking why their linked list implementation segfaults or whatever. I found this question as an excuse to just ramble away as I like to do. :-D The ultimate reward for me just sharing what I've learned is just see more efficient stuff out there...Lucaslucca
@Appleguysnake ... it bothers me that there are retro-style indie games out there today just using 2D sprites that have framerate hiccups with just dozens or hundreds of entities on a given level/map when people from the SNES/Megadrive days had no problems with that. So I have a soft spot for starting gamedevs in particular and want to share not only ideas related to spatial indices like quadtrees but the broad way to think about this stuff and get the realtime 60+ sweet and consistent FPS we all want.Lucaslucca
@Appleguysnake From my standpoint, and I'm kind of weird and divisive on this, it helps to microtune your central data structures (with profiler in hand) in any application/architecture you design and build. It's not to get max efficiency but to make it so you don't have to worry about efficiency everywhere else. You get the critical areas fast and generalized and then you can just use a high-level scripting language or whatever boosts productivity the most for everything else and be confident that those frame rates will remain smooth and silky.Lucaslucca
@Appleguysnake For stacks, one thing to keep in mind and remember is that the speed of the stack isn't so much related to having a dedicated stack pointer register or an operating system that preallocates stack space for each thread. It's mostly from just accessing the same memory regions over and over with simple push/pop LIFO patterns which plays well to temporal and spatial locality. So if you are working with a language that doesn't let you allocate things on the stack so much, there might be another way to match the performance.Lucaslucca
@Appleguysnake I like how C# lets you bypass GC in a contiguous array with struct to my understanding unless it stores references to other objects. The critical thing to me is not whether locals are allocated on the stack. That's a small thing. The big thing in my domain is that the language/compiler/interpreter guarantees that foo[1] is adjacent in memory to foo[0] and remains that way. That is the key part to start playing well with our CPU caches when we traverse enormous arrays of data. The array elements should be arranged contiguously as far as mem layouts.Lucaslucca
@Appleguysnake I've been thinking of using C# as kind of like an embedded scripting language for a new engine I'm working on. I don't have much expertise in it though, but it seems to have much more stuff that would expand its relevance in more critical hot paths than most without writing super ugly code. I have been leaning on Lua and Python a lot over the years but both use GC for everything and they're dynamically-typed (I hate dynamic typing since it turns a lot of compile-time errors into runtime bugs). Only problem 2 me is the way that it isn't so portable with the exception of Mono.Lucaslucca
@Appleguysnake BTW, have you looked into entity-component systems? I find that so beautiful from an architectural standpoint. I am not such a big fan of OOP. I like functional/procedural more -- especially functional -- and ECS plays well to the functional and procedural mentality. Decoupling is more important to me than info hiding or inheritance. I design all architectures that way these days -- with ECS. Just query component type of interest and plow through an array of them in systems. It's beautiful. I totally recommend trying it for any medium-large scale project if you haven't already.Lucaslucca
I actually learned about ECS and started hacking together my own system on top of Unity. I was getting the hang of it before snapping out of it and realizing how silly and redundant that was, especially for a game that wasn't even prototyped! Unity also has an ECS system in development but it sounds like it's far from ready from primetime. Though I've been told that you can get 90% of the performance gains in normal Unity code with some of the API functionality they've added. A project for another time!Accountable
