Hilbert sort by divide and conquer algorithm?
Asked Answered
G

3

22

I'm trying to sort d-dimensional data vectors by their Hilbert order, for bulk-loading a spatial index.

However, I do not want to compute the Hilbert value for each point explicitly, which in particular requires setting a particular precision. In high-dimensional data, this involves a precision such as 32*d bits, which becomes quite messy to do efficiently. When the data is distributed unevenly, some of these calculations are unnecessary, and extra precision for parts of the data set are necessary.

Instead, I'm trying to do a partitioning approach. When you look at the 2D first order hilbert curve

1   4
|   |
2---3

I'd split the data along the x-axis first, so that the first part (not necessarily containing half of the objects!) will consist of 1 and 2 (not yet sorted) and the second part will have objects from 3 and 4 only. Next, I'd split each half again, on the Y axis, but reverse the order in 3-4.

So essentially, I want to perform a divide-and-conquer strategy (closely related to QuickSort - on evenly distributed data this should even be optimal!), and only compute the necessary "bits" of the hilbert index as needed. So assuming there is a single object in "1", then there is no need to compute the full representation of it; and if the objects are evenly distributed, partition sizes will drop quickly.

I do know the usual textbook approach of converting to long, gray-coding, dimension interleaving. This is not what I'm looking for (there are plenty of examples of this available). I explicitly want a lazy divide-and-conquer sorting only. Plus, I need more than 2D.

Does anyone know of an article or hilbert-sorting algorithm that works this way? Or a key idea how to get the "rotations" right, which representation to choose for this? In particular in higher dimensionalities... in 2D it is trivial; 1 is rotated +y, +x, while 4 is -y,-x (rotated and flipped). But in higher dimensionalities this gets more tricky, I guess.

(The result should of course be the same as when sorting the objects by their hilbert order with a sufficiently large precision right away; I'm just trying to save the time computing the full representation when not needed, and having to manage it. Many people keep a hashmap "object to hilbert number" that is rather expensive.)

Similar approaches should be possible for Peano curves and Z-curve, and probably a bit easier to implement... I should probably try these first (Z-curve is already working - it indeed boils down to something closely resembling a QuickSort, using the appropriate mean/grid value as virtual pivot and cycling through dimensions for each iteration).

Edit: see below for how I solved it for Z and peano curves. It is also working for 2D Hilbert curves already. But I do not have the rotations and inversion right yet for Hilbert curves.

Gybe answered 10/12, 2011 at 20:18 Comment(5)
can you use an OctTree instead?Sulphurize
No, octrees don't scale up well to imbalanced and high-dimensional data. They are good for 3D games that cover space rather evenly.Gybe
This sounds like a dynamic Hilbert R-tree, en.wikipedia.org/wiki/Hilbert_R-treeBoart
Yes, I intend to use this for bulk-loading an R-tree by presorting the objects along to their Hilbert order ("Hilbert packing"). I do not intend to store the entry and/or exit points though, so it won't be a full "dynamic Hilbert R-tree".Gybe
This link may be relevant.Thorton
T
7

Use radix sort. Split each 1-dimensional index to d .. 32 parts, each of size 1 .. 32/d bits. Then (from high-order bits to low-order bits) for each index piece compute its Hilbert value and shuffle objects to proper bins.

This should work well with both evenly and unevenly distributed data, both Hilbert ordering or Z-order. And no multi-precision calculations needed.

One detail about converting index pieces to Hilbert order:

  • first extract necessary bits,
  • then interleave bits from all dimensions,
  • then convert 1-dimensional indexes to inverse Gray code.

If the indexes are stored in doubles:

  • If indexes may be negative, add some value to make everything positive and thus simplify the task.
  • Determine the smallest integer power of 2, which is greater than all the indexes and divide all indexes to this value
  • Multiply the index to 2^(necessary number of bits for current sorting step). Truncate the result, convert it to integer, and use it for Hilbert ordering (interleave and compute the inverse Gray code)
  • Subtract the result, truncated on previous step, from the index: index = index - i

Coming to your variant of radix sort, i'd suggest to extend zsort (to make hilbertsort out of zsort) with two binary arrays of size d (one used mostly as a stack, other is used to invert index bits) and the rotation value (used to rearrange dimensions).

If top value in the stack is 1, change pivotize(... ascending) to pivotize(... descending), and then for the first part of the recursion, push this top value to the stack, for second one - push the inverse of this value. This stack should be restored after each recursion. It contains the "decision tree" of last d recursions of radix sort procedure (in inverse Gray code).

After d recursions this "decision tree" stack should be used to recalculate both the rotation value and the array of inversions. The exact way how to do it is non-trivial. It may be found in the following links: hilbert.c or hilbert.c.

Thorton answered 13/12, 2011 at 12:12 Comment(11)
Re: Edit3. Doesn't seem to work out right yet. According to my sketches, I will actually need to vary the splitting sequence. In 2D, my first two splits are x and y. The third split however will be x for cubes 2 and 3 in my sketch above, whereas for cubes 1 and 4 I will have to split in y (for the divide and conquer to work). This corresponds to rotated iterations of the pattern. In 3D+ it will become even more complex. :-( But +1 nevertheless for your helpful reply.Gybe
How does the ringbuffer determine the next split axis?Gybe
The first axis split after one level varies from pattern to pattern. In the initial setting (which is after splitting x,y), 1 and 4 must be split y,x, while 2 and 3 must be split x,y. This corresponds to the Hilbert basic pattern being used rotated.Gybe
Yes. There is a flaw in this algorithm somewhere. I need to think.Thorton
Alas, all my attempts to find working algorithm for d>2 failed. I think I should just remove this answer.Thorton
No, it is still helpful in understand hilbert curves, I need to study more carefully how the ith significant bit is to be computed. There must be a way, and it must be closely related to the usual methods. Maybe it is the gray encoding that messes things up, and I e.g. need to look at the current state and the previous bit.Gybe
I've made some progress. Pen and paper for the win. ;-) I found a so far mostly working encoding. Interestingly enough, I only need depth(mod d is enough), next axis, and two boolean flags I called "right" and "ascending". Well, now I have to sort out the remaining chaos, then see if it also works for 3+d...Gybe
As far as I know, for 3+d you'll need at least d flags which are needed to "reorder" dimensions in some non-trivial way.Thorton
I thought I read somewhere that there are many 3+d hilbert curves, and one can be constructed by using cyclic permutations of the axes only. That was the one I was planning to use. But yes, it may well show up that instead of the two booleans I have right now I need a Bitset. I have figured out a notion on paper that seems to work, and by drawing the tree I was able to figure out the rules. I'll try that for 3D next, and hope this generalizes to 4+d, too. (3d result is looking already quite good, probably just missing one inversion bit).Gybe
I fixed all the errors in my answer text, and also added couple of links to relatively siple C programs where the procedures of dimensions rotation and the inversions are stated clearly. If information, you already found, is incomplete, look at these links.Thorton
While I didn't get it working right yet, you spent a lot of time and your comments were helpful so I awarded you the bounty.Gybe
E
2

You can compute the hilbert curve from f(x)=y directly without using recursion or L-systems or divide and conquer. Basically it's a gray code or hamiltonian path traversal. You can find a good description at Nick's spatial index hilbert curve quadtree blog or from the book hacker's delight. Or take a look at monotonic n-ary gray code. I've written an implementation in php including a moore curve.

Esposito answered 13/12, 2011 at 1:49 Comment(4)
Well, I don't want to compute the Hilbert curve. I want to sort objects according to their position along it, without computing the full hilbert curve, but only lazly computing those separators that I need.Gybe
@Anonymouse: I don't think you can compute half a curve but you can use a table i.e. precalculated curve.Esposito
Well, I'm not at all looking at curves. I'm just looking at single bits for ordering. The "visual" stuff is not what I'm interested in.Gybe
@Anonymouse: Unfortunaetly it's called hilbert curve and what you suggesting is something different. I know what you mean by sorting. I use it myself. I've written a php version with a moore curve. Also the hilbert curve doesn't hurt the eye.Esposito
H
0

I already answered this question (and others) but my answer(s) mysteriously disappeared. The Compact Hilbert Index implemention from http://code.google.com/p/uzaygezen/source/browse/trunk/core/src/main/java/com/google/uzaygezen/core/CompactHilbertCurve.java (method index()) already allows one to limit the number of hilbert index bits computed up to a given level. Each iteration of the loop from the mentioned method computes a number of bits equal to the dimensionality of the space. You can easily refactor the for loop to compute just one level (i.e., a number of bits equal to the dimensionality of the space) at a time, going only as deeply as needed to compare lexicographically two numbers by their Compact Hilbert Index.

Hedveh answered 29/3, 2012 at 12:3 Comment(1)
But the code is still level-wise, not bitwise. For a high dimensional setting, say 128 dimensions, that seems like a bit more work than necessary, as much less than the 128 dimensions can in fact suffice already to discern all of my data (at least unless very heavily clustered).Gybe

© 2022 - 2024 — McMap. All rights reserved.