Yes, an "implicit tree" with no pointers is a good storage format if you never have to insert/delete, where position in the tree determines array index. It makes it possible to read multiple elements without the data-dependency / latency of pointer chasing.
Some neat tricks you can do with this that are helpful in practice on real superscalar out-of-order x86 CPUs with caches, where throughput is much better than latency:
- SIMD brute force the start of the search
- software-prefetch 2 to 3 levels ahead of where you're checking now. The grandchildren of node k are contiguous at 4*k + 0..3. That comes from 2*(2k), 2*(2k)+1, 2*(2k+1), 2*(2k+1)+1. So you probably just want to prefetch at 4*k or 8*k because the rest of the nodes are probably in the same cache line.
- Actually compare keys from 2 levels ahead, i.e. advance by 2 levels at once in the tree to fill the latency from the data dependency. (Suggested in comments by @BeeOnRope). Do a branchless search of the current and 2 child nodes, increasing memory-level parallelism by requesting all that data before making any decisions based on it. Or even extend this to looking at the 4 grandchildren, going 3 levels at a time.
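To make the prefetch bullet concrete, here is a minimal sketch of a lookup that requests the grandchildren's cache line before comparing the current node. It is not a tuned implementation. The names make_implicit_tree and find_with_prefetch are hypothetical, and the sketch assumes a 1-based root (so children are at 2*k and 2*k+1), larger keys going right, and the GCC/Clang __builtin_prefetch builtin.

```cpp
#include <algorithm>
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// Hypothetical helper: in-order fill of the subtree rooted at index k.
static void fill_in_order(const std::vector<int32_t>& sorted,
                          std::vector<int32_t>& tree, size_t& next, size_t k) {
    if (k >= tree.size()) return;
    fill_in_order(sorted, tree, next, 2 * k);      // left subtree: smaller keys
    tree[k] = sorted[next++];
    fill_in_order(sorted, tree, next, 2 * k + 1);  // right subtree: larger keys
}

// Build the implicit tree from sorted input. Slot 0 is unused padding.
std::vector<int32_t> make_implicit_tree(const std::vector<int32_t>& sorted) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    std::vector<int32_t> tree(sorted.size() + 1);
    size_t next = 0;
    fill_in_order(sorted, tree, next, 1);
    return tree;
}

// Returns the index of key in the tree, or 0 if it is absent.
size_t find_with_prefetch(const std::vector<int32_t>& tree, int32_t key) {
    const size_t n = tree.size();
    size_t k = 1;
    while (k < n) {
        // Grandchildren of k start at 4*k. Clamp so the address stays in bounds.
        const size_t ahead = std::min(4 * k, n - 1);
        __builtin_prefetch(&tree[ahead]);  // assumption: GCC/Clang builtin
        const int32_t v = tree[k];
        if (v == key) return k;
        k = 2 * k + (key > v);  // data-dependent step, no left/right branch
    }
    return 0;
}
```

Prefetching at 8*k instead of 4*k would look 3 levels ahead. Which distance pays off depends on element size and cache behaviour, so it is worth measuring.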
Instead of a strictly binary tree, use an implicit 5-ary tree. An n-ary tree has n-1 keys per node that partition the child nodes into n subtrees. Since you can search 4, 8, or 16 elements efficiently with x86 SIMD (to find the first element greater than your search key), you probably want a 5-ary, 9-ary, or 17-ary tree. (In general, 2^n + 1). You also want to check them for == with SIMD, which you can do in parallel with pcmpeqd.
To find which subtree to choose, do mask |= 1<<(arity-1) before bit-scan on a pcmpgtd / movmskps result. e.g. mask = _mm_movemask_ps(cast(cmp_result)); (a mask where the first 1 is the first element that was greater than the key). But what if your key is greater than all of them, so you want the last subtree? mask will be all zero, but you want _BitScanForward(mask) to give you 4 = arity-1. Do mask |= 1<<4 before bitscan, instead of checking for the mask==0 special case. (You might template your class on the arity of your tree.)
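The mask trick can be sketched for one node of a 5-ary tree as follows. The function names subtree_index_5ary and node_has_key are hypothetical. The sketch assumes signed 32-bit keys, 4 sorted keys per node, SSE2, and the GCC/Clang __builtin_ctz builtin (MSVC would use _BitScanForward instead).

```cpp
#include <cassert>
#include <stdint.h>
#include <emmintrin.h>  // SSE2 intrinsics

// node_keys points at the 4 sorted keys of one node.
// Returns which of the 5 subtrees (0..4) to descend into.
unsigned subtree_index_5ary(const int32_t* node_keys, int32_t key) {
    assert(node_keys != nullptr);
    const __m128i keys =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(node_keys));
    // pcmpgtd: lane i is all-ones where node_keys[i] > key
    const __m128i gt = _mm_cmpgt_epi32(keys, _mm_set1_epi32(key));
    // movmskps: one bit per 32-bit lane
    unsigned mask = static_cast<unsigned>(_mm_movemask_ps(_mm_castsi128_ps(gt)));
    mask |= 1u << 4;  // sentinel bit: key >= all 4 keys selects the last subtree
    return static_cast<unsigned>(__builtin_ctz(mask));
}

// pcmpeqd check for an exact hit inside the node.
bool node_has_key(const int32_t* node_keys, int32_t key) {
    assert(node_keys != nullptr);
    const __m128i keys =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(node_keys));
    const __m128i eq = _mm_cmpeq_epi32(keys, _mm_set1_epi32(key));
    return _mm_movemask_ps(_mm_castsi128_ps(eq)) != 0;
}
```

Because the sentinel bit guarantees a non-zero mask, the bit-scan never sees a zero input and no mask==0 branch is needed.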
Software prefetch is still an option here, but with each "node" being larger (more elements) and the fan-out being bigger, you'd probably have to prefetch multiple cache lines. Branching isn't going to be good, so you're kind of stuck with the data dependency. But you have far fewer levels with log17 instead of log2.
x86 can multiply by 5 or 9 very efficiently with an lea instruction, and compilers know how to do this. (LEA can only left-shift by up to 3, so not 17). But going up the tree by dividing by a non-power-of-2 is less efficient, requiring a multiply and shift instruction for a multiplicative inverse.
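The index arithmetic for such a tree might look like the sketch below, templated on arity. The struct name ImplicitNaryIndex and its members are hypothetical. It assumes nodes are numbered from 0 at the root in breadth-first order, which differs from the 1-based numbering used for the binary tree elsewhere in this answer. With a compile-time arity of 5 or 9 the multiply in child() can become a single lea, and the division in parent() becomes a multiplicative inverse.

```cpp
#include <cassert>
#include <stddef.h>

// Assumption: node 0 is the root and nodes are laid out level by level.
template <unsigned Arity>
struct ImplicitNaryIndex {
    static_assert(Arity >= 2, "a tree needs at least 2 subtrees per node");

    static constexpr size_t keys_per_node = Arity - 1;

    // Node index of the which-th child (0 .. Arity-1) of node.
    static constexpr size_t child(size_t node, unsigned which) {
        return node * Arity + 1 + which;
    }

    // Node index of the parent. The root has no parent.
    static size_t parent(size_t node) {
        assert(node != 0);
        return (node - 1) / Arity;
    }

    // Offset of the node's first key in the flat key array.
    static constexpr size_t first_key(size_t node) {
        return node * keys_per_node;
    }
};
```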
An n-ary tree can be traversed in-order pretty easily, just like a binary tree but with a loop over the keys inside each node.
Going multiple levels per step is not super useful, just increase the arity of your tree instead.
Implicit quaternary tree (4-ary) with a different structure: 1 element per node (not 3), but you decide which of the 4 subtrees to descend by actually looking at all 4 child nodes, not the current node.
Make each node the max (instead of median) of the subtree that it's the root of, so you're looking for the first child node > key. You can do this with SIMD pcmpgtd + movmskps + bsf. The tree is still "full" so the subtrees are still balanced to have equal numbers of nodes, so this does still end in O(log4(N)). Tricks like striding by 2 levels are still possible here, and potentially easier because of nodes being the max of their subtree.
This max-tree data structure might still be in-order traversable, so it might still be useful if that's the property you wanted from a binary tree.
It scales to 8-ary trees with 2x SSE2 vectors or 1x AVX2. (Or with narrower elements for more elements per SIMD vector).
See below (past the section on other data structures) for another section about the CPU architecture challenges of binary searching, and data dependencies (especially with cache misses but even from L2 or even L1d cache) vs. letting branch speculation work its magic at the cost of mispredicts.
You can speed up the start of a find() search in your implicit tree with SIMD brute force over the first (for example) 256 bytes (4 cache lines); x86 has efficient SIMD (linear) searching. Align the start of your array by 128 bytes for efficient HW prefetch and SIMD. That will give you a starting point a few levels deep in your tree, or find an exact hit if there was one in the top few levels.
This skips directly to the 5th or 6th level of the tree, depending on how many bytes / elements you brute-force search this way. (And on the element width: narrower elements make this even better, because more fit into one SIMD vector).
Each level has 2^k elements, and the sum of all the higher levels is 2^k - 1 (a number with all the bits set below bit k). So searching a power-of-2 number of elements checks a complete number of levels plus 1 extra in the next level.
You probably want to use Intel's intrinsics to manually vectorize this, like _mm_cmpeq_epi32 and _mm_cmpgt_epi32 for signed 32-bit integers. Then pack 4x 32-bit compare vectors down to 1 with 2x _mm_packs_epi32 and 1x _mm_packs_epi16. This sets you up for _mm_movemask_epi8 (pmovmskb) to get a bitmap of 16 (SSE2) or 32 (AVX2) elements into an integer register, which you can search by checking if it's non-zero, then __builtin_ctz or _BitScanForward (bsf), whichever your compiler supports. Or BMI1 _mm_tzcnt_32.
So you use SIMD vector compares to get a vector of 0 / -1 elements. Then get that into integer bitmaps in integer registers where you can check them for non-zero or use bitscan instructions to find the first non-zero bit. x86 SIMD can do integer compare for equality or signed greater-than. (Until AVX512 adds unsigned, and your choice of comparison predicates).
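As an illustration of the compare / pack / movemask sequence, here is a minimal SSE2 sketch over 16 signed 32-bit elements. The names load4 and first_greater_in_16 are hypothetical, and __builtin_ctz assumes GCC or Clang.

```cpp
#include <cassert>
#include <stdint.h>
#include <emmintrin.h>  // SSE2 intrinsics

static inline __m128i load4(const int32_t* p) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
}

// Returns the first index i in [0, 16) with a[i] > key, or 16 if there is none.
unsigned first_greater_in_16(const int32_t* a, int32_t key) {
    assert(a != nullptr);
    const __m128i k = _mm_set1_epi32(key);
    // Four compare results, each lane 0 or -1.
    const __m128i c0 = _mm_cmpgt_epi32(load4(a + 0), k);
    const __m128i c1 = _mm_cmpgt_epi32(load4(a + 4), k);
    const __m128i c2 = _mm_cmpgt_epi32(load4(a + 8), k);
    const __m128i c3 = _mm_cmpgt_epi32(load4(a + 12), k);
    // Signed saturation keeps 0 as 0 and -1 as -1 while narrowing.
    const __m128i lo = _mm_packs_epi32(c0, c1);     // 8 x 16-bit
    const __m128i hi = _mm_packs_epi32(c2, c3);     // 8 x 16-bit
    const __m128i bytes = _mm_packs_epi16(lo, hi);  // 16 x 8-bit, in element order
    const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(bytes));
    return mask != 0 ? static_cast<unsigned>(__builtin_ctz(mask)) : 16u;
}
```

The same shape works with _mm_cmpeq_epi32 in place of _mm_cmpgt_epi32 for the equality search.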
If you want unsigned compares, store the tree with the unsigned values range-shifted to signed (x ^ 0x80000000), so you can do that to your key once on input and then use signed greater-than (_mm_cmpgt_epi32), unless you can use AVX512.
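A scalar sketch of that range shift, to show that flipping the top bit preserves ordering. The function names to_signed_order and unsigned_less_via_signed are hypothetical. In a real tree you would apply the flip once to every stored value at build time and once to each search key.

```cpp
#include <cassert>
#include <stdint.h>
#include <string.h>

// Map an unsigned value to a signed one with the same relative order.
int32_t to_signed_order(uint32_t x) {
    const uint32_t flipped = x ^ 0x80000000u;
    int32_t out;
    memcpy(&out, &flipped, sizeof out);  // bit copy, avoids a narrowing conversion
    return out;
}

// Unsigned a < b, computed with a signed compare on the shifted values.
bool unsigned_less_via_signed(uint32_t a, uint32_t b) {
    return to_signed_order(a) < to_signed_order(b);
}
```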
You need to cover the possibility of == key in all levels, and also find where in the last level to stop. Use SSE2 or AVX2 pcmpgtb/w/d (_mm_cmpgt_epi8/16/32) for the greater-than part.
You can switch from doing a search for equality in the first half of the brute-force range (parent levels) over to doing a search for greater-than in the last half (the deepest level). There may be an off-by-one in there; make sure you account for it. It may be ok to allow overlap between the eq and gt elements.
If you have a large collection of 8-bit integers, you probably want to store it as count buckets, not a tree with each copy of the same number stored in separate bytes. (There are only 256 possible values for a byte, so any collection larger than that will have repeats).
This can be an implicit-tree of buckets, or of key/count pairs.
As discussed in comments (now moved to chat):
A binary tree is usually not optimal for repeated lookups on a fixed data set
Traversing the tree requires branch prediction (that can mispredict) or a data dependency that prevents out-of-order execution from working its magic. Either choice has downsides.
Packing your data into a complete balanced tree is time consuming, so if you can do that you can presumably do other pre-processing to create efficient lookups. Or maybe your data is even a compile-time constant.
Hash tables are the standard for O(1) lookup, especially when you can ahead-of-time choose a hash function that does well for the actual data you have (i.e. compile-time constant lookup table). Finding a "perfect hash" function is sometimes possible, and removes the possibility of collisions. (A minimal perfect hash is even harder to find, but possible for some small sets: a function that maps to an array no bigger than the number of elements). std::unordered_set is a hash table.
Hash tables don't allow any kind of ordered traversal from the element you find, though. And C++ (ordered) std::set is typically a Red-Black tree (optimized for insert/delete as well as lookup) so it's worse than you can do with careful tuning of an implicit tree.
Other options could include a bitwise Trie that goes through the bits of the key to choose left/right in a tree. Opendatastructures.org has a chapter on sorted sets. They discuss a bitwise Trie, and 2 refinements: using hashing for lookups while descending the tree, leading to find(x) in O(log w) expected time instead of O(w), where w is the bit-width of your integers.
Or maybe something that looks at 2 or 4 bits at once, selecting one of 4 or 16 children. (i.e. a B-tree). Although reducing tree depth with a B-tree is probably not good for in-memory data vs. other options.
If you need a tree-like data structure but not necessarily a binary tree, that might still be good. (I'm not sure if you'd still need integer compares, or if you'd just be checking pointers for non-NULL. If you did need a compare, pcmpgtd / movmskps + bsf to do 4 integer compares in parallel and produce a bitmap or index of first greater-than could be useful).
For future readers: if you don't specifically need this tree layout for other reasons, consider a different data structure for storing an ordered or un-ordered set.
If you have so many values that a presence/absence bitmap would be "dense" (and your set doesn't need to represent duplicates), that's a good option. e.g. uint16_t data with more than about 8k elements means that on average 1 in 8 bits are set. Or look for the break-even in memory size, e.g. 65536 bits = 8k bytes vs. 4k elements * 2 bytes/element = 8k bytes.
In C you'd probably want to store this as an array of size_t[] because that's likely to be as wide as a CPU register.
Probing for an element requires only 1 memory access (and a bit test within the loaded word, which x86 can do efficiently with bt reg,reg, but in C you'd just write if(x & (1ULL<<pos)) or if((x>>pos) & 1) and hope your compiler does that optimization instead of using a separate shift).
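A minimal sketch of such a presence bitmap for uint16_t values follows. The class name PresenceBitmap16 is hypothetical, and it uses fixed 64-bit words rather than size_t so the shift amounts are the same on every target.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <vector>

// One bit per possible uint16_t value: 65536 bits = 8192 bytes.
class PresenceBitmap16 {
    std::vector<uint64_t> words_;

public:
    PresenceBitmap16() : words_(65536 / 64, 0) {}

    void insert(uint16_t v) {
        words_[v >> 6] |= uint64_t(1) << (v & 63);
    }

    // One load plus a bit test. Compilers may turn this into bt on x86.
    bool contains(uint16_t v) const {
        return ((words_[v >> 6] >> (v & 63)) & 1) != 0;
    }

    size_t size_bytes() const { return words_.size() * sizeof(uint64_t); }
};
```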
Finding the previous/next element that is present is also efficient: use pcmpeqb to search (forward or backward) in the bitmap for the first 16-byte chunk containing a non-zero bit. When you find it, use bsf or bsr on the byte you found. A good C++ std::vector<bool> specialization will do that for you, but don't count on it; the LLVM project's libc++ has a good std::find() specialization, but the libstdc++ implementation compiles to garbage. See also my comments on that blog post, including a godbolt link.
Traversing this data in order is also efficiently possible. x = (x-1) & x; clears the lowest set bit. If it's non-zero, you can find the new lowest bit with a bitscan.
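The in-order walk over one bitmap word can be sketched like this. The function name set_bit_positions is hypothetical, and __builtin_ctzll assumes GCC or Clang.

```cpp
#include <cassert>
#include <stdint.h>
#include <vector>

// Returns the positions of the set bits in x, lowest first.
std::vector<unsigned> set_bit_positions(uint64_t x) {
    std::vector<unsigned> out;
    while (x != 0) {
        // Bit-scan for the lowest set bit. Only called with x != 0.
        out.push_back(static_cast<unsigned>(__builtin_ctzll(x)));
        x &= x - 1;  // clear the lowest set bit
    }
    return out;
}
```

For a whole bitmap you would run this per word and add 64 times the word index to each position.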
If you do need a binary tree for some reason
(e.g. maybe you want to do something with the subtree starting at the element you found, maybe even taking advantage of the layout details of the array of integers.
Or just as a point of comparison against other data structures.)
Choose a type that's as narrow as possible to reduce the cache footprint, leading to more cache hits. x86 movzx zero-extending loads are exactly as efficient as normal loads on modern CPUs (no extra latency for the zero-extension), or a cmp [mem],reg is possible with 8, 16, 32, or 64-bit operand-size.
Your implicit tree (with no pointers, just implicit next_idx = 2*idx + 0 or +1, like a heap but with the nodes sorted) is a good way to store a binary search tree.
As discussed in comments on the question (now moved to chat), branch prediction + speculative execution works as a prefetch for future loads, and the 2*k + 0 or 1 formula means that there's spatial locality between both possibilities for several levels forward.
Or if you choose to use a data-dependency instead of a branch (probably not a good plan), you could SW prefetch future cache lines a few levels ahead of the one you're currently comparing, to help the HW achieve some memory-level parallelism despite the data dependency which would otherwise serialize the loads.
Out-of-order exec doesn't speculate on load results. There aren't any current x86 CPUs that attempt value prediction to make branchless 2*k + (key<arr[k]) speculative, so the next load won't have its load address ready until the current load + compare + add finishes.
Or if you need the binary tree version of your data for something else, you might be able to index it with another data structure that allows more efficient lookups, like a hash table that maps integers to pointers (or indices) into the array.
Or maybe index one mid-way level of it with some kind of ordered data structure. That gives you a starting point for your deeper search. If checking the next/previous element(s) in the array (sibling / cousin nodes in the tree) lets us tell if the value we're looking for was in a parent level, not child, then this could be extra handy. Otherwise we'd probably want to index the upper levels of the tree, too.
That's only about twice as much data (or going one level higher for the same size).
Footnote 1: A sorted array with a plain binary search is basically equivalent; the binary search algorithm computes a new array index at each step with branching or a data dependency.
But it has worse locality for the start of different searches: the first few elements that are touched as part of most searches are scattered across multiple cache lines, instead of all at the start of the array. i.e. the "hot" common elements have worse locality and need more cache lines to stay hot.
A sorted array has good locality once binary search gets close to the right place, and at that point you probably want to switch to an SSE2 or AVX2 SIMD linear search once you find the right 16 to 64-element range spanning a cache line or two, especially if you have AVX2.
This is possible with brute force (no branching or data dependencies): just use pcmpeqd / movmskps to compute (in an integer register) a bitmap of == key. Search it with bsf or tzcnt for the position of the matching element. A zero bitmap means no match, so check for that first, because bsf leaves its result undefined for a zero input. Or just check if it's non-zero if you don't care about position.
This is a simple version of what I'm suggesting for the start of a search over an implicit binary tree.
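A minimal sketch of that equality search over a small window of a sorted array, using SSE2. The name find_equal_simd is hypothetical. It assumes the window length is a multiple of 4 and at most 32 elements, so the whole bitmap fits in one 32-bit integer, and it uses the GCC/Clang __builtin_ctz builtin.

```cpp
#include <cassert>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  // SSE2 intrinsics

// Returns the index of key in a[0..n), or -1 if it is not present.
// Assumption: n is a multiple of 4 and n <= 32.
int find_equal_simd(const int32_t* a, size_t n, int32_t key) {
    assert(n % 4 == 0 && n <= 32);
    const __m128i k = _mm_set1_epi32(key);
    uint32_t bitmap = 0;
    for (size_t i = 0; i < n; i += 4) {
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        // pcmpeqd + movmskps: 4 bits, one per element that equals key
        const uint32_t m = static_cast<uint32_t>(
            _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(v, k))));
        bitmap |= m << i;  // no early exit: every vector is compared
    }
    // Test for zero before the bit-scan.
    return bitmap != 0 ? __builtin_ctz(bitmap) : -1;
}
```

With a fixed n the loop can be fully unrolled, leaving only the final zero check as a branch.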