Using SIMD/AVX/SSE for tree traversal

I am currently researching whether it is possible to speed up van Emde Boas tree traversal (or any tree traversal). Given a single search query as input, and with multiple tree nodes already packed into one cache line (van Emde Boas layout), tree traversal seems to be instruction-bottlenecked.

Being kinda new to SIMD/AVX/SSE instructions, I would like to know from experts on that topic whether it is possible to compare multiple nodes at once against a value and then find out which tree path to follow further on. My research led to the following questions:

How many CPU cycles/instructions are wasted on constructing SIMD/AVX/SSE registers etc.? This would render the approach useless if construction takes more time than traversing the whole sub-tree manually (2+4+8 nodes in 1 cache line of 64 bytes).

How many CPU cycles/instructions are wasted on extracting, from the SIMD/AVX/SSE comparison result, the answer of which path to follow next? Could anybody come up with a smart way so that instructions like those findMinimumInteger AVX operations could be used to decide that in 1 (??) CPU cycle?
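To make this concrete, the kind of idiom I am imagining looks roughly like the sketch below (untested; the 4-key node and the name which_child_sse2 are just for illustration):

#include <immintrin.h>
#include <stdint.h>

// sketch: compare one query against 4 sorted keys of a node at once.
// _mm_set1_epi32 is the register "construction" I am worried about:
// it broadcasts the query into all four lanes.
static inline unsigned which_child_sse2(const int32_t keys[4], int32_t query)
{
    __m128i k  = _mm_loadu_si128((const __m128i*)keys);
    __m128i q  = _mm_set1_epi32(query);
    __m128i gt = _mm_cmpgt_epi32(k, q);  // -1 in each lane where keys[i] > query
    unsigned mask = (unsigned)_mm_movemask_ps(_mm_castsi128_ps(gt));
    // with ascending keys, the number of keys <= query is the child index
    return _tzcnt_u32(mask | 0x10);      // 4 when no key is greater (BMI1)
}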

What is your guess?

Another, more tricky approach to speeding up tree traversal would be to run multiple search queries down the tree at once, when there is a high probability of landing in nodes close together in the last tree level. Any guesses on this? Of course it would have to set aside those queries that no longer belong to the same sub-tree, and then recursively find them after finishing the first "parallel traversal" of the tree. A core step for this idea is sketched below. The tree queries have sequential, though not constant, access patterns (query[i] always < query[i+1]).
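Only a sketch of the inner step; split_queries_avx2 is a made-up name, and all the bookkeeping for setting queries aside is omitted:

#include <immintrin.h>
#include <stdint.h>

// sketch: test 8 sorted queries against one node key in a single compare.
// because the queries are sorted, the result mask has a single split point:
// queries below it go to the left child, the rest go right.
static inline unsigned split_queries_avx2(const int32_t queries[8], int32_t node_key)
{
    __m256i q  = _mm256_loadu_si256((const __m256i*)queries);
    __m256i k  = _mm256_set1_epi32(node_key);
    __m256i gt = _mm256_cmpgt_epi32(q, k);  // -1 where queries[i] > node_key
    unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(gt));
    return _tzcnt_u32(mask | 0x100);        // index of the first query going right
}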

Important: this stuff is about integer trees, which is why a van Emde Boas tree is used (maybe x-fast/y-fast tries later on).

I am curious about your two cents on this issue, given that one might be interested in the highest achievable performance on large-scale trees. Thank you in advance for the time you spend on this :-)

Udell asked 16/12, 2013 at 17:11 Comment(3)
If you have lots of trees, I'd be tempted to make each tree search a parallel thread. (We do this in a program analysis/transformation tool we build; it seems to work reasonably.) Why isn't that one of your considered options? Another idea: if you have multiple queries, and you know what they are in advance, you can compile them into an FSA used to guide the searches. The part of the FSA generated by common query subterms is processed only once, at a considerable savings. (Look at LR parsers for a similar pattern-product trick.) - Edrisedrock
We will use massive threading anyway. This is just about a single tree's most efficient implementation on AVX512 hardware. - Udell
Related: What is the most efficient way to implement a BST in such a way the find(value) function is optimized for random values in the tree on x86? - my answer there proposes an N-ary tree (like 4 keys per block to select one of 5 nodes to look at next, or 8 to 16, since more parallelism at each step reduces depth; not a binary search), i.e. something that you can quickly brute-force search with SSE2/AVX2. That question is about a fixed dictionary, mutability not needed, so an implicit tree that doesn't store pointers is great for density. - Express

I've used SSE2/AVX2 to help perform a B+tree search. Here's code to perform a "binary search" on a full cache line of 16 DWORDs in AVX2:

#include <immintrin.h>  // AVX2 and BMI (_tzcnt_u32) intrinsics
#include <stdint.h>

// perf-critical: ensure this is 64-byte aligned. (a full cache line)
union alignas(64) bnode
{
    int32_t i32[16];
    __m256i m256[2];
};

// returns from 0 (if value < i32[0]) to 16 (if value >= i32[15]) 
unsigned bsearch_avx2(bnode const* const node, __m256i const value)
{
    __m256i const perm_mask = _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0);

    // compare the two halves of the cache line.

    __m256i cmp1 = _mm256_load_si256(&node->m256[0]);
    __m256i cmp2 = _mm256_load_si256(&node->m256[1]);

    cmp1 = _mm256_cmpgt_epi32(cmp1, value); // PCMPGTD
    cmp2 = _mm256_cmpgt_epi32(cmp2, value); // PCMPGTD

    // merge the comparisons back together.
    //
    // a permute is required to get the pack results back into order
    // because AVX-256 introduced that unfortunate two-lane interleave.
    //
    // alternately, you could pre-process your data to remove the need
    // for the permute.

    __m256i cmp = _mm256_packs_epi32(cmp1, cmp2); // PACKSSDW
    cmp = _mm256_permutevar8x32_epi32(cmp, perm_mask); // PERMD

    // finally create a move mask and count trailing
    // zeroes to get an index to the next node.

    unsigned mask = _mm256_movemask_epi8(cmp); // PMOVMSKB
    return _tzcnt_u32(mask) / 2; // TZCNT
}

You'll end up with a single highly predictable branch per bnode, to test if the end of the tree has been reached.
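For context, the outer loop around it is nothing special. A sketch of how it could look (my real code differs, and the children layout here is illustrative):

// sketch of the surrounding traversal; `children` layout is illustrative.
struct btree_node
{
    bnode keys;                        // the 16-key cache line from above
    btree_node* children[17];          // child i taken when i32[i-1] <= key < i32[i]
};

const btree_node* find_leaf(const btree_node* n, int32_t key)
{
    __m256i const v = _mm256_set1_epi32(key);
    while (n->children[0] != nullptr)  // the one predictable branch per node
        n = n->children[bsearch_avx2(&n->keys, v)];
    return n;
}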

This should be trivially scalable to AVX-512.
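With AVX-512, the compare writes a mask register directly, so the pack/permute/movemask dance disappears entirely. A sketch of how that could look (untested; assumes AVX-512F):

// sketch: one 64-byte load covers the entire node.
unsigned bsearch_avx512(bnode const* const node, __m512i const value)
{
    __m512i keys = _mm512_load_si512(node);
    __mmask16 gt = _mm512_cmpgt_epi32_mask(keys, value); // bit i set where i32[i] > value
    return _tzcnt_u32((unsigned)gt | 0x10000);           // 16 when no key is greater
}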

To preprocess and get rid of that slow PERMD instruction, this would be used:

// swap i32[4..7] with i32[8..11], so that PACKSSDW's two-lane
// interleave comes out in order and the VPERMD in the search
// can be dropped.
void preprocess_avx2(bnode* const node)
{
    __m256i const perm_mask = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);
    __m256i *const middle = (__m256i*)&node->i32[4];

    // unaligned load/store: `middle` straddles the two halves of the line.
    __m256i x = _mm256_loadu_si256(middle);
    x = _mm256_permutevar8x32_epi32(x, perm_mask);
    _mm256_storeu_si256(middle, x);
}
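With nodes stored in that permuted order, the body of bsearch_avx2 then drops the VPERMD line and nothing else changes:

// search on a preprocessed node: same as above, minus the permute.
__m256i cmp = _mm256_packs_epi32(cmp1, cmp2); // PACKSSDW, results already in order
unsigned mask = _mm256_movemask_epi8(cmp);    // PMOVMSKB
return _tzcnt_u32(mask) / 2;                  // TZCNT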
Orestes answered 16/12, 2013 at 17:49 Comment(13)
Your B-tree nodes fit in a single cache line. I can't imagine that SSE (etc.) would provide much of a performance benefit even if the B-tree fit entirely into the cache (which seems like a pretty strange case). I've built in-memory B-trees in assembler that have these same constraints; pretty much you only get a real "single branch" per node because the branch predictor pretty much gets it right. In the worst case, you can do a binary search on the keys in the node; there are only 6 on average. Can you quote numbers with and without SSE for comparison? - Edrisedrock
I'm at work right now so I can't look for the code. The SIMD is basically a quick way to perform a binary search on a fixed number of integers, and reduces those branches. That's all it does. - Orestes
I am looking forward to testing this, since we will run on AVX512-supporting devices. I was thinking of putting all data in the last level of the tree and using the first log2(n)-1 levels as a fast query accelerator; that fits more nodes into a cache line (no data pointers needed there if the tree is static); it would also remove the requirement to check for equality on every node check / loop iteration - only one == is needed after finishing all iterations. - Udell
BTW, any special reason for storing branch pointers? I strongly feel it's a waste of cache space. Shifting down by 4 bytes instead of 1 for binary trees should work fine. - Udell
If your B-tree nodes can be anywhere, how can you avoid the pointers? Are you assuming that the tree is contiguous in memory? - Edrisedrock
Branch pointers were needed in my case, but it's easy to see cases that can optimize around them. - Orestes
And Ira, a custom allocator can be made to provide contiguous nodes and a base pointer. - Orestes
Yes, I would put the nodes next to each other. The space saved by not using pointers can be used to oversize the memory allocation for dynamic trees. With this project, we are fine with statically sized trees. That's also why I'm thinking about using only the last level of the tree, which wouldn't work so well if you needed to insert into it. - Udell
I've added a benchmark below on how the various methods work out on tree traversal! Thank you guys for helping me a lot with the AVX part. - Udell
Another interesting angle: if you erase the penalty of mispredicted branches, becoming more efficient on the instruction side but somewhat bound by the loading of new data, one could do additional operations on the data while waiting for the next data to arrive. I could imagine that having some use in game engines. I was also wondering when the "load cache line +1" prefetching gets issued. So far, my tree memory layout does not offer any +1-cache-line paths the way DFS in cache-line chunks (vEB there) could. Potential improvement? - Udell
I imagine you could take some advantage of prefetching if you make your algorithm process in steps and do something else in between. I doubt it would be useful otherwise, but I'd be curious to see what you come up with. - Orestes
That's not actually a binary search inside the B-tree node; it's an O(N * log2(N)) parallel brute-force search, which is really good for small N (N = 2 ymm vectors in this case). (The log2(N) part is the packing down to a single scalar bitmap. Although for large N, we'd still only merge down to byte elements, then do a final merging step after vpmovmskb, and use multiple _tzcnt_u64. So I guess it's really O(N).) Anyway, it looks optimal to me for this problem size, and nice to store in permuted form, so you can use vpackssdw with no shuffling. - Express
@Udell and Cory: See also What is the most efficient way to implement a BST in such a way the find(value) function is optimized for random values in the tree on x86? - my answer there proposes an N-ary tree (like 4 keys per block to select one of 5 nodes to look at next), basically the same idea as a B-tree, something that you can quickly brute-force search with SSE2/AVX2. But make it an implicit tree that doesn't store pointers, if it doesn't need to be quickly mutable. I didn't implement it or benchmark the cache locality vs. branching, etc. - Express

Based on your code, I went ahead and benchmarked 3 options: AVX2-powered, nested branching (4 jumps), and a branchless variant. These are the results:

// Performance table
// All using cache-line-sized, 64-byte-aligned chunks (van Emde Boas layout); loop unrolled per cache line;
// all optimizations turned on. Each element is 4 bytes. Intel i7 4770k Haswell @ 3.50GHz

Type        ElementAmount       LoopCount       Avg. Cycles / Query
===================================================================
AVX2        210485750           100000000       610 cycles    
AVX2        21048575            100000000       427 cycles           
AVX2        2104857             100000000       288 cycles 
AVX2        210485              100000000       157 cycles   
AVX2        21048               100000000       95 cycles  
AVX2        2104                100000000       49 cycles    
AVX2        210                 100000000       17 cycles 
AVX2        100                 100000000       16 cycles   


Type        ElementAmount       LoopCount       Avg. Cycles / Query
===================================================================  
Branching   210485750           100000000       819 cycles 
Branching   21048575            100000000       594 cycles 
Branching   2104857             100000000       358 cycles 
Branching   210485              100000000       165 cycles 
Branching   21048               100000000       82 cycles
Branching   2104                100000000       49 cycles 
Branching   210                 100000000       21 cycles 
Branching   100                 100000000       16 cycles   


Type        ElementAmount       LoopCount       Avg. Cycles / Query
=================================================================== 
BranchLESS  210485750           100000000       675 cycles 
BranchLESS  21048575            100000000       602 cycles 
BranchLESS  2104857             100000000       417 cycles
BranchLESS  210485              100000000       273 cycles 
BranchLESS  21048               100000000       130 cycles 
BranchLESS  2104                100000000       72 cycles 
BranchLESS  210                 100000000       27 cycles 
BranchLESS  100                 100000000       18 cycles

So my conclusion looks like this: when memory access is near-optimal, AVX can help with trees bigger than 200k elements. Below that, there is hardly any penalty to pay (if you don't use AVX for anything else). It's been worth the night of benchmarking this. Thanks to everybody involved :-)
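For reference, the branchless variant descended through each cache line with arithmetic instead of jumps, along these lines (a reconstructed sketch, not the exact benchmarked code; I use a 1+2+4+8 = 15-key implicit layout per line here to keep the indexing clean):

// sketch: branchless descent through one 64-byte line holding a small
// implicit binary tree (15 keys, BFS order, 1-based heap indexing).
static inline unsigned descend_cacheline(const int32_t keys[16], int32_t query)
{
    unsigned i = 1;                                // keys[0] unused, keeps the 2*i math simple
    for (int level = 0; level < 4; ++level)
        i = 2 * i + (unsigned)(query >= keys[i]);  // compilers emit SETcc/CMOV, no jump
    return i - 16;                                 // which of the 16 subtrees to enter next
}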

Udell answered 1/1, 2014 at 6:42 Comment(0)
