I'm thinking of implementing an 8-ary heapsort for uint32_t values. To do this I need a function that returns the index of the maximum element in an 8-element vector, so that I can compare that element with its parent and conditionally perform a swap and further siftDown steps.
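To make the requirement concrete, here is a minimal scalar sketch of the siftDown step I have in mind. The names `maxIndexScalar` and `siftDown8` are made up for illustration, and the layout (root at index 0, children of node i at 8*i+1 .. 8*i+8) is just an assumption; a SIMD version might prefer a layout where each group of children is aligned.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical helper: index (0..count-1) of the largest of `count` values.
// On ties the first occurrence wins. This is the part I'd like to vectorize.
inline std::size_t maxIndexScalar(const std::uint32_t* v, std::size_t count) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < count; ++i) {
        if (v[i] > v[best]) best = i;
    }
    return best;
}

// Sift-down for an 8-ary max-heap stored in a[0..n).
// Assumed layout: children of node i live at 8*i+1 .. 8*i+8.
inline void siftDown8(std::uint32_t* a, std::size_t n, std::size_t i) {
    for (;;) {
        const std::size_t first = 8 * i + 1;
        if (first >= n) return;  // node i is a leaf
        // The last group of children may hold fewer than 8 elements.
        const std::size_t count = (n - first < 8) ? (n - first) : 8;
        const std::size_t child = first + maxIndexScalar(a + first, count);
        if (a[child] <= a[i]) return;  // heap property already holds
        std::swap(a[i], a[child]);
        i = child;  // continue sifting from the child's position
    }
}
```

The loop in `maxIndexScalar` is the part that runs once per heap level, which is why I care about its speed.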
(The 8 uint32_ts can be changed, e.g. to 16 uint32_ts or 8 uint64_ts, or whatever x86 SIMD can support efficiently.)
I have some ideas on how to do that, but I'm looking for something faster than non-vectorized code; in particular, something fast enough to make the whole heapsort fast.
I have clang++ 3.3 and a Core i7-4670, so I should probably be able to use even the newest x86 SIMD thingies.
(BTW: this is part of a bigger project: https://github.com/tarsa/SortingAlgorithmsBenchmark. It already has, for example, a quaternary heapsort, so after implementing a SIMD heapsort I could compare them right away.)
To repeat - the question is: what's the most efficient way to compute the index of the maximum element in an x86 SIMD vector?
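For reference, one of the ideas I have looks roughly like the sketch below; I don't know whether it is anywhere near the fastest way. It assumes AVX2 (compile with -mavx2 on a CPU that supports it) and the GCC/clang builtin `__builtin_ctz`; the name `maxIndex8` is made up. It reduces the vector to its maximum with three shuffle-and-max steps, compares that against the original vector, and takes the lowest set bit of the resulting mask. Without AVX2 it falls back to a plain loop so that it still compiles.

```cpp
#include <cstdint>

#if defined(__AVX2__)
#include <immintrin.h>

// Index (0..7) of the largest of 8 unsigned 32-bit values; first one wins on ties.
inline unsigned maxIndex8(const std::uint32_t* p) {
    const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
    // Step 1: max of each element with its counterpart in the other 128-bit half.
    __m256i m = _mm256_max_epu32(v, _mm256_permute2x128_si256(v, v, 1));
    // Step 2: max with the neighbouring 64-bit pair inside each 128-bit lane.
    m = _mm256_max_epu32(m, _mm256_shuffle_epi32(m, _MM_SHUFFLE(1, 0, 3, 2)));
    // Step 3: max with the adjacent 32-bit element; every element now holds the maximum.
    m = _mm256_max_epu32(m, _mm256_shuffle_epi32(m, _MM_SHUFFLE(2, 3, 0, 1)));
    // Mark the positions equal to the maximum and collect one bit per element.
    const __m256i eq = _mm256_cmpeq_epi32(v, m);
    const int mask = _mm256_movemask_ps(_mm256_castsi256_ps(eq));
    // The mask is never zero, because the maximum equals at least one element.
    return static_cast<unsigned>(__builtin_ctz(static_cast<unsigned>(mask)));
}
#else
// Scalar fallback with the same result, used when AVX2 is not enabled.
inline unsigned maxIndex8(const std::uint32_t* p) {
    unsigned best = 0;
    for (unsigned i = 1; i < 8; ++i) {
        if (p[i] > p[best]) best = i;
    }
    return best;
}
#endif
```

For example, for the input {3, 9, 1, 9, 0, 4, 2, 8} both variants return 1, the first position of the maximum. The horizontal reduction followed by a compare and a movemask feels like a lot of dependent instructions for one heap level, which is why I'm asking whether there is something better.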
PS: It's not a duplicate of linked questions - notice that I'm asking for an index of a maximum element, not just the element value.