SSE4.1 phminposuw is a horizontal min of 8x 16-bit elements. Doing that reduction the generic way takes 3 shuffle/min steps, 6 instructions, so if we can use fewer uops than that to massage our data into an input for _mm_minpos_epu16, we come out ahead. It's a single uop with 3 to 4 cycle latency across Zen / Skylake / Gracemont, so it's also a latency win vs. 6 shuffle/min ops. (https://uops.info/)
It's not available for wider vectors even with AVX-512, so we do need two reduction steps to get our 32 elements down to 8, but we can get them zero-extended to 16-bit for free by being clever about it. (It does also find the position of the min, but only within our already-reduced input, which isn't useful here.)
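For reference, here's a sketch (my code, not from the answer) of the generic 3-step reduction that phminposuw replaces for 8x u16; it assumes SSE4.1 anyway for pminuw:

#include <immintrin.h>
#include <stdint.h>
// generic horizontal min of 8x u16: 3 shuffle+min steps, 6 instructions
static inline uint16_t hmin_epu16_generic(__m128i v)
{
    v = _mm_min_epu16(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1,0,3,2)));   // min of 64-bit halves
    v = _mm_min_epu16(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2,3,0,1)));   // min of 32-bit halves
    v = _mm_min_epu16(v, _mm_shufflelo_epi16(v, _MM_SHUFFLE(2,3,0,1))); // min of the last pair, in element 0
    return (uint16_t)_mm_cvtsi128_si32(v);
}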
Bitwise NOT turns UINT8_MIN (0) into UINT8_MAX (255) and vice versa. It's the same thing as 255-x, which more obviously does what we want: a 1:1 transform that reverses an unsigned <= compare, without wrapping for any uint8_t value. (Thanks to @harold and @Homer512 in comments for fixing my initial idea.)
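As a tiny sketch (my naming, not part of the answer's code), the whole trick in vector form is one XOR with all-ones:

#include <immintrin.h>
// flip every byte so the largest unsigned value becomes the smallest:
// (uint8_t)~x == 255 - x for every x, so the order is exactly reversed, no wrapping
static inline __m256i flip_u8_for_min(__m256i v)
{
    return _mm256_xor_si256(v, _mm256_set1_epi8(-1));   // ~v per byte
}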
@chtz had a good idea: unpack the data with indices, so the index is part of the max or min u16 element we find, skipping the broadcast/compare/movemask/bit-scan. This is orthogonal to using phminpos, and works for any ISA and element size as long as we have a vertical min or max for elements that are twice as wide. The first reduction step changes to doing something to both inputs, widening the elements, instead of just shuffling the high half down to the bottom. The biggest benefit is critical-path latency, but it also helps throughput slightly.
In case of a tie (equal original 8-bit values), the low half, the index, is the tie-break. If you want the first occurrence of the max element, min with inverted elements and normal indices works: the smaller index wins the tie. Or max with normal elements and inverted indices, which you un-invert after some of the reduction, as in the code below.
#include "vectorclass.h"
int maxpos_u8_noscan_unpack(const Vec32uc v)
{
Vec32uc indices = _mm256_setr_epi8( ~0, ~1, ~2, ~3, ~4, ~5, ~6, ~7,
~8, ~9,~10,~11,~12,~13,~14,~15,
~16,~17,~18,~19,~20,~21,~22,~23,
~24,~25,~26,~27,~28,~29,~30,~31);
__m256i lo = _mm256_unpacklo_epi8(indices, v); // pair value:index in a u16 with value most significant
__m256i hi = _mm256_unpackhi_epi8(indices, v); // then find the max of those u16 elements and extract the payload
Vec16us vu16 = _mm256_max_epu16(lo, hi);
Vec8us vu8 = max(vu16.get_low(), vu16.get_high()); // narrow to 128-bit
// on a tie, we take the max ~index, i.e. the min index = first occurrence of the max
// inverted indices allow XOR with set1(-1), a cheaper constant to generate than set1_epi16(0xff00)
auto vinv = ~vu8; // 1:1 mapping that flips max to min, same as 255-v (for bytes)
// Now we're looking for the min u16, whose low 8 bits are the index of the original u8
__m128i minpos = _mm_minpos_epu16(vinv); // min value in the bottom u16 of the vector
unsigned min = _mm_cvtsi128_si32(minpos);
return (uint8_t)min; // movd + scalar movzx is cheaper than pextrb 0
// a smarter compiler would vmovd into a different reg like EDX, allowing mov-elimination for the movzx
}
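A quick sanity check I'd use (my code, assuming the function above is in the same file and vectorclass.h is on the include path); the duplicate max exercises the first-occurrence tie-break:

#include <cstdio>
#include <cstdint>

// scalar reference: index of the first occurrence of the max byte
static int maxpos_u8_ref(const uint8_t *p)
{
    int best = 0;
    for (int i = 1; i < 32; i++)
        if (p[i] > p[best]) best = i;   // strict >, so ties keep the earliest index
    return best;
}

int main()
{
    alignas(32) uint8_t buf[32] = {0};
    buf[5] = 200; buf[20] = 200; buf[9] = 7;   // duplicate max at positions 5 and 20
    Vec32uc v;
    v.load(buf);
    std::printf("ref=%d simd=%d\n", maxpos_u8_ref(buf), maxpos_u8_noscan_unpack(v));
}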
Godbolt with a test loop (uiCA) to see how they inline and hoist constants, i.e. to get a loop I can copy/paste into https://uica.uops.info/
I was going to write the indices as normal numbers and use unpack(~indices, v). But MSVC didn't constant-propagate through the bitwise NOT, doing it at run-time!
Clang is too clever for its own good: it sees that some of the indices input elements aren't used by _mm256_unpacklo_epi8, so it zeros them out. Same for the unpack-hi. But that means it needs to load 2 different vector constants instead of having both unpacks read parts of the same one! (TODO: report on https://github.com/llvm/llvm-project/issues/)
There are multiple strategies for unpacking:

- chtz's suggestion of vpunpckl/hbw, as above. The in-lane behaviour is funky for 256-bit shuffles, but they do pair corresponding elements of the two input vectors. This way needs fewer vector constants (just one non-trivial one, except when clang pessimizes it) and fewer vector ALU uops, but more of them are shuffles, so on Skylake and earlier, port 5 could be a bottleneck depending on the surrounding code. Not a problem as cleanup for your loop in Why performance of this function is so slow on Intel i3-N305 compared to AMD Ryzen 7 3800X? (where chtz's strategy could be expanded to your whole loop so the cleanup is just a max reduction of u16 elements, including the inversion for phminposuw). Ice Lake and later, and Gracemont, can run vpunpck integer shuffles on more than one port, so back-end port throughput bottlenecks aren't a problem there.

- Odd/even: shift the even elements left to the top of a u16, and mask to isolate the odds. (Then XOR to both flip the data and apply the index into the other half.) Fewer of the uops are shuffles, but 1 more total uop than unpack. And more of the instructions are 256-bit YMM, worse on Zen 1 or your Gracemont.

- With AVX-512 (or AVX10 256-bit), vpternlogd is useful (see the sketch after the shift/XOR code below), but not much else.

- Not using chtz's trick at all, i.e. bcast/cmp/movemask/scan: no vector constants needed at all. Perhaps good as a one-off, like cleanup for an infrequently-run loop. But it's 10 uops to get a scalar integer result (starting from data in a YMM vector) vs. 9 for vpunpckl/hbw on an Intel P-core. (7 uops for that to get the index in the bottom byte of an XMM register, or 8 total (including a vpand or vpmovzxbq) to get it zero-extended into a qword at the bottom, where you could vpaddd or vpaddq to accumulate it into an XMM total which you retrieve after your outer loop.)
shift/XOR strategy, fewer shuffles but more uops and constants, possible use-cases on Skylake
int maxpos_u8_noscan_odd_even(const Vec32uc v)
{
// 0,2,4... unpacked with FF bytes
__m256i indices_even = _mm256_setr_epi16(
0xff00, 0xff02, 0xff04, 0xff06, 0xff08, 0xff0a, 0xff0c, 0xff0e,
0xff10, 0xff12, 0xff14, 0xff16, 0xff18, 0xff1a, 0xff1c, 0xff1e);
__m256i indices_odd = _mm256_sub_epi16(indices_even, _mm256_set1_epi16(-1)); // save typing, and it wouldn't be too bad if the compiler actually did it at runtime this way. MSVC 19.14 (2017) actually does!
// _mm256_setr_epi16(0xff01, 0xff03, ...
__m256i even = _mm256_slli_epi16(v, 8); // bring values to the top of each u16
even = _mm256_xor_si256(even, indices_even); // AVX-512: merge-masked vpshufb? But that doesn't flip the data. Do that separately? Use merge-masked sub for odds?
__m256i odd = _mm256_and_si256(v, _mm256_set1_epi16(0xff00));
odd = _mm256_xor_si256(odd, indices_odd); // AVX-512 _mm256_ternlog_epi32 would be great
// NOT the data (255-v) while we unpack with indices; now we're looking for the min element
Vec16us vu16 = _mm256_min_epu16(odd, even); // values are the "key", bringing the index along with them
Vec8us vu8 = min(vu16.get_low(), vu16.get_high()); // narrow to 128-bit
__m128i minpos = _mm_minpos_epu16(vu8); // ~value:index in the bottom u16 of the vector
unsigned min = _mm_cvtsi128_si32(minpos);
return (uint8_t)min; // movd + scalar movzx is cheaper than pextrb 0
}
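If you do have AVX-512VL, here's a hedged sketch (my code, untested) of folding the AND+XOR for the odd elements into one vpternlogd, as the comments above hint; the immediate 0x6A encodes the boolean function (A & B) ^ C, with A = v, B = the 0xff00 mask, C = indices_odd:

#include <immintrin.h>
// assumes AVX-512F + AVX-512VL
static inline __m256i odd_with_index_avx512(__m256i v, __m256i indices_odd)
{
    return _mm256_ternarylogic_epi32(v, _mm256_set1_epi16((short)0xff00), indices_odd, 0x6A);
}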
Earlier version, with bcast / cmp / movemask / tzcnt cleanup
int maxpos_u8(const Vec32uc v)
{
auto vinv32 = ~v; // 1:1 mapping that flips max to min
// now we're looking for minpos_u8(vinv32).
Vec16uc vinv = min(vinv32.get_low(), vinv32.get_high());
// Reduce to 8x u16
Vec8us vus = Vec8us(min(vinv, _mm_srli_epi16(vinv, 8)));
// min of each pair 0-extended into a 16-bit element, since minu(0, x) = 0 in high half
Vec32uc minval = _mm256_broadcastb_epi8(_mm_minpos_epu16(vus)); // bottom u16 is the min value, second u16 is the min pos
auto match = (minval == vinv32);
return _tzcnt_u32(_mm256_movemask_epi8(match));
// VCL has silly inline asm for BSF.
// Use __builtin_ctz, std::countr_zero, or a portable tzcnt intrinsic since we have AVX2
}
AVX2 basically implies BMI1/2, so requiring it for tzcnt too won't exclude any AVX2 CPUs. Compile with GCC/Clang -march=x86-64-v3 (https://en.wikipedia.org/wiki/X86-64#Microarchitecture_levels), or MSVC -arch:AVX2 which actually implies BMI2 as well. Or use std::countr_zero from C++20 #include <bit>, but make sure to enable BMI so compilers can use tzcnt.
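For example (my sketch, not from the answer), the scalar cleanup can be written portably like this; with BMI enabled it should compile to a single tzcnt:

#include <bit>
#include <cstdint>
static inline int first_set_bit(uint32_t movemask_bits)
{
    return std::countr_zero(movemask_bits);   // 32 if no bit is set, but here a match always exists
}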
I changed the arg type to take a vector by value (in a YMM register). This means the non-inline version needs to shuffle instead of loading the 128-bit halves separately to reduce to 128. This is generally a good convention since callers might pass vectors that aren't already in memory, although you want small functions like this to inline.
Godbolt
# clang and GCC make identical asm, except clang spends a vzeroupper when not inlining.
maxpos_u8(Vec32uc):
vpcmpeqd ymm1, ymm1, ymm1 # all-ones mask, can be hoisted from loops
vpxor ymm0, ymm0, ymm1 # NOT = xor(v, -1)
vextracti128 xmm2, ymm0, 0x1 # same first shuffle step
vpminub xmm1, xmm0, xmm2
vpsrlw xmm2, xmm1, 8 # 16-bit element size shifts in zeros
vpminub xmm1, xmm1, xmm2
vphminposuw xmm1, xmm1
vpbroadcastb ymm1, xmm1 # then same cleanup
vpcmpeqb ymm0, ymm0, ymm1
vpmovmskb eax, ymm0
tzcnt eax, eax
ret
According to https://uica.uops.info/, it's 10 uops for the front-end on Skylake (not counting the vpcmpeqd all-ones idiom that can be hoisted out of loops, and not counting the ret), vs. 14 uops for the VCL version (using tzcnt instead of branching around bsf).
It has 15 cycle latency from input to the final ymm0 result. (Plus maybe another 4 or 6 cycles for movemask + tzcnt, which is the same for both versions if we improve the VCL cleanup, or even with bsf on Intel.)
The VCL version is 16 cycle latency from ymm0 input to the vpcmpeqb result, so my version is better by 1 cycle for latency. (Probably by 2 cycles on Zen-family, where phminposuw is only 3c latency instead of 4 as on Intel.)
My version is 10 uops (not counting the vpcmpeqd, assuming it was hoisted out of a loop) vs. 14 for the VCL version (assuming it's changed to use tzcnt).
6 of the uops for the VCL version can only run on port 5 on Skylake, so a loop doing only this would bottleneck on that for throughput. On Ice Lake they're fairly evenly distributed since the shuffles can run on port 1 or 5.
Machine-code size for my version is smaller by 13 bytes (counting the vpcmpeqd set1(-1) since it's part of static code size); smaller is generally good for I-cache density. (The VCL shuffles could have avoided immediate operands for some of them, saving a couple of bytes.)
This worked out better than I thought it might. I was worried we were going to have to invert again after phminposuw to recover the actual max element, but matching against the transformed vinv32 avoids that. (If you do want the value as well, uint8_t maxval = ~_mm_cvtsi128_si32(minpos_result); is just a movd + not, and maybe a movzx if you need it zero-extended to 32-bit.) And phminposuw leaves the value at the bottom of the register with the position above that, so a broadcast to 256 worked without an extra shift or shuffle.
Also, at first I thought we might need an AND or something to zero-extend to 16-bit, or that I'd have to shift left so the value we wanted was in the most-significant half of the u16 elements (so low garbage wouldn't matter). But then we'd have needed to broadcast a byte that wasn't the lowest, which would have required a shift + vpbroadcastb, or AVX-512 vpermb.
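For the unpack-with-indices version at the top, both pieces are already sitting in the low u16 of the phminposuw result; a sketch (my naming) of pulling out value and position with only scalar ops:

#include <immintrin.h>
#include <cstdint>
// the low u16 of the minpos result from maxpos_u8_noscan_unpack is (~max_value << 8) | index
static inline void value_and_pos(__m128i minpos, uint8_t &maxval, int &index)
{
    unsigned lo = (unsigned)_mm_cvtsi128_si32(minpos);
    index  = lo & 0xff;              // position of the (first) max byte
    maxval = (uint8_t)~(lo >> 8);    // undo the bitwise NOT to recover the value
}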
VCL uses bsf!?!?
The most obvious thing to improve on is VCL's insane use of inline asm for a legacy bsf instruction (slow on AMD), instead of using GNU C __builtin_ctz on non-MSVC. That can compile to tzcnt on CPUs that have it (all CPUs with AVX2 have BMI1/2), and it's even safe on CPUs that don't have it: unlike lzcnt, the tzcnt result is compatible with bsf for all cases where __builtin_ctz has a well-defined result (non-zero input), and tzcnt = rep bsf, which runs as bsf on older CPUs. (There's some possible justification for using inline asm for bsr in bit_scan_reverse, but it's probably still better to use 63 - or 31 - __builtin_clz there.)
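For instance (my sketch, C++20), a bit_scan_reverse replacement without inline asm, assuming a non-zero input:

#include <bit>
#include <cstdint>
static inline int bsr_portable(uint32_t x)
{
    return 31 - std::countl_zero(x);   // same as std::bit_width(x) - 1 for x != 0
}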
And we know there will be a match, so horizontal_find_first's if (bits == 0) return -1; is totally useless even if we were using bsf instead of tzcnt. (tzcnt produces 32 or 64 in that case, the operand-size.)
Avoiding the broadcast: stay 256-bit the whole time?
This isn't better for u8 or u16 elements. (Also, chtz's idea of unpacking with indices made it a moot point.)
If you weren't using vphminposuw, you might avoid the vpbroadcastb by doing all your shuffles at 256-bit, so you finish with the max already broadcast, e.g. vperm2i128 (or vpermq) instead of vextracti128. But I'm not sure how you swap bytes within words or dwords without needing a control vector for vpshufb. Perhaps vpalignr ymm, same,same, imm8 with 8 / 4 / 2 / 1? It's port-5-only even on Ice Lake, and higher latency (2 cycles) on Zen 4 vs. vpshufd. The first two shuffles (by 8 and 4) can be done with vpshufd, which can run on port 1 or 5 on Ice Lake and later, for a better balance of uops across execution ports, which is generally good unless the surrounding code under-uses port 5.
But this strategy is worse for throughput on Zen 1 and Alder Lake E-cores (Gracemont), since all the 256-bit ops will be 2 uops. Those cores are still wide enough that it's OK for latency; you don't get resource conflicts.
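For completeness, a sketch of that all-256-bit reduction (my code, untested and not recommended over the versions above), using vpermq + vpshufd + vpalignr so the max is already broadcast when we compare:

#include <immintrin.h>
static inline int maxpos_u8_all256(__m256i v)
{
    __m256i m = _mm256_max_epu8(v, _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2))); // swap 128-bit halves
    m = _mm256_max_epu8(m, _mm256_shuffle_epi32(m, _MM_SHUFFLE(1,0,3,2)));  // swap 64-bit halves within lanes
    m = _mm256_max_epu8(m, _mm256_shuffle_epi32(m, _MM_SHUFFLE(2,3,0,1)));  // swap 32-bit halves
    m = _mm256_max_epu8(m, _mm256_alignr_epi8(m, m, 2));   // rotate each lane by 2 bytes: pair the u16 halves
    m = _mm256_max_epu8(m, _mm256_alignr_epi8(m, m, 1));   // rotate by 1 byte: max is now in every byte
    unsigned mask = (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(m, v));
    return _tzcnt_u32(mask);                               // index of the first byte equal to the max
}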
Comments from the question thread:

- "… max call (2**3 == 8 ints == 32 byte). You compute for 8-bit ints. That needs 5 shuffles and max calls (2**5 == 32 byte). That's what you see in the assembly. I don't see any particular inefficiencies. Which part do you think is superfluous or inefficient?" – Floodlight

- "phminposuw? It only exists in a 128-bit form, but you could use it and broadcast the result to find its position in the original vector. But it's a min, so you'd need a 1:1 mapping that makes the largest unsigned element into the smallest. I think subtract-from-0 does that." – Relation

- "USHORT_MAX - x should work, right? Something like this: godbolt.org/z/ExnMc9rfc But if I understand OP correctly, he wants a library solution rather than intrinsics. I'm not aware of such a thing and SO doesn't permit that type of question anyway." – Floodlight

- "USHORT_MAX - x works, it's equivalent to x ^ USHORT_MAX too .. apparently GCC knows that" – Bordello