I would do it like this.
    #include <immintrin.h> // AVX2 intrinsics
    #include <stdint.h>

    inline int compareBytes( __m256i a, __m256i b )
    {
        // Compare for both a <= b and a >= b, treating bytes as unsigned
        __m256i min = _mm256_min_epu8( a, b );
        __m256i le = _mm256_cmpeq_epi8( a, min );
        __m256i ge = _mm256_cmpeq_epi8( b, min );
        // Reverse bytes within 16-byte lanes
        const __m128i rev16 = _mm_set_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
        const __m256i rev32 = _mm256_broadcastsi128_si256( rev16 );
        le = _mm256_shuffle_epi8( le, rev32 );
        ge = _mm256_shuffle_epi8( ge, rev32 );
        // Move the masks to scalar registers
        uint32_t lessMask = (uint32_t)_mm256_movemask_epi8( le );
        uint32_t greaterMask = (uint32_t)_mm256_movemask_epi8( ge );
        // Swap the high and low 16-bit halves of the masks, so that byte 0 ends up in bit 31.
        // Apparently, modern compilers are smart enough to emit ROR instructions for that code
        lessMask = ( lessMask >> 16 ) | ( lessMask << 16 );
        greaterMask = ( greaterMask >> 16 ) | ( greaterMask << 16 );
        // Produce the desired result
        if( lessMask > greaterMask )
            return -1;
        else if( lessMask < greaterMask )
            return +1;
        else
            return 0;
    }
This method works because comparing two unsigned integers amounts to finding the most significant bit in which they differ: the integer that has a 1 in that bit is the larger one. Equal bytes set the same bit in both masks, so the masks differ only at positions where the source bytes differ. Because we reversed the order of the bytes being tested, the first byte in the vectors corresponds to the most significant bit in the masks. For this reason, the expression ( lessMask > greaterMask ) evaluates to true when ( a < b ) holds for the first byte that differs between the source vectors.
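To make the reasoning above concrete, here is a minimal scalar sketch of the same mask logic. It is not part of the original method: the name compareBytesScalarModel is hypothetical, and it builds the two masks one bit at a time instead of using AVX2, on the assumption that byte 0 should land in bit 31 exactly as it does after the shuffle and rotate steps. It could serve as a reference when testing the vectorized version.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the mask trick, for illustration only.
// Takes two 32-byte buffers; byte 0 maps to bit 31, byte 31 maps to bit 0.
inline int compareBytesScalarModel( const uint8_t* a, const uint8_t* b )
{
    uint32_t lessMask = 0, greaterMask = 0;
    for( size_t i = 0; i < 32; i++ )
    {
        const uint32_t bit = 1u << ( 31 - i );
        if( a[ i ] <= b[ i ] )
            lessMask |= bit;    // plays the role of the `le` vector
        if( a[ i ] >= b[ i ] )
            greaterMask |= bit; // plays the role of the `ge` vector
    }
    // Equal bytes set the bit in both masks, so the most significant bit
    // where the masks differ belongs to the first byte where a and b differ.
    if( lessMask > greaterMask )
        return -1;
    if( lessMask < greaterMask )
        return +1;
    return 0;
}
```

Differences in later bytes only affect lower bits of the masks, so they cannot override the decision made at the first differing byte.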
strncmp does not need to return -1, 0, 1; it just needs to return <0, 0, >0. – Dun