How to choose AVX compare predicate variants
Asked Answered
A

2

65

In the Advanced Vector Extensions (AVX) the compare instructions like _m256_cmp_ps, the last argument is a compare predicate. The choices for the predicate overwhelm me. They seem to be a tripple of type, ordering, signaling. E.g. _CMP_LE_OS is 'less than or equal, ordered, signaling.

For starters, is there a performance reason for selecting signaling or non signaling, and similarly, is ordered or unordered faster than the other?

And what does 'non signaling' even mean? I can't find this in the docs at all. Any rule of thumb on when to select what?

Here are the predicate choices from avxintrin.h:

/* Compare */
#define _CMP_EQ_OQ    0x00 /* Equal (ordered, non-signaling)  */
#define _CMP_LT_OS    0x01 /* Less-than (ordered, signaling)  */
#define _CMP_LE_OS    0x02 /* Less-than-or-equal (ordered, signaling)  */
#define _CMP_UNORD_Q  0x03 /* Unordered (non-signaling)  */
#define _CMP_NEQ_UQ   0x04 /* Not-equal (unordered, non-signaling)  */
#define _CMP_NLT_US   0x05 /* Not-less-than (unordered, signaling)  */
#define _CMP_NLE_US   0x06 /* Not-less-than-or-equal (unordered, signaling)  */
#define _CMP_ORD_Q    0x07 /* Ordered (nonsignaling)   */
#define _CMP_EQ_UQ    0x08 /* Equal (unordered, non-signaling)  */
#define _CMP_NGE_US   0x09 /* Not-greater-than-or-equal (unord, signaling)  */
#define _CMP_NGT_US   0x0a /* Not-greater-than (unordered, signaling)  */
#define _CMP_FALSE_OQ 0x0b /* False (ordered, non-signaling)  */
#define _CMP_NEQ_OQ   0x0c /* Not-equal (ordered, non-signaling)  */
#define _CMP_GE_OS    0x0d /* Greater-than-or-equal (ordered, signaling)  */
#define _CMP_GT_OS    0x0e /* Greater-than (ordered, signaling)  */
#define _CMP_TRUE_UQ  0x0f /* True (unordered, non-signaling)  */
#define _CMP_EQ_OS    0x10 /* Equal (ordered, signaling)  */
#define _CMP_LT_OQ    0x11 /* Less-than (ordered, non-signaling)  */
#define _CMP_LE_OQ    0x12 /* Less-than-or-equal (ordered, non-signaling)  */
#define _CMP_UNORD_S  0x13 /* Unordered (signaling)  */
#define _CMP_NEQ_US   0x14 /* Not-equal (unordered, signaling)  */
#define _CMP_NLT_UQ   0x15 /* Not-less-than (unordered, non-signaling)  */
#define _CMP_NLE_UQ   0x16 /* Not-less-than-or-equal (unord, non-signaling)  */
#define _CMP_ORD_S    0x17 /* Ordered (signaling)  */
#define _CMP_EQ_US    0x18 /* Equal (unordered, signaling)  */
#define _CMP_NGE_UQ   0x19 /* Not-greater-than-or-equal (unord, non-sign)  */
#define _CMP_NGT_UQ   0x1a /* Not-greater-than (unordered, non-signaling)  */
#define _CMP_FALSE_OS 0x1b /* False (ordered, signaling)  */
#define _CMP_NEQ_OS   0x1c /* Not-equal (ordered, signaling)  */
#define _CMP_GE_OQ    0x1d /* Greater-than-or-equal (ordered, non-signaling)  */
#define _CMP_GT_OQ    0x1e /* Greater-than (ordered, non-signaling)  */
#define _CMP_TRUE_US  0x1f /* True (unordered, signaling)  */
Aurangzeb answered 7/6, 2013 at 15:52 Comment(1)
If you're not going to be encountering NaNs then it really doesn't matter.Foliate
Y
27

When either operand is NaN, ordered vs unordered dictates the result value.

Ordered comparisons returns false for NaN operands.

  • _CMP_EQ_OQ of 1.0 and 1.0 gives true (vanilla equality).
  • _CMP_EQ_OQ of NaN and 1.0 gives false.
  • _CMP_EQ_OQ of 1.0 and NaN gives false.
  • _CMP_EQ_OQ of NaN and NaN gives false.

Unordered comparison returns true for NaN operands.

  • _CMP_EQ_UQ of 1.0 and 1.0 gives true (vanilla equality).
  • _CMP_EQ_UQ of NaN and 1.0 gives true.
  • _CMP_EQ_UQ of 1.0 and NaN gives true.
  • _CMP_EQ_UQ of NaN and NaN gives true.

The difference between signalling vs non-signalling only impacts the value of the MXCSR. To observe the effect, you'd need to clear the MXCSR, perform one or more comparisons, then read from the MXCSR (thanks to Peter Cordes for clarifying this!).

The order of the enum values is pretty confusing. It helps to put them in a table...

comparison ordered (non-signalling) unordered (non-signalling)
a < b _CMP_LT_OQ _CMP_NGE_UQ
a <= b _CMP_LE_OQ _CMP_NGT_UQ
a == b _CMP_EQ_OQ _CMP_EQ_UQ
a != b _CMP_NEQ_OQ _CMP_NEQ_UQ
a >= b _CMP_GE_OQ _CMP_NLT_UQ
a > b _CMP_GT_OQ _CMP_NLE_UQ
true _CMP_ORD_Q _CMP_TRUE_UQ (useless)
false _CMP_FALSE_OQ (useless) _CMP_UNORD_Q

With MXCSR "signaling":

comparison ordered (signalling) unordered (signalling)
a < b _CMP_LT_OS _CMP_NGE_US
a <= b _CMP_LE_OS _CMP_NGT_US
a == b _CMP_EQ_OS _CMP_EQ_US
a != b _CMP_NEQ_OS _CMP_NEQ_US
a >= b _CMP_GE_OS _CMP_NLT_US
a > b _CMP_GT_OS _CMP_NLE_US
true _CMP_ORD_S _CMP_TRUE_US (useless)
false _CMP_FALSE_OS (useless) _CMP_UNORD_S

The order of the enum values can be explained by:

  • The first four ops are canonical (EQ, LT, LE, UNORD). Note that if the 0x00 and 0x03 values were LE/UNORD or UNORD/LE, the four canonical ops could be viewed as a composition of two separate bits, but that's not possible for their actual order.

  • The remaining ops are transformations of the first four.

  • The 0x04 bit precisely inverts the result value, which also effectively also toggles ordered vs unordered. For example, LT_O becomes NLT_U, which is similar to GE, but see the rule for unordered naming.

  • The 0x08 bit toggles ordered vs unordered (without changing anything else).

  • Setting both the 0x04 and 0x08 bits negates the result for numerical operands, while retaining the same ordering behavior for NaN operands. For example, LT_O becomes GE_O.

  • Note that when the comparison is unordered (ie, one of 0x04 or 0x08 is set), a negated name is used: NGE instead of LT, NGT instead of LE, NLT instead of GE, and NLE instead of GT; however both EQ and NEQ need to define both ordered and unordered variants, so those names only change under the 0x04 negating transformation, not the 0x08 orderedness-toggling transformation.

  • FALSE/TRUE are mostly useless 0x08 transformations of UNORD/ORD, always returning the same value. For example, UNORD (0x03) returns false if both operands are numbers, or true if either is NaN; adding 0x08, we get FALSE (0x0b), which has toggled behavior for NaN operands, causing it to return false for both cases.

    Fun fact: the TRUE op wasn't always completely useless. Prior to AVX2 it was the most compact mechanism for setting a YMM register to all 1's. See https://godbolt.org/z/Yb5TjP for details (Thanks Peter Cordes).

  • The 0x10 bit toggles signaling vs not. Note that of the canonical ops, LE and LT are signaling, while EQ and UNORD are not, so setting the 0x10 bit removes signaling from the LE/LT ops and adds it to the EQ/UNORD ops. Because that's obviously sensible and not at all confusing.

Yarn answered 4/10, 2020 at 5:2 Comment(7)
I just commented on the other answer about signalling vs. quiet: FP exceptions are masked by default, so unless you check MXCSR to see if any masked "invalid" exceptions happened since you last cleared it, you won't know. Or unmask that exception. Oh, the question was asking about that, guess I should answer.Horten
Actually my answer on What does ordered / unordered comparison mean? already covers it. Could maybe use an edit for clarity, though.Horten
BTW, _CMP_TRUE_UQ is not 100% useless, only 98% - it's the most compact way to set a YMM register to all-ones with AVX1 but not AVX2 so you don't have vpcmpeqd ymm15, ymm0,ymm0. Some compilers will use it when targeting SandyBridge (-mavx without -mavx2) if you use _mm256_set1_epi8(-1) (which is rare to need for FP vectors, to the point of almost being artificial / unrealistic). It does have a false dependency so of course you'd use AVX2 integer when available. And yes, the false predicate is useless, vxorps xmm15, xmm0,xmm0 is a more efficient way to zero ymm15.Horten
@PeterCordes - that's an interesting factoid! Verified in godbolt.org/z/eW1Ws8. For AVX1, the compiler first does vxorps, possibly to mitigate the false dependency; however, in AVX2, where the compiler uses vpcmpeqd as you predicted, there is no preceding vxorps. Does the CPU have special logic to eliminate this false dependency? (as it does for xor instructions)Yarn
Yes, Agner Fog's microarch guide confirms that pcmpeq* is dep-breaking on all CPUs (except silvermont), even though it does need an execution unit to write the ones (even on Sandybridge-family where xor-zeroing is eliminated). (You can verify this by noting that the throughput is better than 1 even using the same register.) See also Fastest way to set __m256 value to all ONE bitsHorten
@PeterCordes - Thanks! FYI - I'd be curious if you have any thoughts on #66037572Yarn
@DaveDopson Which of them match the IEEE 754 standard? Ordered for everything except a != b, and unordered for a != b because the correct answer for NAN != NAN is true?Hoashis
D
43

Ordered vs Unordered has to do with whether the comparison is true if one of the operands contains a NaN (see What does ordered / unordered comparison mean?). Signaling (S) vs non-signaling (Q for quiet?) will determine whether an exception is raised if an operand contains a NaN.

From a performance perspective, these should all be the same (assuming of course no exceptions are raised). If you want to be alerted when there's a NaN, then you want signaling. As for ordered vs unordered, it all depends on how you want to deal with NaNs.

Dissatisfactory answered 15/7, 2013 at 23:6 Comment(1)
Signalling really only means the FP "invalid" flag will be set even when comparing "quiet" (normal) NaNs. To actually be "alerted", you'd have to have unmasked FP-invalid exceptions in the MXCSR, or check the MXCSR sticky flag to see if any invalid exceptions happened since you last cleared it. The point of Q vs. S is that it lets you compare normal NaNs without treating it like a divide-by-zero or inf - inf, or sqrt(-1). (SNaNs are not naturally occurring; those other invalid operations produce QNaN if exceptions are masked, the default setting.)Horten
Y
27

When either operand is NaN, ordered vs unordered dictates the result value.

Ordered comparisons returns false for NaN operands.

  • _CMP_EQ_OQ of 1.0 and 1.0 gives true (vanilla equality).
  • _CMP_EQ_OQ of NaN and 1.0 gives false.
  • _CMP_EQ_OQ of 1.0 and NaN gives false.
  • _CMP_EQ_OQ of NaN and NaN gives false.

Unordered comparison returns true for NaN operands.

  • _CMP_EQ_UQ of 1.0 and 1.0 gives true (vanilla equality).
  • _CMP_EQ_UQ of NaN and 1.0 gives true.
  • _CMP_EQ_UQ of 1.0 and NaN gives true.
  • _CMP_EQ_UQ of NaN and NaN gives true.

The difference between signalling vs non-signalling only impacts the value of the MXCSR. To observe the effect, you'd need to clear the MXCSR, perform one or more comparisons, then read from the MXCSR (thanks to Peter Cordes for clarifying this!).

The order of the enum values is pretty confusing. It helps to put them in a table...

comparison ordered (non-signalling) unordered (non-signalling)
a < b _CMP_LT_OQ _CMP_NGE_UQ
a <= b _CMP_LE_OQ _CMP_NGT_UQ
a == b _CMP_EQ_OQ _CMP_EQ_UQ
a != b _CMP_NEQ_OQ _CMP_NEQ_UQ
a >= b _CMP_GE_OQ _CMP_NLT_UQ
a > b _CMP_GT_OQ _CMP_NLE_UQ
true _CMP_ORD_Q _CMP_TRUE_UQ (useless)
false _CMP_FALSE_OQ (useless) _CMP_UNORD_Q

With MXCSR "signaling":

comparison ordered (signalling) unordered (signalling)
a < b _CMP_LT_OS _CMP_NGE_US
a <= b _CMP_LE_OS _CMP_NGT_US
a == b _CMP_EQ_OS _CMP_EQ_US
a != b _CMP_NEQ_OS _CMP_NEQ_US
a >= b _CMP_GE_OS _CMP_NLT_US
a > b _CMP_GT_OS _CMP_NLE_US
true _CMP_ORD_S _CMP_TRUE_US (useless)
false _CMP_FALSE_OS (useless) _CMP_UNORD_S

The order of the enum values can be explained by:

  • The first four ops are canonical (EQ, LT, LE, UNORD). Note that if the 0x00 and 0x03 values were LE/UNORD or UNORD/LE, the four canonical ops could be viewed as a composition of two separate bits, but that's not possible for their actual order.

  • The remaining ops are transformations of the first four.

  • The 0x04 bit precisely inverts the result value, which also effectively also toggles ordered vs unordered. For example, LT_O becomes NLT_U, which is similar to GE, but see the rule for unordered naming.

  • The 0x08 bit toggles ordered vs unordered (without changing anything else).

  • Setting both the 0x04 and 0x08 bits negates the result for numerical operands, while retaining the same ordering behavior for NaN operands. For example, LT_O becomes GE_O.

  • Note that when the comparison is unordered (ie, one of 0x04 or 0x08 is set), a negated name is used: NGE instead of LT, NGT instead of LE, NLT instead of GE, and NLE instead of GT; however both EQ and NEQ need to define both ordered and unordered variants, so those names only change under the 0x04 negating transformation, not the 0x08 orderedness-toggling transformation.

  • FALSE/TRUE are mostly useless 0x08 transformations of UNORD/ORD, always returning the same value. For example, UNORD (0x03) returns false if both operands are numbers, or true if either is NaN; adding 0x08, we get FALSE (0x0b), which has toggled behavior for NaN operands, causing it to return false for both cases.

    Fun fact: the TRUE op wasn't always completely useless. Prior to AVX2 it was the most compact mechanism for setting a YMM register to all 1's. See https://godbolt.org/z/Yb5TjP for details (Thanks Peter Cordes).

  • The 0x10 bit toggles signaling vs not. Note that of the canonical ops, LE and LT are signaling, while EQ and UNORD are not, so setting the 0x10 bit removes signaling from the LE/LT ops and adds it to the EQ/UNORD ops. Because that's obviously sensible and not at all confusing.

Yarn answered 4/10, 2020 at 5:2 Comment(7)
I just commented on the other answer about signalling vs. quiet: FP exceptions are masked by default, so unless you check MXCSR to see if any masked "invalid" exceptions happened since you last cleared it, you won't know. Or unmask that exception. Oh, the question was asking about that, guess I should answer.Horten
Actually my answer on What does ordered / unordered comparison mean? already covers it. Could maybe use an edit for clarity, though.Horten
BTW, _CMP_TRUE_UQ is not 100% useless, only 98% - it's the most compact way to set a YMM register to all-ones with AVX1 but not AVX2 so you don't have vpcmpeqd ymm15, ymm0,ymm0. Some compilers will use it when targeting SandyBridge (-mavx without -mavx2) if you use _mm256_set1_epi8(-1) (which is rare to need for FP vectors, to the point of almost being artificial / unrealistic). It does have a false dependency so of course you'd use AVX2 integer when available. And yes, the false predicate is useless, vxorps xmm15, xmm0,xmm0 is a more efficient way to zero ymm15.Horten
@PeterCordes - that's an interesting factoid! Verified in godbolt.org/z/eW1Ws8. For AVX1, the compiler first does vxorps, possibly to mitigate the false dependency; however, in AVX2, where the compiler uses vpcmpeqd as you predicted, there is no preceding vxorps. Does the CPU have special logic to eliminate this false dependency? (as it does for xor instructions)Yarn
Yes, Agner Fog's microarch guide confirms that pcmpeq* is dep-breaking on all CPUs (except silvermont), even though it does need an execution unit to write the ones (even on Sandybridge-family where xor-zeroing is eliminated). (You can verify this by noting that the throughput is better than 1 even using the same register.) See also Fastest way to set __m256 value to all ONE bitsHorten
@PeterCordes - Thanks! FYI - I'd be curious if you have any thoughts on #66037572Yarn
@DaveDopson Which of them match the IEEE 754 standard? Ordered for everything except a != b, and unordered for a != b because the correct answer for NAN != NAN is true?Hoashis

© 2022 - 2024 — McMap. All rights reserved.