What's the difference between logical SSE intrinsics?
Is there any difference between the logical SSE intrinsics for different types? For example, if we take the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128, all of which do the same thing: compute the bitwise OR of their operands. My questions:

  1. Is there any difference between using one intrinsic or another (with appropriate type casting, e.g. as in the sketch below)? Could there be hidden costs like longer execution in some specific situation?

  2. These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any idea why Intel is wasting precious opcode space on several instructions that do the same thing?

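A minimal sketch of the casting in question (hypothetical helper, assuming SSE2; the _mm_cast* intrinsics are pure reinterpretations and generate no instructions):

#include <emmintrin.h>  /* SSE2: _mm_or_pd, _mm_or_si128 and the cast intrinsics */

/* All three intrinsics compute the same bitwise OR; only the nominal
   element type differs. */
static inline __m128 or_three_ways(__m128 a, __m128 b)
{
    __m128  r_ps = _mm_or_ps(a, b);
    __m128d r_pd = _mm_or_pd(_mm_castps_pd(a), _mm_castps_pd(b));
    __m128i r_si = _mm_or_si128(_mm_castps_si128(a), _mm_castps_si128(b));
    (void)r_pd; (void)r_si;  /* r_ps, r_pd and r_si hold identical bit patterns */
    return r_ps;
}
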
Proudhon answered 10/5, 2010 at 17:32 Comment(1)
(earlier answer deleted on account of being doofishly wrong -- my fault for being too used to VMX) – Twister

I think all three are effectively the same, i.e. 128-bit bitwise operations. The reason different forms exist is probably historical, but I'm not certain. I guess it's possible that there may be some additional behaviour in the floating-point versions, e.g. when there are NaNs, but this is pure guesswork. For normal inputs the instructions seem to be interchangeable, e.g.

#include <stdio.h>
#include <emmintrin.h>
#include <pmmintrin.h>
#include <xmmintrin.h>

int main(void)
{
    __m128i a = _mm_set1_epi32(1);
    __m128i b = _mm_set1_epi32(2);
    __m128i c = _mm_or_si128(a, b);

    __m128 x = _mm_set1_ps(1.25f);
    __m128 y = _mm_set1_ps(1.5f);
    __m128 z = _mm_or_ps(x, y);
        
    /* note: the %vld / %vf vector conversions are an Apple GCC / libc
       extension; on other platforms, print the elements individually */
    printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
    printf("x = %vf, y = %vf, z = %vf\n", x, y, z);

    /* reinterpret between vector types with the cast intrinsics (free,
       no instructions generated) rather than GCC-style C casts */
    c = _mm_castps_si128(_mm_or_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b)));
    z = _mm_castsi128_ps(_mm_or_si128(_mm_castps_si128(x), _mm_castps_si128(y)));

    printf("a = %vld, b = %vld, c = %vld\n", a, b, c);
    printf("x = %vf, y = %vf, z = %vf\n", x, y, z);
    
    return 0;
}

Terminal:

$ gcc -Wall -msse3 por.c -o por
$ ./por

a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
a = 1 1 1 1, b = 2 2 2 2, c = 3 3 3 3
x = 1.250000 1.250000 1.250000 1.250000, y = 1.500000 1.500000 1.500000 1.500000, z = 1.750000 1.750000 1.750000 1.750000
Finery answered 10/5, 2010 at 18:42 Comment(5)
ORPD/ORPS are SSE-only, not MMX. – Donnadonnamarie
But Intel introduced orps, and later orpd, both after por. And the physical basis of SSE has never changed much. – Donnadonnamarie
The physical basis of SSE has changed a lot, particularly since Woodcrest, when it finally became a full 128-bit unit. However, that's probably irrelevant -- it sounds like I may be wrong about why there are separate bitwise OR instructions. I thought it was a legacy thing to do with switching context between integer and floating-point SSE operations in the old days, but perhaps not. – Finery
re: the speculation in the first paragraph: all versions of the bitwise logical ops are exactly identical except for instruction size and performance. Creating a NaN with bitwise FP ops won't do anything special. IDK whether performance (data forwarding with FP domain vs. vector-int domain) or programmer-friendliness / insn set orthogonality (not having to use int ops on FP data) was the bigger motivating factor. I should write up an answer, since I have read some stuff that nobody's mentioned... – Petta
Randomly interchanging them is generally best avoided because of data bypass delay. Which instructions actually cost an extra cycle is very instruction / microarch dependent, e.g. on Nehalem there is a 1c bypass delay on shufps / pshufd, but on Haswell there is none. As a general rule, if an equally performant instruction exists for the same data type as the surrounding ones, use that. – Thermostat
  1. Is there any difference between using one intrinsic or another (with appropriate type casting)? Could there be hidden costs like longer execution in some specific situation?

Yes, there can be performance reasons to choose one vs. the other.

1: Sometimes there is an extra cycle or two of latency (forwarding delay) if the output of an integer execution unit needs to be routed to the input of an FP execution unit, or vice versa. It takes a LOT of wires to move 128b of data to any of many possible destinations, so CPU designers have to make tradeoffs, like only having a direct path from every FP output to every FP input, not to ALL possible inputs.

See this answer, or Agner Fog's microarchitecture doc for bypass-delays. Search for "Data bypass delays on Nehalem" in Agner's doc; it has some good practical examples and discussion. He has a section on it for every microarch he has analysed.

"However, the delays for passing data between the different domains or different types of registers are smaller on the Sandy Bridge and Ivy Bridge than on the Nehalem, and often zero." -- Agner Fog's microarch doc

Remember that latency doesn't matter if it isn't on the critical path of your code (except sometimes on Haswell/Skylake where it infects later use of the produced value, long after actual bypass :/). Using pshufd instead of movaps + shufps can be a win if uop throughput is your bottleneck, rather than latency of your critical path.
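
As a concrete sketch of that trade-off (hypothetical helpers, not from the original answer): a horizontal max written once with the FP shuffle, which without AVX costs movaps + shufps because shufps is destructive and its input must be preserved, and once with the integer shuffle (pshufd, one uop, but integer domain):

#include <emmintrin.h>

/* FP-domain version: each _mm_shuffle_ps(v, v, ...) where v stays live
   typically compiles to movaps + shufps without AVX (2 uops). */
static inline float hmax_ps(__m128 v)
{
    __m128 t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2));
    v = _mm_max_ps(v, t);
    t = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    v = _mm_max_ps(v, t);
    return _mm_cvtss_f32(v);
}

/* Int-domain version: pshufd writes a separate destination (1 uop), but
   may incur a bypass delay feeding the FP max instructions. */
static inline float hmax_ps_intshuf(__m128 v)
{
    __m128 t = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v),
                                                  _MM_SHUFFLE(1, 0, 3, 2)));
    v = _mm_max_ps(v, t);
    t = _mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(v),
                                           _MM_SHUFFLE(2, 3, 0, 1)));
    v = _mm_max_ps(v, t);
    return _mm_cvtss_f32(v);
}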

2: The ...ps version takes 1 fewer byte of code than the other two for legacy-SSE encoding. (Not AVX). This will align the following instructions differently, which can matter for the decoders and/or uop cache lines. Generally smaller is better for better code density in I-cache and fetching code from RAM, and packing into the uop cache.

3: Recent Intel CPUs can only run the FP versions on port5.

  • Merom (Core2) and Penryn: orps can run on p0/p1/p5, but integer-domain only. Presumably all 3 versions decoded into the exact same uop. So the cross-domain forwarding delay happens. (AMD CPUs do this too: FP bitwise instructions run in the ivec domain.)

  • Nehalem / Sandybridge / IvB / Haswell / Broadwell: por can run on p0/p1/p5, but orps can run only on port5. p5 is also needed by shuffles, but the FMA, FP add, and FP mul units are on ports 0/1.

  • Skylake: por and orps both have 3-per-cycle throughput. Intel's optimization manual has some info about bypass forwarding delays: to/from FP instructions it depends on which port the uop ran on. (Usually still port 5 because the FP add/mul/fma units are on ports 0 and 1.) See also Haswell AVX/FMA latencies tested 1 cycle slower than Intel's guide says - "bypass" latency can affect every use of the register until it's overwritten.

Note that on SnB/IvB (AVX but not AVX2), only p5 needs to handle 256b logical ops, as vpor ymm, ymm requires AVX2. This was probably not the reason for the change, since Nehalem did this.

How to choose wisely:

Keep in mind that compilers can use por for _mm_or_pd if they want, so some of this applies mostly to hand-written asm. But some compilers are somewhat faithful to the intrinsics you choose.

If logical op throughput on port5 could be a bottleneck, then use the integer versions, even on FP data. This is especially true if you want to use integer shuffles or other data-movement instructions.
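
For instance, a sketch (function name mine) of an FP absolute value done with the integer AND, so the logical op can run on p0/p1/p5 rather than p5 only:

#include <emmintrin.h>

/* pand runs on p0/p1/p5 on Nehalem..Broadwell, while andps is p5-only,
   leaving port 5 free for shuffles. */
static inline __m128 fabs_ps_int_domain(__m128 v)
{
    const __m128i mask = _mm_set1_epi32(0x7FFFFFFF);  /* clear the sign bits */
    return _mm_castsi128_ps(_mm_and_si128(_mm_castps_si128(v), mask));
}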

AMD CPUs always use the integer domain for logicals, so if you have multiple integer-domain things to do, do them all at once to minimize round-trips between domains. Shorter latencies will get things cleared out of the reorder buffer faster, even if a dep chain isn't the bottleneck for your code.
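
A sketch of that batching idea (helper name and the m1/m2 masks are hypothetical):

#include <emmintrin.h>

/* Group the integer-domain work so the data crosses domains once,
   not once per operation. */
static inline __m128 masked_then_add(__m128 x, __m128 y, __m128i m1, __m128i m2)
{
    __m128i ix = _mm_castps_si128(x);
    ix = _mm_and_si128(ix, m1);                    /* stays in the int domain */
    ix = _mm_or_si128(ix, m2);                     /* stays in the int domain */
    return _mm_add_ps(_mm_castsi128_ps(ix), y);    /* one int -> FP crossing */
}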

If you just want to set/clear/flip a bit in FP vectors between FP add and mul instructions, use the ...ps logicals, even on double-precision data, because single and double FP are the same domain on every CPU in existence, and the ...ps versions are one byte shorter (without AVX).
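
A sketch of that pattern with intrinsics (helper name mine); note how much cast noise it takes in C, which is the human-factors point made next:

#include <emmintrin.h>

/* Flip the sign of a vector of doubles between FP ops, using the
   single-precision XOR: same bits, same FP domain, one byte shorter
   than xorpd in legacy-SSE encoding. */
static inline __m128d negate_pd(__m128d v)
{
    const __m128d signbit = _mm_set1_pd(-0.0);  /* 0x8000000000000000 per lane */
    return _mm_castps_pd(_mm_xor_ps(_mm_castpd_ps(v), _mm_castpd_ps(signbit)));
}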

There are practical / human-factor reasons for using the ...pd versions, though, with intrinsics. Readability of your code by other humans is a factor: They'll wonder why you're treating your data as singles when it's actually doubles. For C/C++ intrinsics, littering your code with casts between __m128 and __m128d is not worth it. (And hopefully a compiler will use orps for _mm_or_pd anyway, if compiling without AVX where it will actually save a byte.)

If tuning at the level of instruction alignment matters, write in asm directly, not intrinsics! (Having the instruction one byte longer might align things better for uop cache-line density and/or the decoders, but prefixes and addressing modes let you extend instructions in general.)

For integer data, use the integer versions. Saving one instruction byte isn't worth the bypass delay between paddd (or whatever integer ops surround the logical) and an FP-domain orps, and integer code often keeps port5 fully occupied with shuffles. For Haswell, many shuffle / insert / extract / pack / unpack instructions became p5-only, instead of p1/p5 as on SnB/IvB. (Ice Lake finally added a shuffle unit on another port for some of the more common shuffles.)

  2. These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any idea why Intel is wasting precious opcode space on several instructions that do the same thing?

If you look at the history of these instruction sets, you can kind of see how we got here.

por  (MMX):     0F EB /r
orps (SSE):     0F 56 /r
orpd (SSE2): 66 0F 56 /r
por  (SSE2): 66 0F EB /r

MMX existed before SSE, so it looks like opcodes for SSE (...ps) instructions were chosen out of the same 0F xx space. Then for SSE2, the ...pd version added a 66 operand-size prefix to the ...ps opcode, and the integer version added a 66 prefix to the MMX version.

They could have left out orpd and/or por, but they didn't. Perhaps they thought that future CPU designs might have longer forwarding paths between different domains, and so using the matching instruction for your data would be a bigger deal. Even though there are separate opcodes, AMD and early Intel treated them all the same, as int-vector.



Petta answered 5/7, 2015 at 17:22 Comment(0)

According to Intel and AMD optimization guidelines, mixing op types with data types (e.g. using an integer logical op on floating-point data) produces a performance hit, as the CPU internally tags the 64-bit halves of the register for a particular data type. This seems to mostly affect pipelining, as the instruction is decoded and the uops are scheduled. Functionally they produce the same result. The newer versions for the integer data types have larger encodings and take up more space in the code segment. So if code size is a problem, use the old ops, as these have smaller encodings.
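
A hedged sketch of the kind of mixing meant here (helper name assumed):

#include <emmintrin.h>

/* The FP add consumes the result of an integer-domain OR; on some
   microarchitectures this inserts an extra cycle or two of bypass
   latency at the domain crossing. */
static inline __m128 mixed_domain(__m128 x, __m128 y, __m128 z)
{
    __m128i t = _mm_or_si128(_mm_castps_si128(x), _mm_castps_si128(y));
    return _mm_add_ps(_mm_castsi128_ps(t), z);   /* int -> FP crossing here */
}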

Caddaric answered 20/8, 2010 at 19:36 Comment(2)
"mixing op types with data types produces a performance hit..." Can you explain that futher or on give me the references on that, thanks.Giuseppinagiustina
@Giuseppinagiustina its due to Data Bypass Delay.Thermostat