Intel documents __m256 _mm256_set_m128(__m128 hi, __m128 lo) and _mm256_setr_m128(lo, hi) as intrinsics for the vinsertf128 instruction, which is what you want.¹ (Of course there are also __m256d and __m256i versions, which use the same instruction. The __m256i version may use vinserti128 if AVX2 is available, otherwise it will use the f128 version as well.)
These days, those intrinsics are supported by current versions of all 4 major x86 compilers (gcc, clang, MSVC, and ICC), but not by older versions; like some other helper intrinsics that Intel documents, widespread implementation has been slow. (Often GCC or clang is the last hold-out to not have something you wish you could use portably.)
Use it if you don't need portability to old GCC versions: it's the most readable way to express what you want, following the well-known _mm_set and _mm_setr patterns.
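As a quick illustration of that argument-order convention (the function names here are just for the sketch): _mm256_set_m128 takes the high half first, the way _mm_set_ps lists elements from high to low, while _mm256_setr_m128 takes them in "reversed" low-first order, so both functions below build the same vector.

#include <immintrin.h>

// Both return a __m256 whose low 128 bits are lo and whose high 128 bits are hi.
__m256 via_set (__m128 hi, __m128 lo) { return _mm256_set_m128 (hi, lo); }
__m256 via_setr(__m128 hi, __m128 lo) { return _mm256_setr_m128(lo, hi); }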
Performance-wise, it's of course just as efficient as manual cast + vinsertf128 intrinsics (@Mysticial's answer), and for gcc at least that's literally how the internal .h actually implements _mm256_set_m128.
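If you do need to support compilers that predate the helper, here's a minimal fallback sketch in the same spirit (the wrapper name is mine; it shows the cast + insert approach rather than quoting any particular compiler's header):

#include <immintrin.h>

// Fallback sketch: the cast is free (no instruction emitted); its upper
// 128 bits are undefined, but vinsertf128 overwrites them with hi anyway.
static inline __m256 set_m128_fallback(__m128 hi, __m128 lo) {
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
}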
Compiler version support for _mm256_set_m128 / _mm256_setr_m128:
- clang: 3.6 and newer. (Mainline, IDK about Apple)
- GCC: 8.x and newer, not present as recently as GCC7!
- ICC: since at least ICC13, the earliest on Godbolt.
- MSVC: since at least 19.14, and 19.10 (WINE); VS2015 is the earliest version on Godbolt.
https://godbolt.org/z/1na1qr has test cases for all 4 compilers.
__m256 combine_testcase(__m128 hi, __m128 lo) {
return _mm256_set_m128(hi, lo);
}
They all compile this function to a single vinsertf128, except MSVC, where even the latest version wastes a vmovups xmm2, xmm1 copying a register. (I used -O2 -Gv -arch:AVX to get the vectorcall convention, so the args would be in registers and an efficient non-inlined function definition would be possible for MSVC.) Presumably MSVC would be fine inlining into a larger function where it could write the result to a 3rd register, instead of the calling convention forcing it to read xmm0 and write ymm0.
Footnote 1: vinsertf128 is very efficient on Zen 1, and as efficient as vperm2f128 on other CPUs with 256-bit-wide shuffle units. It can also take the high half from memory, in case the compiler spilled it or is folding a _mm_loadu_ps into it, instead of needing to do a separate 128-bit load into a register; vperm2f128's memory operand would be a 256-bit load, which you don't want.
https://uops.info/ / https://agner.org/optimize/
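To illustrate that memory-operand point, a small sketch (the function name and pointer argument are mine, not from anything above): passing a fresh _mm_loadu_ps result as the hi argument lets the compiler fold that 128-bit load into vinsertf128's memory operand instead of emitting a separate vmovups.

#include <immintrin.h>

// Sketch: low half already in a register, high half straight from memory.
// A compiler can fold the _mm_loadu_ps into vinsertf128 ymm, ymm, [mem], 1.
__m256 combine_with_memory_hi(__m128 lo, const float *hi_ptr) {
    return _mm256_set_m128(_mm_loadu_ps(hi_ptr), lo);
}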