How to combine two __m128 values to __m256?

5

23

I would like to combine two __m128 values to one __m256.

Something like this:

__m128 a = _mm_set_ps(1, 2, 3, 4);
__m128 b = _mm_set_ps(5, 6, 7, 8);

to something like:

__m256 c = { 1, 2, 3, 4, 5, 6, 7, 8 };

Are there any intrinsics that I can use to do this?

Godgiven answered 20/6, 2012 at 9:40 Comment(0)
29

This should do what you want:

__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);

__m256 c = _mm256_castps128_ps256(a);
c = _mm256_insertf128_ps(c,b,1);

If the order is reversed from what you want, then just switch a and b.


The intrinsic of interest is _mm256_insertf128_ps, which lets you insert a 128-bit register into either the lower or upper half of a 256-bit AVX register:

http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_avx_insertf128_ps.htm
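A minimal sketch of the cast-plus-insert approach, stored back to memory so the element order is visible. Keep in mind that _mm_set_ps takes its arguments from the highest element down to the lowest, while _mm_setr_ps takes them in memory order. The helper names combine_set and combine_setr and the AVX_TARGET macro are hypothetical; the macro is only an assumed GCC/clang convenience so the file builds without -mavx.

```c
#include <immintrin.h>

/* Assumption: GCC/clang target attribute, so no -mavx flag is needed.
   With MSVC, build with /arch:AVX instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: same code as the answer, using _mm_set_ps. */
AVX_TARGET
void combine_set(float out[8])
{
    __m128 a = _mm_set_ps(1, 2, 3, 4);      /* element 0 is 4, element 3 is 1 */
    __m128 b = _mm_set_ps(5, 6, 7, 8);      /* element 0 is 8, element 3 is 5 */
    __m256 c = _mm256_castps128_ps256(a);   /* a is the low half, upper half undefined */
    c = _mm256_insertf128_ps(c, b, 1);      /* b overwrites the upper half */
    _mm256_storeu_ps(out, c);               /* out = 4 3 2 1 8 7 6 5 */
}

/* Hypothetical helper: _mm_setr_ps gives memory order 1..8. */
AVX_TARGET
void combine_setr(float out[8])
{
    __m128 a = _mm_setr_ps(1, 2, 3, 4);
    __m128 b = _mm_setr_ps(5, 6, 7, 8);
    __m256 c = _mm256_castps128_ps256(a);
    c = _mm256_insertf128_ps(c, b, 1);
    _mm256_storeu_ps(out, c);               /* out = 1 2 3 4 5 6 7 8 */
}
```

Swapping a and b only exchanges the two 128-bit halves; it does not change the order of the elements inside each half.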

The complete family of them is here:

Pepsinate answered 20/6, 2012 at 9:54 Comment(3)
Some versions of Visual Studio (definitely 2010, possibly some later ones too) have a bug in their handling of _mm256_castps128_ps256, so this code is likely to crash on them. See connect.microsoft.com/VisualStudio/feedback/details/653771/…. If you need your code to work on those compilers, you'll need to use the solution provided by user1584773 that replaces it with an insert. - Liggins
Note that this results in __m256{ 4, 3, 2, 1, 8, 7, 6, 5 } instead of __m256{ 1, 2, 3, 4, 5, 6, 7, 8 }. I think the OP wanted to use _mm_setr_ps instead of _mm_set_ps. - Pyroxenite
If you're "inserting" into the lower half, it's usually better to use _mm256_blend_ps instead of _mm256_insertf128_ps: it has lower latency and runs on more ports. The only case where vinsertf128 could be better than vblendps ymm, ymm, imm8 is with a memory source, replacing the low lane of a vector with only a 16-byte load, not a 32-byte load. - Staggs
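As a sketch of the blend suggestion in the last comment, replacing the low half of an existing 256-bit vector could look like this. The function name replace_low_half and the AVX_TARGET macro are hypothetical, and the macro is an assumed GCC/clang convenience.

```c
#include <immintrin.h>

/* Assumption: GCC/clang target attribute, so no -mavx flag is needed.
   With MSVC, build with /arch:AVX instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: out = { lo[0..3], v[4..7] }. */
AVX_TARGET
void replace_low_half(const float v[8], const float lo[4], float out[8])
{
    __m256 vec = _mm256_loadu_ps(v);
    /* The cast leaves the upper half undefined; the blend mask never selects it. */
    __m256 low = _mm256_castps128_ps256(_mm_loadu_ps(lo));
    /* Bit i of the immediate set: take element i from the second operand.
       0x0F selects elements 0..3 from low and 4..7 from vec. */
    __m256 r = _mm256_blend_ps(vec, low, 0x0F);
    _mm256_storeu_ps(out, r);
}
```

With v = {10, 11, 12, 13, 14, 15, 16, 17} and lo = {1, 2, 3, 4}, out receives {1, 2, 3, 4, 14, 15, 16, 17}.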
9

Intel documents __m256 _mm256_set_m128(__m128 hi, __m128 lo) and _mm256_setr_m128(lo, hi) as intrinsics for the vinsertf128 instruction, which is what you want (see Footnote 1). (Of course there are also __m256d and __m256i versions, which use the same instruction. The __m256i version may use vinserti128 if AVX2 is available, otherwise it'll use f128 as well.)
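For the double and integer types mentioned above, the same combination can be sketched with the plain AVX cast and insert intrinsics. This version deliberately avoids the _mm256_set_m128d / _mm256_set_m128i helpers so that it does not depend on compiler support for them. The function names and the AVX_TARGET macro are hypothetical.

```c
#include <immintrin.h>

/* Assumption: GCC/clang target attribute, so no -mavx flag is needed.
   With MSVC, build with /arch:AVX instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: out = { lo[0], lo[1], hi[0], hi[1] }. */
AVX_TARGET
void combine_pd(const double lo[2], const double hi[2], double out[4])
{
    __m256d v = _mm256_castpd128_pd256(_mm_loadu_pd(lo));
    v = _mm256_insertf128_pd(v, _mm_loadu_pd(hi), 1);
    _mm256_storeu_pd(out, v);
}

/* Hypothetical helper: out = { lo[0..3], hi[0..3] } for 32-bit integers. */
AVX_TARGET
void combine_epi32(const int lo[4], const int hi[4], int out[8])
{
    __m256i v = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)lo));
    v = _mm256_insertf128_si256(v, _mm_loadu_si128((const __m128i *)hi), 1);
    _mm256_storeu_si256((__m256i *)out, v);
}
```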

These days, those intrinsics are supported by current versions of all 4 major x86 compilers (gcc, clang, MSVC, and ICC). But not by older versions; like some other helper intrinsics that Intel documents, widespread implementation has been slow. (Often GCC or clang are the last hold-out to not have something you wish you could use portably.)

Use it if you don't need portability to old GCC versions: it's the most readable way to express what you want, following the well known _mm_set and _mm_setr patterns.

Performance-wise, it's of course just as efficient as manual cast + vinsertf128 intrinsics (@Mysticial's answer), and for gcc at least that's literally how the internal .h actually implements _mm256_set_m128.

Compiler version support for _mm256_set_m128 / _mm256_setr_m128:

  • clang: 3.6 and newer. (Mainline, IDK about Apple)
  • GCC: 8.x and newer, not present as recently as GCC7!
  • ICC: since at least ICC13, the earliest on Godbolt.
  • MSVC: since at least 19.14 and 19.10 (WINE) VS2015, the earliest on Godbolt.

https://godbolt.org/z/1na1qr has test cases for all 4 compilers.

__m256 combine_testcase(__m128 hi, __m128 lo) {
    return _mm256_set_m128(hi, lo);
}

They all compile this function to one vinsertf128, except MSVC where even the latest version wastes a vmovups xmm2, xmm1 copying a register. (I used -O2 -Gv -arch:AVX to use the vectorcall convention so args would be in registers to make an efficient non-inlined function definition possible for MSVC.) Presumably MSVC would be ok inlining into a larger function if it could write the result to a 3rd register, instead of the calling convention forcing it to read xmm0 and write ymm0.
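If you do need to build with an older GCC, one possible approach is a small wrapper that falls back to cast plus insert. This is only a sketch: the names combine_m128, HAVE_MM256_SET_M128 and AVX_TARGET are hypothetical, and the version check is an assumption based on the support list above (it only tests for GCC older than 8, not for old clang).

```c
#include <immintrin.h>

/* Assumption: GCC/clang target attribute, so no -mavx flag is needed.
   With MSVC, build with /arch:AVX instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Assumption: only real GCC older than 8 lacks _mm256_set_m128.
   clang and ICC also define __GNUC__, so they are excluded explicitly. */
#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER) && __GNUC__ < 8
#define HAVE_MM256_SET_M128 0
#else
#define HAVE_MM256_SET_M128 1
#endif

/* Hypothetical wrapper with the same argument order as _mm256_set_m128. */
AVX_TARGET
static inline __m256 combine_m128(__m128 hi, __m128 lo)
{
#if HAVE_MM256_SET_M128
    return _mm256_set_m128(hi, lo);
#else
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
#endif
}

/* Pointer-based entry point: out = { lo[0..3], hi[0..3] }. */
AVX_TARGET
void combine_to_array(const float lo[4], const float hi[4], float out[8])
{
    _mm256_storeu_ps(out, combine_m128(_mm_loadu_ps(hi), _mm_loadu_ps(lo)));
}
```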


Footnote 1:
vinsertf128 is very efficient on Zen1, and as efficient as vperm2f128 on other CPUs with 256-bit-wide shuffle units. It can also take the high half from memory in case the compiler spilled it or is folding a _mm_loadu_ps into it, instead of needing to separately do a 128-bit load into a register; vperm2f128's memory operand would be a 256-bit load which you don't want.
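A sketch of the memory-source case described in this footnote: both halves come from separate 16-byte locations, and the compiler is free (but not required) to fold the second load into vinsertf128's memory operand. The function name combine_from_memory and the AVX_TARGET macro are hypothetical.

```c
#include <immintrin.h>

/* Assumption: GCC/clang target attribute, so no -mavx flag is needed.
   With MSVC, build with /arch:AVX instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: builds one 256-bit vector from two unrelated
   16-byte chunks without ever doing a 32-byte load. */
AVX_TARGET
void combine_from_memory(const float *lo_ptr, const float *hi_ptr, float out[8])
{
    __m256 v = _mm256_castps128_ps256(_mm_loadu_ps(lo_ptr)); /* 16-byte load */
    /* This 16-byte load may be folded into the insert instruction. */
    v = _mm256_insertf128_ps(v, _mm_loadu_ps(hi_ptr), 1);
    _mm256_storeu_ps(out, v);
}
```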

https://uops.info/ / https://agner.org/optimize/

Staggs answered 20/12, 2020 at 4:36 Comment(0)
2

Even this one will work:

__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);

__m256 c = _mm256_insertf128_ps(c,a,0);
c = _mm256_insertf128_ps(c,b,1);

You will get a warning because c is not initialized, but you can ignore it. If you're looking for performance, this solution will use fewer clock cycles than the other one.

We answered 11/8, 2012 at 1:11 Comment(2)
Are you sure that this is faster than the solution proposed by Mysticial? As far as I know castps128_ps256 is free, isn't it? Moreover, my application greatly benefits from using cast instead of insert (same goes for extract). - Verne
@user1829358: The low insert will hopefully optimize away, but there is no need to make your compiler work to remove stuff that didn't need to be there. (It also has undefined behaviour by reading the not-yet-initialized c, so I would seriously recommend against this.) Yes, cast is clearly better; cast is free in asm and you only need 1 vinsertf128 instruction. - Staggs
2

You can also use the permute intrinsic:

__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_permute2f128_ps(_mm256_castps128_ps256(a), _mm256_castps128_ps256(b), 0x20);

I don't know which way is faster.

Kriskrischer answered 21/5, 2015 at 22:15 Comment(1)
If it actually compiles to a vperm2f128, it will be slower on Zen1, and have no advantages on Intel vs. vinsertf128. - Staggs
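For reference, a sketch of how the 0x20 immediate in the permute version is decoded. Bits [1:0] pick the source of the result's low half and bits [5:4] pick the source of its high half, where 0 and 1 mean the low and high half of the first operand and 2 and 3 mean the low and high half of the second. The function name combine_permute and the AVX_TARGET macro are hypothetical.

```c
#include <immintrin.h>

/* Assumption: GCC/clang target attribute, so no -mavx flag is needed.
   With MSVC, build with /arch:AVX instead. */
#if defined(__GNUC__)
#define AVX_TARGET __attribute__((target("avx")))
#else
#define AVX_TARGET
#endif

/* Hypothetical helper: out = { lo[0..3], hi[0..3] } via vperm2f128. */
AVX_TARGET
void combine_permute(const float lo[4], const float hi[4], float out[8])
{
    __m256 a = _mm256_castps128_ps256(_mm_loadu_ps(lo));
    __m256 b = _mm256_castps128_ps256(_mm_loadu_ps(hi));
    /* imm8 = 0x20:
       bits [1:0] = 0 -> result low half  = low half of a
       bits [5:4] = 2 -> result high half = low half of b
       The undefined upper halves of a and b are never selected. */
    __m256 c = _mm256_permute2f128_ps(a, b, 0x20);
    _mm256_storeu_ps(out, c);
}
```

The result matches the cast-plus-insert version; per the comment above, there is no reason to expect it to be faster.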
0

I believe this is the simplest:

#define _mm256_set_m128(/* __m128 */ hi, /* __m128 */ lo) \
    _mm256_insertf128_ps(_mm256_castps128_ps256(lo), (hi), 0x1)

__m256 c = _mm256_set_m128(a, b);

Do note that _mm256_set_m128 is already defined in MSVC 2019 if you #include "immintrin.h".

Behl answered 20/12, 2020 at 3:43 Comment(2)
Intel documents _mm256_set_m128(__m128 hi, __m128 lo) - you should just use it, not define it yourself. - Staggs
Correct, it should be defined already, but if you are using an older version of MSVC it may not be defined. - Behl
