Get member of __m128 by index?
Asked Answered
D

5

30

I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:

float vectorGetByIndex( __m128 V, unsigned int i )
{
    assert( i <= 3 );
    return V.m128_f32[i];
}

The error I get is as follows:

Member reference has base type '__m128' is not a structure or union.

I've looked around and found that Clang (and maybe GCC) has a problem with treating __m128 as a struct or union. However I haven't managed to find a straight answer as to how I can get these values back. I've tried using the subscript operator and couldn't do that, and I've glanced around the huge list of SSE intrinsics functions and haven't yet found an appropriate one.

Dorsal answered 27/9, 2012 at 15:6 Comment(5)
It's useful to know that the original intrinsic interface intentionally left out this functionality because there is no efficient way to do it efficiently in the hardware. Compilers (like MSVC) will provide extensions (like m128_f32) to do it. But it only masks the performance problem.Jaxartes
Yeah I understand that it completely removes the benefit of processing the vector at once - I think this particular function isn't intended for heavy use. Since I'm porting though, I'd like to avoid leaving anything unimplemented.Dorsal
@Jaxartes on gcc, when using -mfpmath=sse (and on msvc when generating 64 bit code), floating point values are in sse registers. You can return the least significant part of a vector efficiently using _mm_cvtss_f32(V), and other elements by first shuffling the desired value into the low element.Lawmaker
It's worth noting that with newer builds of clang, you can just do return V[i].Bedlamite
@Dorsal a note - It's pretty clear that there should be an [i] after the V.m128_f32 - since you say this works on MSVC. And that change obviously doesn't affect the clang error message, and the detail isn't really material to what you're asking. I've tried twice to submit this as an edit to the question, but most reviewers feel that I'm changing the intent of the question, so it's not happening.Senlac
O
21

A union is probably the most portable way to do this:

union {
    __m128 v;    // SSE 4 x float vector
    float a[4];  // scalar array of 4 floats
} U;

float vectorGetByIndex(__m128 V, unsigned int i)
{
    U u;

    assert(i <= 3);
    u.v = V;
    return u.a[i];
}
Oratorian answered 27/9, 2012 at 15:16 Comment(10)
MSDN says you should not do this, leaving access via load and set operators.Devisor
@Steve-o: gcc says otherwise: gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/…Martinmartina
I've seen a lot of contradictory opinions about this approach, but it looks like it's the only viable option on Mac. Thanks!Dorsal
It's certainly the most portable way, IME - it works on Mac OS X, Linux, Windows, with gcc, clang, ICC, MSVC, etc.Oratorian
@hirschhornsalz It's not clear to me that u.a[i] really fits the first "supported" example at your link, and not the second "might not work" example -- since u.a produces a float * and then the dereferencing is done separately by [i]. I use this technique myself, but I find it pretty annoying that there seems to be no 'official' way to do this kind of type punning (even this here is something that GCC says will work; it's not guaranteed by the language). Is it necessary to actually memcpy one struct to another?Senlac
I'd also add that the use of a union for type-punning is often elided by gcc, so that the implied store & load are not done - in cases where both views of the union fit in a register. So this makes it the most efficient technique for many type puns (in addition to being the safest). But in this particular example, that isn't going to happen since there's no opcode can that extract element [i] from an sse register.Senlac
The union is not well defined in C++; see Accessing inactive union member and undefined behavior? GCC and SunCC have produced bad results with it on occasion. GCC under Aarch64 and SunCC on x86_64.Whoosh
@jww: typically any compiler which supports SIMD intrinsics as an extension has to also support the use of unions like this. I believe it's allowed in C99 and later, so it's presumably not too hard to make it work properly in C++ also.Oratorian
Fundamentally, if this does work, the compiler is doing the heavy lifting and under the hood you will be getting load / store operations simply by accessing the component. It's better to be explicit about when you do the load / store operations because especially on platforms with poor memory bandwidth, this can cause a lot of lost time. And isn't the point of using vectors to optimize?Charinile
@johnb003: compilers are getting surprisingly good at optimising this kind of pattern, but from a performance perspective if you’re doing this kind of thing inside a performance-critical loop then there is probably a better way - typically you only want to access individual elements after a loop, e.g. when performing a reduction operation.Oratorian
E
22

As a modification to hirschhornsalz's solution, if i is a compile-time constant, you could avoid the union path entirely by using a shuffle:

template<unsigned i>
float vectorGetByIndex( __m128 V)
{
    // shuffle V so that the element that you want is moved to the least-
    // significant element of the vector (V[0])
    V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
    // return the value in V[0]
    return _mm_cvtss_f32(V);
}

A scalar float is just the bottom element of an XMM register, and the upper elements are allowed to be non-zero; _mm_cvtss_f32 is free and will compile to zero instructions. This will inline as just a shufps (or nothing for i==0).

Compilers are smart enough to optimize away the shuffle for i==0 (except for long-obsolete ICC13) so no need for an if (i). https://godbolt.org/z/K154Pe. clang's shuffle optimizer will compile vectorGetByIndex<2> into movhlps xmm0, xmm0 which is 1 byte shorter than shufps and produces the same low element. You could manually do this with switch/case for other compilers since i is a compile-time constant, but 1 byte of code size in the few places you use this while manually vectorizing is pretty trivial.


Note that SSE4.1 _mm_extract_epi32(V, i); is not a useful shuffle here: extractps r/m32, xmm, imm can only extract the FP bit-pattern to an integer register or memory (https://www.felixcloutier.com/x86/extractps). (And the intrinsic returns it as an int, so it would actually compile to extractps + cvtsi2ss to do int->float conversion on the FP bit-pattern, unless you type-pun it in your C++ code. But then you'd expect it to compile to extractps eax, xmm0, i / movd xmm0, eax which is terrible vs. shufps.)

The only case where extractps would be useful is if the compiler wanted to store this result straight to memory, and fold the store into the extract instruction. (For i!=0, otherwise it would use movss). To leave the result in an XMM register as a scalar float, shufps is good.

(SSE4.1 insertps would be usable but unnecessary: it makes it possible to zero other elements while taking an arbitrary source element.)

Emaemaciate answered 27/9, 2012 at 16:46 Comment(3)
This should use _mm_cvtss_f32 instead of _mm_store_ss.Fu
@Dan: Good point; I missed that intrinsic when I wrote the answer. It makes it more compact, for sure.Emaemaciate
_mm_extract_epi32 extracts the FP bit-pattern as an integer (in a GP register!), and then converts that integer to float. _mm_extract_ps (SSE4.1 extractps) does the same thing. Just remove the __SSE4_1__ part and this function will be pretty good.Bettinabettine
O
21

A union is probably the most portable way to do this:

union {
    __m128 v;    // SSE 4 x float vector
    float a[4];  // scalar array of 4 floats
} U;

float vectorGetByIndex(__m128 V, unsigned int i)
{
    U u;

    assert(i <= 3);
    u.v = V;
    return u.a[i];
}
Oratorian answered 27/9, 2012 at 15:16 Comment(10)
MSDN says you should not do this, leaving access via load and set operators.Devisor
@Steve-o: gcc says otherwise: gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/…Martinmartina
I've seen a lot of contradictory opinions about this approach, but it looks like it's the only viable option on Mac. Thanks!Dorsal
It's certainly the most portable way, IME - it works on Mac OS X, Linux, Windows, with gcc, clang, ICC, MSVC, etc.Oratorian
@hirschhornsalz It's not clear to me that u.a[i] really fits the first "supported" example at your link, and not the second "might not work" example -- since u.a produces a float * and then the dereferencing is done separately by [i]. I use this technique myself, but I find it pretty annoying that there seems to be no 'official' way to do this kind of type punning (even this here is something that GCC says will work; it's not guaranteed by the language). Is it necessary to actually memcpy one struct to another?Senlac
I'd also add that the use of a union for type-punning is often elided by gcc, so that the implied store & load are not done - in cases where both views of the union fit in a register. So this makes it the most efficient technique for many type puns (in addition to being the safest). But in this particular example, that isn't going to happen since there's no opcode can that extract element [i] from an sse register.Senlac
The union is not well defined in C++; see Accessing inactive union member and undefined behavior? GCC and SunCC have produced bad results with it on occasion. GCC under Aarch64 and SunCC on x86_64.Whoosh
@jww: typically any compiler which supports SIMD intrinsics as an extension has to also support the use of unions like this. I believe it's allowed in C99 and later, so it's presumably not too hard to make it work properly in C++ also.Oratorian
Fundamentally, if this does work, the compiler is doing the heavy lifting and under the hood you will be getting load / store operations simply by accessing the component. It's better to be explicit about when you do the load / store operations because especially on platforms with poor memory bandwidth, this can cause a lot of lost time. And isn't the point of using vectors to optimize?Charinile
@johnb003: compilers are getting surprisingly good at optimising this kind of pattern, but from a performance perspective if you’re doing this kind of thing inside a performance-critical loop then there is probably a better way - typically you only want to access individual elements after a loop, e.g. when performing a reduction operation.Oratorian
M
18

Use

template<unsigned i>
float vectorGetByIndex( __m128 V) {
    union {
        __m128 v;    
        float a[4];  
    } converter;
    converter.v = V;
    return converter.a[i];
}

which will work regardless of the available instruction set.

Note: Even if SSE4.1 is available and i is a compile time constant, you can't use pextract etc. this way, because these instructions extract a 32-bit integer, not a float:

// broken code starts here
template<unsigned i>
float vectorGetByIndex( __m128 V) {
    return _mm_extract_epi32(V, i);
}
// broken code ends here

I don't delete it because it is a useful reminder how to not do things.

Martinmartina answered 27/9, 2012 at 15:47 Comment(7)
This is wrong. 1) _mm_extract_epi32 takes __m128i as its first parameter, you are passing __m128 - the code won't compile. 2) if you solved that with _mm_castps_si128, _mm_extract_epi32 returns the raw floating point value as an integer in a general register (such as eax). 3) That value would be the int converted to a float: 1.0f == 0x3F800000 = 1,065,353,216. For 1.0f your code will return 1.06535e+09 on SSE4. 4) Even if you solved that with casts and reinterpreted dereference, it would be inefficient. Your code uses the wrong intrinsics. Use _mm_shuffle_ps and _mm_cvtss_f32.Lawmaker
@Lawmaker This code works and compiles actually fine. Do you use MSVC?Martinmartina
no, I use gcc, I run linux. I even tested it, your code doesn't compile. You probably aren't passing -msse4 to gcc so the #ifdef is always false. I tested it here I had to change the _mm_extract_epi32 to _mm_extract_ps for it to compile. Extract gets a value into a general (integer) register with the movd instruction.Lawmaker
You are right, _mm_extract_epi32 does not work for floats - I had taken the code from a similar integer routine without further checking the #ifdef __SSE4_1__ case. Thank you for your remarks.Martinmartina
No worries, SSE intrinsics are very error prone. We all make those mistakes.Lawmaker
The "better" way with the union is not well defined in C++; see Accessing inactive union member and undefined behavior?. GCC and SunCC have produced bad results with it on occasion. GCC under Aarch64 and SunCC on x86_64.Whoosh
@jww: does SunCC support Intel Intrinsics? I was under the impression that every compiler that supported __m128 defined the behaviour of union type-punning, C99-style. GNU C++ and GNU C89 do define the behaviour of this use of unions, so if your testing was having a problem on GCC then you were probably doing it wrong.Bettinabettine
C
3

The way I use is

union vec { __m128 sse, float f[4] };

float accessmember(__m128 v, int index)
{
    vec v.sse = v;
    return v.f[index];
}

Seems to work out pretty well for me.

Casmey answered 10/7, 2013 at 4:16 Comment(3)
The union is not well defined in C++; see Accessing inactive union member and undefined behavior?. GCC and SunCC have produced bad results with it on occasion. GCC under Aarch64 and SunCC on x86_64.Whoosh
@jww: GNU C++ does define the behaviour of union type-punning in C++ to match C99. If this failed with GCC, it was either a compiler bug or due to other undefined behaviour elsewhere in the program. gcc.gnu.org/onlinedocs/gcc/…. (Applies to GNU C++ as per the "Some choices are documented in the corresponding document for the C language. See C Implementation." wording in gcc.gnu.org/onlinedocs/gcc/C_002b_002b-Implementation.html.Bettinabettine
This is only GNU C++ of course, not ISO C++, and in GNU / clang implementations of __m128, you can simply index the __m128 itself, like v[index], because it's defined as a GNU C native vector like typedef float __m128 __attribute((vector_size(16),may_alias));. Of course the union hack is portable to some other C++ compilers like MSVC that define __m128 differently. So it can be useful, but need to be careful to mention that it's not strictly portable to all C++ implementations that support Intel intriniscs.Bettinabettine
S
1

Late to this party but found that this works for me in MSVC where z is a variable of type __m128.

#define _mm_extract_f32(v, i)       _mm_cvtss_f32(_mm_shuffle_ps(v, v, i))

__m128 z = _mm_setr_ps(1.0, 2.0, 3.0, 4.0);

float f = _mm_extract_f32(z, 2);

OR even simpler

__m128 z;

float f = z.m128_f32[2];  // to get the 3rd float value in the vector
Slimy answered 7/1, 2021 at 21:19 Comment(1)
Yes, MSVC defines __m128 types as a union of various element sizes. This is completely non-portable to other compilers like GCC or clang, so I'd recommend against it unless you wrap it in a helper function you can #ifdef later for portability. i.e. do not scatter .m128_f32 access throughout your code. Defining a union of __m128 and float[4] will work on MSVC and some (but not all) other compilers, see @Jordan's answer and comments. Or just _mm_store_ps to a tmp array and reload, letting the compiler optimize it into a shuffle if it wants.Bettinabettine

© 2022 - 2024 — McMap. All rights reserved.