How to efficiently convert from two __m128d to one __m128i in MSVC?

Is converting then shifting then bitwise-or'ing the only way to convert from two __m128d to a single __m128i?

This is perfectly acceptable to Xcode in an x64 build:

__m128d v2dHi = ....
__m128d v2dLo = ....
__m128i v4i = _mm_set_epi64(_mm_cvtpd_pi32(v2dHi), _mm_cvtpd_pi32(v2dLo));

and the disassembly shows _mm_cvtpd_pi32 being used. However, Visual Studio cannot build this and reports a linker error. That matches the VS docs, which say _mm_cvtpd_pi32 is not supported on x64.

I'm not too worried that it's not available, but is two conversions, a shift, then a bitwise-or the fastest way?
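
For concreteness, the convert/shift/OR approach could look roughly like the sketch below. The function name pack_shift_or is hypothetical and only there for illustration; the sketch relies on _mm_cvtpd_epi32 zeroing the upper 64 bits of its result, so the OR cannot clobber anything.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical name: two conversions, one byte shift, one OR. */
__m128i pack_shift_or(__m128d lo, __m128d hi) {
    __m128i ilo = _mm_cvtpd_epi32(lo);   /* [lo0, lo1, 0, 0] */
    __m128i ihi = _mm_cvtpd_epi32(hi);   /* [hi0, hi1, 0, 0] */
    ihi = _mm_slli_si128(ihi, 8);        /* shift left by 8 bytes: [0, 0, hi0, hi1] */
    return _mm_or_si128(ilo, ihi);       /* [lo0, lo1, hi0, hi1] */
}
```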

Wilbertwilborn answered 15/9, 2016 at 4:24

If you got a linker error, you're probably ignoring a warning about an undeclared intrinsic function.

Your current code has a high risk of compiling to terrible asm. If it compiled to a vector-shift and an OR, it already is compiling to sub-optimal code. (Update: that's not what it compiles to, IDK where you got that idea.)

Use 2x _mm_cvtpd_epi32 to get two __m128i vectors with the ints you want in the low 2 elements of each. Then use _mm_unpacklo_epi64 to combine those two low halves into one vector with all 4 elements you want.


Compiler output from clang 3.8.1 on the Godbolt compiler explorer (Xcode uses clang by default, I think):

#include <immintrin.h>

// the good version
__m128i pack_double_to_int(__m128d a, __m128d b) {
    return _mm_unpacklo_epi64(_mm_cvtpd_epi32(a), _mm_cvtpd_epi32(b));
}
    cvtpd2dq        xmm0, xmm0
    cvtpd2dq        xmm1, xmm1
    punpcklqdq      xmm0, xmm1      # xmm0 = xmm0[0],xmm1[0]
    ret

// the original
__m128i pack_double_to_int_badMMX(__m128d a, __m128d b) {
    return _mm_set_epi64(_mm_cvtpd_pi32(b), _mm_cvtpd_pi32(a));
}
    cvtpd2pi        mm0, xmm1
    cvtpd2pi        mm1, xmm0
    movq2dq         xmm1, mm0
    movq2dq         xmm0, mm1
    punpcklqdq      xmm0, xmm1      # xmm0 = xmm0[0],xmm1[0]
      # note the lack of EMMS, because of not using the intrinsic for it
    ret
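
As a quick check of the element order the good version produces, here is a minimal sketch. pack_to_array is a hypothetical helper added only for this illustration; the conversion rounds according to the current MXCSR rounding mode, which is assumed here to be the default.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtpd_epi32, _mm_unpacklo_epi64 */
#include <stdint.h>

/* Same body as the good version above. */
__m128i pack_double_to_int(__m128d a, __m128d b) {
    return _mm_unpacklo_epi64(_mm_cvtpd_epi32(a), _mm_cvtpd_epi32(b));
}

/* Hypothetical helper: converts {a0, a1} and {b0, b1} and stores the
   four int32 results in out[0..3]. */
void pack_to_array(double a0, double a1, double b0, double b1, int32_t out[4]) {
    __m128d a = _mm_set_pd(a1, a0);   /* _mm_set_pd takes the high element first */
    __m128d b = _mm_set_pd(b1, b0);
    _mm_storeu_si128((__m128i *)out, pack_double_to_int(a, b));
    /* inputs {1.0, 2.0} and {3.0, 4.0} give out = {1, 2, 3, 4} */
}
```

The elements of a land in the low half and the elements of b in the high half, so pass the low pair first.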

MMX is almost totally useless when SSE2 or later is available; just avoid it. See the tag wiki for some guides.

Cornstarch answered 15/9, 2016 at 4:32 Comment(2)
Xcode didn't optimise it away. The disassembly shows _mm_cvtpd_pi32 being used, and _mm_set_epi64 is just using mov to store the values. Wilbertwilborn
Yep, it works: _mm_unpacklo_epi64(_mm_cvtpd_epi32(v2dLo), _mm_cvtpd_epi32(v2dHi)) Wilbertwilborn
