Converting a C++ project to x64 with __m64 references

Asked 7/9, 2015 at 21:12 Answered 7/9, 2015 at 21:33

When I started the conversion and set the target to 'x64', I got 7 unresolved externals. Two examples:

error LNK2001: unresolved external symbol _m_empty    ...CONVOLUTION_2D_USHORT.obj  CONVOLUTION_2D_USHORT
error LNK2001: unresolved external symbol _mm_setzero_si64  ...CONVOLUTION_2D_USHORT.obj    CONVOLUTION_2D_USHORT

I investigated these a bit deeper and found that the compiler doesn't like the __m64 type inside the header files, specifically mmintrin.h (there might be others). I haven't touched C++ in years (I'm usually in the C# department), so in my amateur hour I attempted to edit the header files and replace __m64 with __m128i. I don't know what the correct route is to get this and other DLLs to compile against MachineX64. After editing the header and putting a copy of its source in my local directory, it still doesn't let me compile via right-click (again, amateur hour). A few people have asked similar questions, but I couldn't find the right one for me.

Here is a sample of 'mmintrin.h' with unsupported __m64...

typedef union __declspec(intrin_type) _CRT_ALIGN(8) __m64
{
    unsigned __int64    m64_u64;
    float               m64_f32[2];
    __int8              m64_i8[8];
    __int16             m64_i16[4];
    __int32             m64_i32[2];
    __int64             m64_i64;
    unsigned __int8     m64_u8[8];
    unsigned __int16    m64_u16[4];
    unsigned __int32    m64_u32[2];
} __m64;

/* General support intrinsics */
void  _m_empty(void);
__m64 _m_from_int(int _I);
int   _m_to_int(__m64 _M);
__m64 _m_packsswb(__m64 _MM1, __m64 _MM2);
__m64 _m_packssdw(__m64 _MM1, __m64 _MM2);
__m64 _m_packuswb(__m64 _MM1, __m64 _MM2);
__m64 _m_punpckhbw(__m64 _MM1, __m64 _MM2);
__m64 _m_punpckhwd(__m64 _MM1, __m64 _MM2);
__m64 _m_punpckhdq(__m64 _MM1, __m64 _MM2);
__m64 _m_punpcklbw(__m64 _MM1, __m64 _MM2);
__m64 _m_punpcklwd(__m64 _MM1, __m64 _MM2);
__m64 _m_punpckldq(__m64 _MM1, __m64 _MM2);
...

Motorboat answered 7/9, 2015 at 21:12 Comment(2)

did you get 2 unsolved externals or 7? – Punctilio 7/9, 2015 at 21:20

7.. !drive.google.com/file/d/0B3qrpuwM39vmM1lGazR2WWhRamM/… – Motorboat 7/9, 2015 at 21:33

From the __m64 type documentation:

The __m64 data type is not supported on x64 processors. Applications that use __m64 as part of MMX intrinsics must be rewritten to use equivalent SSE and SSE2 intrinsics.

http://msdn.microsoft.com/en-us/library/08x3t697(v=vs.110).aspx

So it looks like you have three options: stick with 32 bits, port the MMX intrinsics to SSE, or fall back to a non-SIMD implementation (if you have one - if not then consider re-implementing in scalar code).

Incisure answered 7/9, 2015 at 21:33 Comment(19)

Port the MMX intrinsics to SSE... How to do that? Link... Is it step by step process of moving the references in the cpp, or is more involved? – Motorboat 7/9, 2015 at 21:38

@RobertKoernke: Intrinsics operate at a very low level of abstraction -- there exists no mapping that provides exact equivalents for all MMX operations. Are you aware that the higher abstraction level valarray is now standardized in C++? – Thibodeau 7/9, 2015 at 21:40

Any intrinsic that uses the __m64 type is MMX or 3DNow! and is therefore deprecated for x64 native. Hopefully the code base has a C++ implementation of the intrinsics-optimized functions that you could fall back on, but either way you have to reimplement the MMX code as SSE to be portable. The good news is that once you do that, it can still be built for x86 as well as x64 native. – Arjan 7/9, 2015 at 21:42

Looks like "c++ simd template" might be a good search phrase to find other libraries that don't require you to get down into the details of a single ISA (and rewrite every time your hardware gets upgraded, or rather replaced) – Thibodeau 7/9, 2015 at 21:45

_mm_empty(); __m64 accu, temp; __m64 shifter = _m_from_int(32); are all declared in the .cpp file. Are you saying I need to do something with 'valarray'? – Motorboat 7/9, 2015 at 21:45

@ChuckWalbourn: Rewriting as SSE is not "portable". – Thibodeau 7/9, 2015 at 21:45

It's portable in the sense that the same SSE intrinsics will compile for both x86 and x64 native (i.e. you don't need to maintain distinct codepaths for x86 vs. x64 native). SSE/SSE2 support is ubiquitous these days. These facts are used heavily for DirectXMath. Obviously they won't compile for other architectures like ARM or PPC, so intrinsics optimized code should have a standard C/C++ code path as a fallback maintained for improved portability. – Arjan 7/9, 2015 at 21:48

Ultimately, the two remaining choices that Paul offers above (other than reverting to 32 bits) are still over my head, unless someone can show me a better link that details what he is talking about. – Motorboat 7/9, 2015 at 21:58

@Robert - Depending on what your code does, perhaps using 64-bit intrinsics isn't all that useful when all the code is now 64 bit anyway? – Alkalimeter 7/9, 2015 at 22:11

@RobertKoernke: Use a compiler that can compile 64bit code that uses MMX (with intrinsics). For example, gcc has no problem with #include <mmintrin.h> void emms(void){ _m_empty(); } virtualdub.org/blog/pivot/entry.php?id=107 says MMX is usable in 64bit Windows applications. So if you're having a problem, it's your compiler's fault. Sometimes porting MMX code to SSE is easy, and will even give a speedup. Other times, it would mean you'd have to adapt the calling code to do two 8x8 SADs in parallel or something. (e.g. ffmpeg's mpdecimate filter) – Waxwork 8/9, 2015 at 3:45

I can find ways to rewrite my .cpp so that it uses the SSE2 header 'emmintrin.h' instead of 'mmintrin.h', but I can't resolve all of the intrinsics. Some are simple to spot, others not. Also, I don't know what I'm doing; you guys are still speaking over my head. E.g., calling _mm_setzero_si128 instead of _mm_setzero_si64 was simple, but there seems to be no equivalent of _m_empty or _m_from_int. What am I looking for? Again, I can't find a link that shows the conversion process. – Motorboat 8/9, 2015 at 14:59

@RobertKoernke: this is really not something you can do "blindly", without understanding the underlying operations - just converting all 64 bit operations to 128 bits will almost certainly not give correct results. It looks like the code in question is a 2D convolution for 16 bit unsigned values, which should be very simple to implement in normal (scalar) code. Check to see if there is a scalar implementation in the code already (there really should be), but if not then consider writing one to replace the MMX code. – Incisure 8/9, 2015 at 15:3
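A minimal sketch of what such a scalar fallback could look like for the inner multiply-and-sum step. The function name and signature are hypothetical (the original source is not shown here), values are read as signed 16-bit to mirror what pmaddwd does, and the total is assumed to fit in 32 bits:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original project): returns the sum of
// kernel[i] * pixels[i] over n taps, using plain scalar code.
// Assumption: the running total fits in a signed 32-bit integer.
std::int32_t dot_scalar(const std::int16_t* kernel,
                        const std::int16_t* pixels,
                        std::size_t n)
{
    assert(kernel != nullptr && pixels != nullptr);
    std::int32_t accu = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Widen to 32 bits before multiplying so the product cannot overflow 16 bits.
        accu += static_cast<std::int32_t>(kernel[i]) *
                static_cast<std::int32_t>(pixels[i]);
    }
    return accu;
}
```

Calling it once per output pixel, with pointers to the kernel and to the matching window of pixels, would give a reference result to compare any SIMD port against.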

Indeed it did compile but did not work, as you said. My learning curve for GCC, @Peter, seems even longer; I tried going that route for a bit. So I am definitely stuck. @Paul, I don't know how to convert to scalar.
In the for loop:
temp = _m_pmaddwd((__m64)lpKernel, (__m64)lpInnerPixels);
accu = _mm_add_pi32(accu, temp); // each double word has a partial sum
After the for loop:
// copy hi-dword of mm0 to lo-dword of mm1, then sum mm0+mm1
// and finally store the result into the variable "accu"
accu = _mm_add_pi32(accu, _mm_srl_si64(accu, shifter)); // combine results from
– Motorboat 8/9, 2015 at 20:19

I suggest you post the original MMX code in a new question and ask for help in porting it to scalar code and/or SSE - if you tag it [mmx] then I'll get a notification and will take a look at it tomorrow, and I'm sure others will also be interested. – Incisure 8/9, 2015 at 20:45

@RobertKoernke: looks like a really simple summing of a[i] * b[i], starting with pmaddwd to do a 16*16->32bit multiply and add adjacent pairs. Intel made the intrinsic names more verbose for SSE than for MMX, but you can see in their online intrinsics guide that MMX _m_pmaddwd and SSE _mm_madd_epi16 are the same instruction, but for xmm instead of mmx registers. The stuff after the loop does a horizontal sum of the elements of the vector accumulator. (Two 32bit ints, packed into a 64bit MMX register, in your case.) This is worth converting to SSE. Like Paul says, post a new Q if needed. – Waxwork 8/9, 2015 at 21:22
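A hedged sketch of how the multiply-add loop and horizontal sum described in this comment might look with SSE2. The function name and signature are hypothetical, inputs are treated as signed 16-bit (as pmaddwd does), unaligned loads are used so the pointers need no particular alignment, and the total is assumed to fit in 32 bits:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper (not from the original project): sum of
// kernel[i] * pixels[i] over n taps, eight taps per iteration.
std::int32_t dot_sse2(const std::int16_t* kernel,
                      const std::int16_t* pixels,
                      std::size_t n)
{
    assert(kernel != nullptr && pixels != nullptr);
    __m128i accu = _mm_setzero_si128();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        // Unaligned loads of eight 16-bit values from each input.
        __m128i k = _mm_loadu_si128(reinterpret_cast<const __m128i*>(kernel + i));
        __m128i p = _mm_loadu_si128(reinterpret_cast<const __m128i*>(pixels + i));
        // pmaddwd on xmm registers: 16x16->32-bit products, adjacent pairs added,
        // giving four 32-bit partial sums.
        accu = _mm_add_epi32(accu, _mm_madd_epi16(k, p));
    }
    // Horizontal sum: fold the upper 64 bits onto the lower, then the
    // second 32-bit lane onto the first.
    accu = _mm_add_epi32(accu, _mm_srli_si128(accu, 8));
    accu = _mm_add_epi32(accu, _mm_srli_si128(accu, 4));
    std::int32_t sum = _mm_cvtsi128_si32(accu);
    // Scalar tail for the remaining (n % 8) taps.
    for (; i < n; ++i) {
        sum += static_cast<std::int32_t>(kernel[i]) *
               static_cast<std::int32_t>(pixels[i]);
    }
    return sum;
}
```

Because the accumulator holds four 32-bit lanes instead of two, the horizontal sum needs two fold steps rather than the single shift-and-add in the MMX version.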

@RobertKoernke: There is no SSE _mm_empty. The EMMS instruction is needed between MMX code and x87 math that uses the FPU registers, because MMX just re-purposed the FPU registers instead of introducing new architectural state. SSE did introduce new state (the 128bit XMM registers), after Intel saw that MMX was effective and would be even more so with wider vectors. See stackoverflow.com/tags/sse/info. Also stackoverflow.com/tags/x86/info. If you find good resources that should be on one of those wikis, please edit them to put in a link. – Waxwork 8/9, 2015 at 21:28
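For the setup lines quoted earlier in the thread, a small sketch of SSE2 counterparts, assuming the shift count is only ever used with a 64-bit logical right shift. The helper name is hypothetical; _mm_cvtsi32_si128 and _mm_srl_epi64 are the xmm-register forms of the movd and psrlq instructions behind _m_from_int and _mm_srl_si64:

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: returns the upper 32 bits of a 64-bit value,
// written with the SSE2 counterparts of the MMX calls in the question.
std::uint32_t high_dword_sse2(std::uint64_t value)
{
    // was: __m64 shifter = _m_from_int(32);
    __m128i shifter = _mm_cvtsi32_si128(32);

    // Place the value in the low 64 bits of an xmm register
    // (built from two 32-bit halves so this also compiles for 32-bit x86).
    __m128i v = _mm_set_epi32(
        0, 0,
        static_cast<int>(static_cast<std::uint32_t>(value >> 32)),
        static_cast<int>(static_cast<std::uint32_t>(value & 0xFFFFFFFFu)));

    // was: _mm_srl_si64(accu, shifter)
    __m128i shifted = _mm_srl_epi64(v, shifter);

    // No _mm_empty() call here: XMM registers do not alias the x87 stack.
    return static_cast<std::uint32_t>(_mm_cvtsi128_si32(shifted));
}
```

This only illustrates the name mapping; as noted above in the thread, widening the data from 64 to 128 bits still requires checking each operation's semantics.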

Yeah, when I recompiled I just commented _mm_empty out, and thought that might be ok...hehe. @PeterCordes I did already find that madd in my conversion to SSE. But obviously something I did blew up. – Motorboat 8/9, 2015 at 22:4

@RobertKoernke: Ask a new question with the code you have so far. SSE memory accesses fault on unaligned addresses, unless you use one of the unaligned load intrinsics (_mm_loadu_si128). AVX reverts that decision, and allows unaligned addresses, except for the aligned load / store intrinsics. I'm still surprised there isn't a reasonable way to compile MMX intrinsics in a project that uses Visual C++ stuff. Maybe you could compile your MMX code with gcc or clang, and link that in to your Visual C++ project? You'd probably need to extern "C" the functions, because of different name mangling. – Waxwork 9/9, 2015 at 1:24

Added [link]#32479570 – Motorboat 9/9, 2015 at 12:3
