MinGW64 Is Incapable of 32 Byte Stack Alignment (Required for AVX on Windows x64), Easy Work Around or Switch Compilers?
Asked Answered
A

1

5

I'm trying to work with AVX instructions and windows 64bit. I'm comfortable with g++ compiler so I've been using that, however, there is a big bug described reported here and very rough solutions were presented here.

Basically, m256 variable can't be aligned on the stack to work properly with avx instructions, it needs 32 byte alignment.

The solutions presented at the other stack question I linked are really terrible, especially if you have performance in mind. A python program that you would have to run every time you want to debug that replaces instructions with their sub-optimal unaligned instructions, or over-allocating and doing a bunch of costly hacky pointer math in code to get proper alignment. If you do the pointer math solution, I think there is still even a chance for a seg fault because you can't control the allocation or r-values / temporaries.

I'm looking for an easier and cheaper solution. I don't mind switching compilers, would prefer not to, but if it's the best solution I will. However, my very poor understanding of the bug is that it is intrinsic to windows 64 bit, so would switching compilers help or do other compilers also have the same issue?

Alonzoaloof answered 19/6, 2015 at 0:53 Comment(16)
Doesn't MinGW-w64 have a 32-bit compilation option?Mesomorphic
An extra 32 bytes and some simple pointer math isn't costly when compared to anything you would need 256-bit AVX instructions for.Bouilli
@VermillionAzure 64-bit is pretty important to my applicationAlonzoaloof
@RossRidge: That's not really relevant to this question. The underlying problem is that it's not safe to use AVX instructions in mingw-w64, because it apparently can't align the stack to 32 bytes because it isn't supported by the Windows x64 ABI. Therefore, if you use a __m256 type and the compiler has to spill it onto the stack, you can end up with segmentation faults because it will try to use aligned instructions to move it to/from the stack. It seems like one fix on the compiler side would be to use unaligned moves in this case, but I don't know how feasible that change would be.Riddle
@JasonR You've apparently completely misunderstood what I wrote.Bouilli
@JasonR Even if you wrapped __m256 to get correct alignment with hacky code, the AVX intrinsics still return __m256, which means if you're doing code that requires the use of temporaries, theres always a chance the __m256 temporary would spill out of registers, onto the stack, and then the seg fault will happen, right? So this isn't even a real solutionAlonzoaloof
@RossRidge: I'm not sure what you meant, then. It sounds like you're advocating for some kind of manual implementation of the required alignment. Such hacks aren't really feasible in this case (as there are numerous manipulations of __m256 instances, like temporaries, that only the compiler has control over), but if I'm misunderstanding your recommendation, perhaps you could clarify it.Riddle
@Ragdoll: Exactly; there's no good solution to this problem achievable by just working around the issue in your source code. You would need some level of support at the compiler level to make this feasible. One potential solution would be for the compiler to emit unaligned move instructions when moving to/from the stack. That's essentially what the Python script that you linked does. Unfortunately, contemporary processors have a performance penalty for unaligned 256-bit moves (although 128-bit unaligned moves have been full-speed since the Nehalem architecture).Riddle
I assume this is undesirable for other reasons, but one obvious workaround would be to pass the affected variables by reference rather than by value.Thanasi
@JasonR I was only pointing out an inconsistency in the question. If he has a real use for AVX instructions then costs he was bitterly complaining about are insignificant. Indeed, pretty much those same costs will be paid by having the compiler align the stack automatically.Bouilli
@RossRidge: Agreed. If there was a robust way to implement the manual alignment then it would certainly be a viable solution for any proper application of SIMD instructions.Riddle
Both the Microsoft and Intel compilers manually align the stack at the start of each function call that uses AVX. Why GCC doesn't do this might be related to exception handling.Bile
@Bile Any idea where clang stands on the issue?Alonzoaloof
@JasonR, When you say That's not really relevant to this question. The underlying problem is that it's not safe to use AVX instructions in mingw-w64, because it apparently can't align the stack to 32 bytes because it isn't supported by the Windows x64 ABI. do you mean AVX isn't available with Windows? As it does? Also see Ross' answer - Despite what Kai Tietz said in the bug report you linked, Microsoft's x64 ABI does allow a compiler to give variables a greater than 16-byte alignment on the stack.Varistor
@Varistor Coming from the same GCC bug, MSVC and ICC for Windows don't actually align the stack itself. Instead, they clobber an extra register that points to an aligned portion on the stack. (r13 in the case of ICC.) All local variables (as well as spilled ymm/zmm values) that require >16-byte alignment are then placed in this section. This also has nothing to do with MSVC and ICC using unaligned load/stores. They do that for a completely different reason (they unconditionally use unaligned access for everything).Bile
@Mysticial, I really think your comment should go here - gcc.gnu.org/bugzilla/show_bug.cgi?id=54412 and here - github.com/Alexpux/MSYS2-packages/issues/1209 (Maybe also here - sourceforge.net/p/mingw-w64/mailman/message/34485783). People trying to fix it could use your knowledge (I'm not an expert in those area). Thank You.Varistor
B
4

You can solve this problem by switching to Microsoft's 64-bit C/C++ compiler. The problem is not intrinsic to 64-bit Windows. Despite what Kai Tietz said in the bug report you linked, Microsoft's x64 ABI does allow a compiler to give variables a greater than 16-byte alignment on the stack.

Also Cygwin's 64-bit version of GCC 4.9.2 can give variables 32-byte alignment on the stack.

Clang for Windows also makes working executables with AVX, and is a good choice in terms of optimizing well.

Bouilli answered 19/6, 2015 at 2:42 Comment(6)
guess its time to switch to visual studioAlonzoaloof
@Ragdoll I just checked Cygwin and it supports it as well.Bouilli
thats odd I thought cygwins gcc compiler would be the same, i didn't think cygwin modified the compiler just other things in the toolchainAlonzoaloof
Any chance they will solve it? Is Kai aware of what you're saying?Varistor
@Varistor Sorry, I have no idea. I don't follow MinGW development anymore and never followed MinGW-w64.Bouilli
@Royi: See also How to align stack at 32 byte boundary in GCC? for a Python script that can hack up the asm to change vmovaps to vmovups etc.Adorn

© 2022 - 2024 — McMap. All rights reserved.