Using AVX instructions disables exp() optimization?

I am writing a feed-forward net in VC++ using AVX intrinsics, and I invoke this code from C# via P/Invoke. A function that runs a large loop calling exp() takes ~1000 ms for a loop size of 160M. As soon as I call any function that uses AVX intrinsics and then call exp() again, the same operation takes ~8000 ms. Note that the function calling exp() is standard C, and the call that uses the AVX intrinsics can be completely unrelated in terms of the data being processed. Some kind of flag seems to be getting tripped somewhere at runtime.

In other words,

A(); // 1000ms calculates 160M exp() 
B(); // completely unrelated but contains AVX
A(); // 8000ms

or, curiously,

C(); // contains 128-bit SSE SIMD expressions
A(); // 1000ms

I am lost as to what mechanism could be at work here, or how to pursue a solution. I'm on an Intel 2500K CPU with Windows 7, using the Express versions of VS.

Thanks.

Chiao answered 1/5, 2011 at 23:30 Comment(0)

If you use any 256-bit AVX instruction, the "AVX upper state" becomes "dirty", which causes a large stall if you subsequently execute SSE instructions (including scalar floating-point operations performed in the xmm registers). This is documented in the Intel Optimization Manual, which you can download for free (and which is a must-read if you're doing this sort of work):

AVX instruction always modifies the upper bits of YMM registers and SSE instructions do not modify the upper bits. From a hardware perspective, the upper bits of the YMM register collection can be considered to be in one of three states:

• Clean: All upper bits of YMM are zero. This is the state when processor starts from RESET.

• Modified and saved to XSAVE region: The content of the upper bits of YMM registers matches saved data in XSAVE region. This happens when after XSAVE/XRSTOR executes.

• Modified and Unsaved: The execution of one AVX instruction (either 256-bit or 128-bit) modifies the upper bits of the destination YMM.

The AVX/SSE transition penalty applies whenever the processor states is “Modified and Unsaved“. Using VZEROUPPER move the processor states to “Clean“ and avoid the transition penalty.

Your routine B( ) dirties the YMM state, so the SSE code in A( ) stalls. Insert a VZEROUPPER instruction between B and A to avoid the problem.
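With intrinsics, VZEROUPPER is available as _mm256_zeroupper() from <immintrin.h>. The following is only a minimal sketch of how a routine like B() might clean up before returning; scale_avx is a hypothetical stand-in for the real AVX code, and the AVX_FN macro is an assumption that lets GCC or Clang build the function without -mavx (MSVC does not need it).

```c
#include <immintrin.h>  /* AVX intrinsics, including _mm256_zeroupper() */
#include <stddef.h>

/* Assumption: on GCC/Clang, enable AVX for this one function only. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical stand-in for B(): multiplies n floats by factor
   using 256-bit AVX, eight elements per iteration. */
AVX_FN void scale_avx(float *data, size_t n, float factor)
{
    __m256 f = _mm256_set1_ps(factor);
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);             /* unaligned load  */
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, f));  /* unaligned store */
    }
    for (; i < n; ++i)
        data[i] *= factor;                                /* scalar tail */

    /* Emit VZEROUPPER: zero the upper halves of all YMM registers so the
       state is "Clean" before control returns to SSE code such as exp(). */
    _mm256_zeroupper();
}
```

Putting the call at the end of every routine that uses 256-bit AVX means callers such as A() do not have to know that AVX was used at all.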

Lombardy answered 4/5, 2011 at 23:7 Comment(4)
I'll be goshdarned. It worked. I take it this means that exp() is using 128-bit SSE, not 256-bit code. I'm not familiar enough to know whether that's something that can be conveniently converted. - Chiao
@AronMiller: Happy to help. Make sure you use VZEROUPPER any time you've used AVX and are passing control to code that you don't own. And be sure to file a bug against your compiler to try to get them to insert it for you in those cases. - Lombardy
There are direct translations from SSE to AVX128, documented in the instruction reference manuals. I think that ICC can do the conversion for you, but I don't know of any other compilers doing this yet. - Lombardy
Thanks for the document reference as well. - Chiao