How to control whether C math uses SSE2?

Asked 9/3, 2013 at 18:32 Answered 9/3, 2013 at 20:46

Solved c++visual-c++floating-point sse deterministic

I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern, here's what happens for sin.

First there is a dispatch routine from a file called "disp_pentium4.inc". It checks if the variable ___use_sse2_mathfcns has been set; if so, calls __sin_pentium4, otherwise calls __sin_default.

__sin_pentium4 (in "sin_pentium4.asm") starts by transferring the argument from the x87 fpu to the xmm0 register, performs the calculation using SSE2 instructions, and loads the result back in the fpu.

__sin_default (in "sin.asm") keeps the variable on the x87 stack and simply calls fsin.

So in both cases, the operand is pushed on the x87 stack and returned on it as well, making it transparent to the caller, but if ___use_sse2_mathfcns is defined, the operation is actually performed in SSE2 rather than x87.

This behavior is very interesting to me because the x87 transcendental functions are notorious for having slightly different behaviors depending on the implementation, whereas a given piece of SSE2 code should always give reproducible results.

Is there a way to determine for certain, either at compile or run-time, that the SSE2 code path will be used? I am not proficient writing assembly, so if this involves writing any assembly, a code example would be appreciated.

Scapular answered 9/3, 2013 at 18:32 Comment(4)

What directory are these files located in? – Perorate 9/3, 2013 at 18:35

"f:\dd\vctools\crt_bld\SELF_X86\crt\prebuild\tran\i386\" - this is just what I see in the dissassembly, I don't have the files themselves. – Scapular 9/3, 2013 at 18:36

Nah, impossible, a decent question on SO... I feel an embarassing urge to upvote. – Aurelia 9/3, 2013 at 18:37

If this is based on a variable called ___use_sse2_mathfcns, can't you test that same variable? (you may need to declare it, and I don't know how private it is) – Sorbose 9/3, 2013 at 18:46

I found the answer through careful investigation of math.h. This is controlled by a method called _set_SSE2_enable. This is a public symbol documented here:

Enables or disables the use of Streaming SIMD Extensions 2 (SSE2) instructions in CRT math routines. (This function is not available on x64 architectures because SSE2 is enabled by default.)

This causes the aforementionned ___use_sse2_mathfcns flag to be set to the provided value, effectively enabling or disabling use of the _pentium4 SSE2 routines.

The documentation mentions this affects only certain transcendental functions, but looking at the disassembly, this seems to affect everyone of them.

Edit: stepping into every function reveals that they're all available in SSE2 except for the following:

fmod
sinh
cosh
tanh
sqrt

Sqrt is the biggest offender, but it's trivial to implement in SSE2 using intrinsics. For the others, there's no simple solution except perhaps using a third-party library, but I can probably do without.

Scapular answered 9/3, 2013 at 20:46 Comment(1)

5+ years after this question was asked, I was investigating why the SSE math functions were not called in debug builds with VS2015.3. Then found this: "Some standard C/C++ library functions are available in intrinsic implementations on some architectures. When calling a CRT function, the intrinsic implementation is used if /Oi is specified on the command line." (from learn.microsoft.com/en-us/cpp/intrinsics/…) – Shalna 6/9, 2018 at 18:14

Why not use your own library instead of the C runtime? This would provide an even stronger guarantee of consistency across computers (presumably the C runtime is provided as a DLL and might change slightly in time).

I would recommend CRlibm. If you are already targeting SSE2, and as long as you did not intend to change the FPU's rounding mode, you are in the ideal conditions to use it, and you won't find a more accurate implementation.

Napiform answered 9/3, 2013 at 18:52 Comment(10)

I actually need to set the precision and rounding mode of the FPU: 53-bits and round-to-nearest. This shouldn't have any incidence on CRlibm routines if they are implemented in SSE2 though. – Scapular 9/3, 2013 at 18:56

@Dr_Asik The emulation mode of the historical floating-point stack instruction is not perfect: with when set for 53-bit significand, the exponent keeps its full extended range. In particular it is tricky to avoid double-rounding denormals. Nevertheless, CRlibm is designed to work even when targeting the x87 as long as it is set in the emulation mode that you refer to: lipforge.ens-lyon.fr/www/crlibm/start.html – Napiform 9/3, 2013 at 18:59

+1 for CRLibM. In addition to guaranteed perfect accuracy it is usually faster than GNU LibM and Cephes C. In case you want to trade some accuracy for performance have a look at FDLibM. – Hammering 9/3, 2013 at 19:7

@PascalCuoq are you saying that setting x87 for 53-bit precision isn't enough to guarantee reproducible results for non-transcendental functions (arithmetic)? I know some multiplayer games use only floats and set x87 to 24-bit precision, and manage to get consistency this way. Are the issues similar with 24-bit and 53-bit precision? What do you base this statement on? – Scapular 9/3, 2013 at 22:7

@Dr_Asik I am saying that 53-bit-configured x87 is different from IEEE 754 double-precision (what SSE2 implements). I base my statement on arxiv.org/abs/cs/0701192 but really, anyone can reproduce the issue and verify it for themselves. It is discussed p16. 24-bit precision is easier to make reproducible because of theorems shown in dl.acm.org/citation.cfm?id=221334 but you still need to write code that checks for intermediate overflows/underflows or hope that the compiler will do it for you. – Napiform 9/3, 2013 at 22:17

@Dr_Asik That second paper is not online but the PhD report is just as good: cs.nyu.edu/web/Research/Theses/figueroa_sam.pdf (p50) – Napiform 9/3, 2013 at 22:36

@PascalCuoq Thanks for the references. My project is .NET 32-bit, which only emits x87 code; I was hoping to keep the arithmetic as is and delegate only the transcendental functions to native code compiled for SSE2. Looks like x87 just sucks beyond belief and I'll have to delegate everything. Sigh. – Scapular 10/3, 2013 at 2:34

@PascalCuoq: The page you cite in the thesis addresses double rounding, as would occur if a 64-bit significand were rounded to 53 bits. The x87 FPU has a precision control bit that limits the significand to 53 bits. Do you have reason to believe that the FPU then operates by rounding to 64-bit results and then rounding again to 53-bit results rather than performing a single 53-bit rounding? – Fulminant 10/3, 2013 at 3:13

@EricPostpischil There are two distinct issues. 1/ general double rounding when computing with 64-bit significand and using an option such as GCC's -ffloat-store to then round to 53-bit significand. 2/ When configuring the x87 for a narrower significand, double rounding of denormals caused by the fact that the x87 retains a full range for the exponent. Double-precision can never be emulated perfectly by the x87 because of 2: that double rounding of denormals is unescapable even if code is added to patch up under/overflows. – Napiform 10/3, 2013 at 8:49

@EricPostpischil Single-precision computations can be emulated perfectly with an x87, and this paradoxically requires NOT configuring the x87 for a narrower significand (which would cause double rounding for denormals) but leaving it alone and rounding each intermediate result to single precision (which is “innocuous” according to the PhD report I linked) – Napiform 10/3, 2013 at 8:51

The short answer is that you can't tell IN YOUR CODE for certain what the library will do, unless you are also involving library-implementation specific details. These would make the code completely unportable - even two different builds of the same compiler may change the internals of the library.

Of course, if portability isn't an issue, then using extern <type> ___use_sse2_mathfcns; and checking if it's the true would clearly work.

I expect that if the processor has SSE2 and you are using a modern enough library, it would use SSE2 wherever possible. But to say that for certain is a different matter.

If this is critical for your code, then implement your own transcendental functions and use those - that's the only way to guarantee the same result. Or, use some suitable inline assembler (or transcendental) code to calculate selected sin, cos, etc values, and compare those with the sin() and cos() functions provided by the library.

Tambourin answered 9/3, 2013 at 18:48 Comment(7)

extern int ___use_sse2_mathfcns gives me a linker error. I don't know where this is defined, it's just a name in the dissassembly. – Scapular 9/3, 2013 at 18:52

Stepping into sin() jumps to cmp dword ptr [___use_sse2_mathfcns (518BEEB8h)],0... I don't see any trace of this in the C headers either. – Scapular 9/3, 2013 at 19:7

It may be a private variable in that unit. Can you find what DLL it is in and see if it's exported? – Tambourin 9/3, 2013 at 19:31

msvcr110.dll (msvcr110d.dll in debug). Opening it with Dependency Walker, I don't see the symbol in question. – Scapular 9/3, 2013 at 19:44

In which case my original statement about "you can't really do this" applies. – Tambourin 9/3, 2013 at 19:44

Well, perhaps if I could determine under which conditions this variable is set, (for instance anytime SSE2 is supported), I could check for these conditions instead. – Scapular 9/3, 2013 at 19:52

I found a way - an officially supported one at that -, check out my answer :) – Scapular 9/3, 2013 at 20:47

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags